parashardhapola / scarf

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
http://scarf.readthedocs.io
BSD 3-Clause "New" or "Revised" License
92 stars 12 forks source link

ATAC-seq recognized as RNA-seq? #64

Closed arita-clavy closed 2 years ago

arita-clavy commented 2 years ago

Hi staffs of scarf, I downloaded Cusanovich's 81K sc-ATAC-seq dataset using your fetch_dataset(), but when I was importing the dataset, this happened:

reader = scarf.CrDirReader(temp_path + '/cusanovich_81K_mouse_atacseq',mtx_separator='/t')
reader.assayFeats

RNA
type    Gene Expression
start   0
end 167013
nFeatures   167013

somehow it was auto-recognized as a RNA-seq assay, since file_type is deprecated I cannot either change the assay type or go on my downstream analysis. Any possible solutions?

parashardhapola commented 2 years ago

Hi,

This issue is due to fact that the files in this dataset are deposited in a way that is similar to RNA-Seq data. For ATAC-Seq, Scarf expects a 'peaks.tsv.gz' rather than 'features.tsv.gz'. We will fix this issue in the data repository.

For now, the quick solution for you is to do the following:

 # Please note that the separator is a space and not a tab
reader = scarf.CrDirReader('cusanovich_81K_mouse_atacseq', mtx_separator=' ')

# This is how we can rename the assay
reader.rename_assays({'RNA': 'ATAC'}) 

# The data in this file is prenormalized and is present as float values, hence adjust dtype
writer = scarf.CrToZarr(reader, 'cusanovich_81K.zarr', chunk_size=(1000, 10000), dtype=float)

writer.dump()  # This will take upto 5 mins

# Load the data as a datastore object
ds = scarf.DataStore('cusanovich_81K.zarr', nthreads=4)
ds

# Turn off the default TF-IDF normalization because the data is already normalized (you will need to do this every time the data is loaded)
ds.ATAC.normMethod = scarf.assay.norm_dummy

# Thereafter, you can run a basic pipeline with default parameters like this:
ds.mark_prevalent_peaks()
ds.make_graph(feat_key="prevalent_peaks")
ds.run_umap()
ds.run_leiden_clustering()
ds.plot_layout(layout_key="ATAC_UMAP", color_by="ATAC_leiden_cluster")

Please let me know if this helps

parashardhapola commented 2 years ago

Closing this for now. Please feel free to reopen this.