jmbreda / Sanity

Filtering of Poisson noise on a single-cell RNA-seq UMI count matrix
GNU General Public License v3.0

Identification of genes and cells from .mtx file #23

Open kaushik-roy-physics opened 6 months ago

kaushik-roy-physics commented 6 months ago

I would like to understand how Sanity reads the .mtx file generated by CellRanger and prints the number of rows, genes and cells. For example, I have a matrix.mtx file generated by CellRanger which has 3810 columns (cells) and 36613 rows (features). These numbers match the dimensions of the barcodes.tsv and features.tsv files, also generated by CellRanger. However, of the 36613 features, 36601 have the annotation 'Gene Expression', which I see from the features.tsv file. The remaining 12 are features annotated 'Multiplexing Capture', which is expected since we requested multiplexing.
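As a rough way to check the feature-type breakdown described here, the following Python sketch tallies the feature-type column of a features.tsv. It assumes the usual CellRanger layout of three tab-separated columns (feature ID, feature name, feature type); the function name `count_feature_types` and the example rows are made up for illustration and are not part of Sanity or CellRanger.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally the third tab-separated column (assumed to be the feature type)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # skip malformed or short lines
            counts[fields[2]] += 1
    return counts


# Hypothetical rows standing in for a real features.tsv.
example = [
    "GENE_ID_1\tGENE_A\tGene Expression\n",
    "GENE_ID_2\tGENE_B\tGene Expression\n",
    "TAG_ID_1\tTAG_1\tMultiplexing Capture\n",
]

for feature_type, n in count_feature_types(example).items():
    print(f"{feature_type}: {n}")
```

For a real file, passing an open file handle (for example `open("features.tsv")`, or `gzip.open(..., "rt")` for a compressed copy) in place of `example` should work the same way.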

When I run Sanity on the .mtx file and specify the paths for the barcodes and features files, I get the following readout in the beginning:

File type : mtx
There were 36613 rows
There were 30875 genes and 3810 cells

....

How is the number of genes (30875) determined by Sanity? Can you please clarify a little bit? Are there any thresholds or filters that are already provided in the code?

Thanks, Kaushik

jmbreda commented 6 months ago

Hi Kaushik,

Thanks for your interest in Sanity and your issue.

I think that the number of rows in the features.tsv file corresponds to the number of genes in the reference transcriptome/GTF file, whereas Sanity reports the number of expressed genes (i.e. genes with at least 1 UMI count in at least 1 cell). This avoids storing and printing rows with 0 counts across all cells.
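To make the rule concrete, here is a minimal Python sketch of the counting described above: a row counts as "expressed" if it has at least one positive entry. This is not Sanity's own code; the function name `count_expressed_rows` and the toy matrix are assumptions for illustration, and the parser only handles the plain MatrixMarket coordinate layout (comment lines starting with `%`, one size line, then `row column value` entries) with rows as features and columns as cells. If the explanation holds for the numbers in this thread, 36613 - 30875 = 5738 rows would have zero counts in every cell.

```python
def count_expressed_rows(mtx_lines):
    """Return (n_rows, n_cols, n_expressed) for MatrixMarket coordinate lines.

    A row is counted as expressed if it has at least one entry > 0.
    """
    dims = None
    expressed = set()
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if dims is None:
            # First non-comment line: n_rows n_cols n_nonzero_entries
            dims = (int(fields[0]), int(fields[1]))
            continue
        row, value = int(fields[0]), float(fields[2])
        if value > 0:
            expressed.add(row)
    if dims is None:
        raise ValueError("no size line found")
    return dims[0], dims[1], len(expressed)


# Toy matrix: 4 features x 3 cells; feature 2 has no counts at all.
example = """%%MatrixMarket matrix coordinate integer general
%
4 3 4
1 1 5
1 3 2
3 2 1
4 2 7
""".splitlines()

n_rows, n_cols, n_expressed = count_expressed_rows(example)
print(f"There were {n_rows} rows")
print(f"There were {n_expressed} expressed rows and {n_cols} cells")
```

Running the same function on the lines of a real matrix.mtx would show how many feature rows have no counts, which can then be compared with the gene number Sanity prints.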

I hope this clarifies the way Sanity reads .mtx files.

Best, Jeremie