sci-dash incorrect information

AgedMordorBlue commented 10 months ago

Hi Job,

I'm getting some weird output in my sci-dash:

- The Total Input Reads column adds up to more than the Total input read-pairs.
- Many cell metrics (mean reads/cell, mean UMI/cell, etc.) are about 10 times lower than indicated in the STARsolo summary output.

I went back to the STARsolo summary file, and the values in the sci-dash don't match what is written there. The Summary stats themselves are much more in line with the sci-dash output of an earlier version of the pipeline. I've added the JSON and the STARsolo Summary.csv content of the same sample below.

Best, Yani

sci-dash JSON:

"sample_succes": {
"5mm_dsDNAse": {
  "n_pairs_success": 373470356,
  "sequencing_saturation": 0.751197,
  "estimated_cells": 5738,
  "total_mapped_reads": 131048010,
  "total_unique_reads": 111547480,
  "total_multimapped_reads": 19500530,
  "total_correct_reads_genes": 90306169,
  "total_exonic_reads": 41117200,
  "total_intronic_reads": 49189000,
  "total_intergenic_reads": 40741810,
  "total_mitochondrial_reads": 0,
  "total_exonicAS_reads": 2446128,
  "total_intronicAS_reads": 7610787,
  "mean_reads_per_cell": 2414,
  "mean_genes_per_cell": 210,
  "mean_umis_per_cell": 376

STARsolo Summary:

Number of Reads,185672402
Reads With Valid Barcodes,1
Sequencing Saturation,0.751197
Q30 Bases in CB+UMI,1
Q30 Bases in RNA read,0.93507
Reads Mapped to Genome: Unique+Multiple,0.705802
Reads Mapped to Genome: Unique,0.600776
Reads Mapped to GeneFull_Ex50pAS: Unique+Multiple GeneFull_Ex50pAS,0.486374
Reads Mapped to GeneFull_Ex50pAS: Unique GeneFull_Ex50pAS,0.441959
Estimated Number of Cells,5738
Unique Reads in Cells Mapped to GeneFull_Ex50pAS,71273128
Fraction of Unique Reads in Cells,0.868553
Mean Reads per Cell,12421
Median Reads per Cell,9947
UMIs in Cells,17618322
Mean UMI per Cell,3070
Median UMI per Cell,2504
Mean GeneFull_Ex50pAS per Cell,1596
Median GeneFull_Ex50pAS per Cell,1442
Total GeneFull_Ex50pAS Detected,20151

STARsolo summary of prior run (I think the switch from GeneFull to GeneFull_Ex50pAS explains the difference between versions):

Number of Reads,190841919
Reads With Valid Barcodes,1
Sequencing Saturation,0.730686
Q30 Bases in CB+UMI,1
Q30 Bases in RNA read,0.934432
Reads Mapped to Genome: Unique+Multiple,0.818547
Reads Mapped to Genome: Unique,0.676697
Reads Mapped to GeneFull: Unique+Multiple GeneFull,0.545113
Reads Mapped to GeneFull: Unique GeneFull,0.483755
Estimated Number of Cells,5758
Unique Reads in Cells Mapped to GeneFull,79965182
Fraction of Unique Reads in Cells,0.866167
Mean Reads per Cell,13887
Median Reads per Cell,11178
UMIs in Cells,21385715
Mean UMI per Cell,3714
Median UMI per Cell,3041
Mean GeneFull per Cell,1792
Median GeneFull per Cell,1629
Total GeneFull Detected,17842
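As a quick cross-check of the current-run STARsolo summary (the GeneFull_Ex50pAS one), the sketch below recomputes the per-cell means from the in-cell totals. It assumes, based only on the numbers in this sample, that the reported means are the in-cell totals divided by the estimated number of cells and truncated to an integer; the dictionary is hand-copied from the summary and is not read from any file.

```python
# Values copied by hand from the STARsolo Summary of the current run.
summary = {
    "Estimated Number of Cells": 5738,
    "Unique Reads in Cells Mapped to GeneFull_Ex50pAS": 71273128,
    "UMIs in Cells": 17618322,
    "Mean Reads per Cell": 12421,
    "Mean UMI per Cell": 3070,
}

cells = summary["Estimated Number of Cells"]

# Assumption: mean = in-cell total / number of cells, truncated to an integer.
recomputed_mean_reads = summary["Unique Reads in Cells Mapped to GeneFull_Ex50pAS"] // cells
recomputed_mean_umis = summary["UMIs in Cells"] // cells

print("mean reads/cell:", recomputed_mean_reads, "reported:", summary["Mean Reads per Cell"])
print("mean UMI/cell:  ", recomputed_mean_umis, "reported:", summary["Mean UMI per Cell"])
```

Both recomputed values agree with the reported summary, which suggests the Summary.csv is internally consistent and that the much lower sci-dash values (2414 reads/cell, 376 UMIs/cell) come from how the sci-dash aggregates.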

J0bbie commented 10 months ago

Hi Yani,

Good catch! It was indeed generating a mean/sum based on all 'raw' cells / ambient RNA (instead of just the filtered cells). This was throwing the numbers off.
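A minimal sketch of how averaging over all raw barcodes pulls the per-cell mean down. The barcode names, counts, and the `mean_reads` helper are hypothetical and are not taken from the sci-rocket code; they only illustrate the effect.

```python
# Hypothetical per-barcode read counts: two real cells plus ambient-RNA barcodes.
reads_per_barcode = {
    "cell_A": 12000,
    "cell_B": 13000,
    "ambient_1": 50,
    "ambient_2": 30,
    "ambient_3": 20,
}

# Hypothetical set of barcodes that passed cell filtering.
filtered_barcodes = {"cell_A", "cell_B"}


def mean_reads(counts, barcodes=None):
    """Mean reads per barcode, optionally restricted to a set of barcodes."""
    if barcodes is not None:
        counts = {bc: n for bc, n in counts.items() if bc in barcodes}
    return sum(counts.values()) / len(counts)


mean_raw = mean_reads(reads_per_barcode)                          # all raw barcodes
mean_filtered = mean_reads(reads_per_barcode, filtered_barcodes)  # filtered cells only

print(mean_raw, mean_filtered)
```

Because ambient barcodes add many entries to the denominator but few reads to the numerator, the raw mean ends up far below the filtered-cell mean.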

I've fixed this in the latest commit and also made some other small changes to the sci-dash.

Just pull the latest code, delete the sci-dash folder of your run and start the snakemake workflow again. It should re-generate just the sci-dash.

Let me know if this fixed it for you!

Best,

Job

gauravvaidya16 commented 10 months ago

Hi Job,

I am facing a similar issue: both samples have identical stats on the sci-dash, but the STARsolo summary files for the two samples differ. Also, the successful read-pairs of the two samples added together are higher than the total input read-pairs.

Best, Gaurav

Below are the sci-dash JSON and the STARsolo summaries for each sample:

"sample_succes": {
"Pmor_50percPEG": {
  "n_pairs_success": 362401472,
  "total_reads": 175944916,
  "sequencing_saturation": 0.490957,
  "perc_mapped_reads_genome": 0.560275,
  "perc_unique_reads_genome_unique": 0.305331,
  "perc_mapped_reads_gene": 0.126788,
  "perc_unique_reads_gene_unique": 0.105158,
  "estimated_cells": 8953,
  "mean_reads_per_cell": 1458,
  "mean_umi_per_cell": 729,
  "mean_genes_per_cell": 560,
  "total_exonic_reads": 8109995,
  "total_intronic_reads": 7492202,
  "total_intergenic_reads": 46385009,
  "total_mitochondrial_reads": 0,
  "total_exonicAS_reads": 1492351,
  "total_intronicAS_reads": 3167266
}

"Pmor": {
  "n_pairs_success": 202924630,
  "total_reads": 175944916,
  "sequencing_saturation": 0.490957,
  "perc_mapped_reads_genome": 0.560275,
  "perc_unique_reads_genome_unique": 0.305331,
  "perc_mapped_reads_gene": 0.126788,
  "perc_unique_reads_gene_unique": 0.105158,
  "estimated_cells": 8953,
  "mean_reads_per_cell": 1458,
  "mean_umi_per_cell": 729,
  "mean_genes_per_cell": 560,
  "total_exonic_reads": 8109995,
  "total_intronic_reads": 7492202,
  "total_intergenic_reads": 46385009,
  "total_mitochondrial_reads": 0,
  "total_exonicAS_reads": 1492351,
  "total_intronicAS_reads": 3167266
}

STARsolo Summary for Pmor_50percPEG

STARsolo Summary for Pmor

J0bbie commented 10 months ago

I think I figured it out: it had to do with the similar sample names and the regular expression used to retrieve the STARsolo files: https://github.com/odomlab2/sci-rocket/commit/ad29488fd6362a7499691f38df3ff26f8e1e1b15

That is, Pmor and Pmor_50percPEG were getting the wrong statistics files because the wildcard search did not include the species. Could you try again with the latest code and see if it makes more sense now?
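The kind of ambiguity described here can be sketched as follows. The file names, the `mouse` species label, and the `find_summaries` helper are hypothetical and do not reflect the actual sci-rocket file layout or the code in the linked commit; the point is only that a sample name which is a prefix of another sample name matches both files under a loose wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical STARsolo summary files for two samples with a shared prefix.
files = [
    "Pmor_mouse_Summary.csv",
    "Pmor_50percPEG_mouse_Summary.csv",
]


def find_summaries(sample, species=None):
    """Return files matching a sample, optionally pinned to a species."""
    if species is None:
        pattern = f"{sample}*Summary.csv"            # loose: wildcard right after the sample name
    else:
        pattern = f"{sample}_{species}_Summary.csv"  # strict: no wildcard after the sample name
    return [f for f in files if fnmatchcase(f, pattern)]


loose = find_summaries("Pmor")            # matches both samples
strict = find_summaries("Pmor", "mouse")  # matches only the Pmor file

print(loose)
print(strict)
```

Pinning the part of the name that follows the sample (here the species) removes the overlap, so each sample resolves to exactly one summary file.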

gauravvaidya16 commented 9 months ago

Hi Job,

It fixed most of the stats. The one remaining problem is that the successful read-pairs of the two samples added together are still higher than the total input read-pairs.

J0bbie commented 9 months ago

That indeed sounds a bit fishy. I'll check whether I'm double-counting some reads somewhere. Are you using hashing barcodes for these samples, by any chance?

gauravvaidya16 commented 9 months ago

No, the samples are unhashed.

odomlab2 / sci-rocket

sci-dash incorrect information #29