sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
271 stars 67 forks source link

Analysis of the results generated by zUMIs #258

Closed HealHer closed 2 years ago

HealHer commented 3 years ago

Hello,

Thank you very much for zUMIs. The program is very stable and allows many things. I use zUMIs to align private data in order to get processing leads.

Running zUMIs on private datasets

We tried to run zUMIs on 2 datasets (p1 and p2), coming from 2 very close human patients with the same initial condition. Both run used 10x Chromium v2 sequencing protocol. The program runs (the .yaml files are available at the end of the issue). We use custom STAR indexes and a human Hg38 genome. The indexes are generated using:

STAR  --runMode genomeGenerate  
--runThreadN 12   
--genomeDir indexFolder 
--genomeFastaFiles genome.fa 
--genomeSAindexNbases 14 
--genomeChrBinNbits 18 
--genomeSAsparseD 3 
--limitGenomeGenerateRAM 17179869184 
--sjdbGTFfile genes.gtf

Error sometimes occuring

There is sometimes an error occuring on the Counting phase. Re-running the by changing the which_Stage option to Counting solve the issue. The error is always the same:

[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 2"
[1] "Processing 3122 barcodes in this chunk..."
[1] "Working on barcode chunk 2 out of 2"
[1] "Processing 3435 barcodes in this chunk..."
Error in alldt[[i]][[2]] : subscript out of bounds
Calls: bindList
In addition: Warning message:
In parallel::mclapply(mapList, function(tt) { :
  scheduled core 2 did not deliver a result, all values of the job will be affected
Execution halted

Compute output

The .rds files of the runs p1 and p2 are converted into anndata (.h5df and then .h5ad) by using the DropletUtils.write10xcounts() module.

rds_object <- readRDS(file)
dgecounts_exon <- rds_object$umicount$exon$all
write10xCounts('./exon_output.h5df', dgecounts_exon, overwrite=TRUE)

When we analyze the results by linear regression we obtain a horizontal orthogonal line proving that the genes aligned in run 1 are all different from run 2 (figure here)

We try reproducing these runs using cellranger in the same enviroment using custom indexes generated by the same method. After QC we detect many identical parameters (like number of cells, number of reads or distribution), but however a linear regression places the two runs (p1 and p2) with the same proportions following almost gene(p1) = gene(p2) (figure here)

In light of the .run.yaml files that allowed us to run the alignment and treatement, would it be possible for you to see if I'm forgetting any important options that could cause this?

Let me know if you need more information,

Thank you very much for your time

Alexis

Ressources

p1.run.yaml

Rscript_exec: Rscript
STAR_exec: STAR
barcodes:
  BarcodeBinning: 1
  automatic: true
  barcode_file: null
  barcode_num: null
  barcode_sharing: null
  demultiplex: false
  nReadsperCell: 100
counting_opts:
  Ham_Dist: 0
  downsampling: 0
  intronProb: false
  introns: true
  multi_overlap: false
  primaryHit: true
  strand: 0
  twoPass: true
  velocyto: false
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
make_stats: true
mem_limit: 65
num_threads: 12
out_dir: /mnt/vol-extra/workspace/alignment
pigz_exec: pigz
project: p1
reference:
  GTF_file: /mnt/vol_extra/workspace/reference_genome/refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf
  STAR_index: /mnt/vol-extra/workspace/reference_genome/STAR_index
  additional_STAR_params: --limitSjdbInsertNsj 1022629
  additional_files: null
  exon_extension: false
  extension_length: 0
  scaffold_length_min: 0
samtools_exec: samtools
sequence_files:
  file1:
    base_definition:
    - BC(1-16)
    - UMI(17-26)
    name: /mnt/vol-extra/scRNA_data/CD28_P1/CD28_P1_S1_L001_R1_001.fastq.gz
  file2:
    base_definition:
    - cDNA(1-101)
    name: /mnt/vol-extra/scRNA_data/CD28_P1/CD28_P1_S1_L001_R2_001.fastq.gz
  file3:
    base_definition:
    - UMI(1-8)
    name: /mnt/vol-extra/scRNA_data/CD28_P1/CD28_P1_S1_L001_I1_001.fastq.gz
which_Stage: Filtering
zUMIs_directory: ./zUMIs
read_layout: SE

p2.run.yaml

Rscript_exec: Rscript
STAR_exec: STAR
barcodes:
  BarcodeBinning: 1
  automatic: true
  barcode_file: null
  barcode_num: null
  barcode_sharing: null
  demultiplex: false
  nReadsperCell: 100
counting_opts:
  Ham_Dist: 0
  downsampling: 0
  intronProb: false
  introns: true
  multi_overlap: false
  primaryHit: true
  strand: 0
  twoPass: true
  velocyto: false
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
make_stats: true
mem_limit: 65
num_threads: 12
out_dir: /mnt/vol-extra/workspace/alignment
pigz_exec: pigz
project: p2
reference:
  GTF_file: /mnt/vol_extra/workspace/reference_genome/refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf
  STAR_index: /mnt/vol-extra/workspace/reference_genome/STAR_index
  additional_STAR_params: --limitSjdbInsertNsj 1022629
  additional_files: null
  exon_extension: false
  extension_length: 0
  scaffold_length_min: 0
samtools_exec: samtools
sequence_files:
  file1:
    base_definition:
    - BC(1-16)
    - UMI(17-26)
    name: /mnt/vol-extra/scRNA_data/CD28_P2/CD28_P2_S1_L001_R1_001.fastq.gz
  file2:
    base_definition:
    - cDNA(1-101)
    name: /mnt/vol-extra/scRNA_data/CD28_P2/CD28_P2_S1_L001_R2_001.fastq.gz
  file3:
    base_definition:
    - UMI(1-8)
    name: /mnt/vol-extra/scRNA_data/CD28_P2/CD28_P2_S1_L001_I1_001.fastq.gz
which_Stage: Filtering
zUMIs_directory: ./zUMIs
read_layout: SE
cziegenhain commented 3 years ago

Hi,

I'm not sure I follow all the explanations, however it seems there is a mistake in your YAML file: the I1_001.fastq.gzcontain BC not UMI in 10x data! Also, if each of the fastq files corresponds to one 10x sample, you do not need to use this file at all.

Hopefully this solves your issue. Best, C

HealHer commented 3 years ago

Hi,

Thank you very much for your quick answer.

Indeed, it is a single 10x sample. So I removed the definition of file 3 in the yaml file but this did not correct our problem.

QC shows that the number and quality of reads, cells are nearly identical however the representation by umap, linear regression and differential analysis shows big differences on the expressed genes.

My question is about the joint use of the 10x tool (cellranger) and zUMIs. The goal is not to benchmark but more to understand if the two outputs should be similar or not in the genes detected by the 2 tools on the same sample.

The same indexes were used in both cases with the same genome annotation file (.gtf)

Do you have any insight on this ?

Thank you very much for your time,

Alexis

cziegenhain commented 2 years ago

Feel free to reopen the issue if you still need assistance.