sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Smart-seq-3 data subsets systematically fail after counting due to chunking #351

Closed Romain-B closed 1 year ago

Romain-B commented 1 year ago

Hello & thanks for the great tool!

I have encountered a cryptic error when testing the pipeline on a 100k-read subset of my Smart-seq3 data. Subread runs properly and finds both (ex) and (in) counts, but the zUMIs follow-up then crashes, as shown below.

# after subread finishes
[1] "2023-02-23 09:56:52 CET"
[1] "Coordinate sorting intermediate bam file..."
[bam_sort_core] merging from 0 files and 6 in-memory blocks...
[1] "2023-02-23 09:56:52 CET"
[1] "Hamming distance collapse in barcode chunk 1 out of 1"
[1] "Splitting data for multicore hamming distance collapse..."
[1] "Setting up multicore cluster & generating molecule mapping tables ..."
[1] "Finished multi-threaded hamming distances"
[1] "Correcting UMI barcode tags..."
Loading molecule correction dictionary...
Correcting UB tags...
[1] "9e+07 Reads per chunk"
[1] "2023-02-23 09:57:07 CET"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 1 barcodes in this chunk..."
Error in eval(bysub, x, parent.frame()) : 
  object 'readcount_internal' not found
Calls: convert2countM -> .makewide -> [ -> [.data.table -> eval -> eval
Execution halted

readcount_internal is specific to Smart-seq3 (Ss3) processing in zUMIs, and after digging into the zUMIs functions I figured out that the issue lies in the umiCollapseID function.
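The error itself is the generic data.table message for a missing column; a minimal sketch (with made-up column names, not the actual zUMIs tables) reproduces it:

```r
library(data.table)

# A toy per-gene table that, like the failing chunk, never received a
# readcount_internal column.
dt <- data.table(gene = c("A", "B"), umicount = c(3L, 5L))

# Referencing the missing column inside `[.data.table` raises the same
# "object 'readcount_internal' not found" error seen in the log above.
msg <- tryCatch(
  dt[, sum(readcount_internal), by = gene],
  error = function(e) conditionMessage(e)
)
print(msg)
```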

By printing the number of non-UMI reads and the total number of reads just before the detection of internal reads for Ss3, it appears that zUMIs split the data into 6 chunks, which resulted in 2 small chunks where the number of non-UMI reads equals the total number of reads.

[1] "9e+07 Reads per chunk"
[1] "2023-02-23 10:12:52 CET"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 1 barcodes in this chunk..."
[1] "PRINT n_nonUMI followed by nreads"
[1] 115
[1] 115
[1] "PRINT n_nonUMI followed by nreads"
[1] 115
[1] 115
[1] "PRINT n_nonUMI followed by nreads"
[1] 79205
[1] 81314
[1] "PASSED THE IF STATEMENT L239"
[1] "PRINT n_nonUMI followed by nreads"
[1] 79320
[1] 81429
[1] "PASSED THE IF STATEMENT L239"
[1] "PRINT n_nonUMI followed by nreads"
[1] 79205
[1] 81314
[1] "PASSED THE IF STATEMENT L239"
[1] "PRINT n_nonUMI followed by nreads"
[1] 79320
[1] 81429
[1] "PASSED THE IF STATEMENT L239"
Error in eval(bysub, x, parent.frame()) : 
  object 'readcount_internal' not found
Calls: convert2countM -> .makewide -> [ -> [.data.table -> eval -> eval
Execution halted

Because this line only accepts chunks with strictly fewer non-UMI reads than total reads, the internal_reads object is never created for chunks made up entirely of non-UMI reads, which results in the error.

By changing

    if(n_nonUMI > 0 & n_nonUMI < nreads){ #detect mix of internal and UMI reads in Smartseq3

to

    if(n_nonUMI > 0 & n_nonUMI <= nreads){ #detect mix of internal and UMI reads in Smartseq3

the pipeline runs fine, which confirms this is the failing step.
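For illustration, the boundary case from the log above (a chunk with 115 reads, all of them non-UMI) can be checked directly; only the variable names here mirror the printout, the rest is a sketch:

```r
# Chunk where every read is an internal (non-UMI) read, as in the log.
n_nonUMI <- 115
nreads   <- 115

old_check <- n_nonUMI > 0 & n_nonUMI <  nreads  # FALSE: branch skipped, so
                                                # readcount_internal is never made
new_check <- n_nonUMI > 0 & n_nonUMI <= nreads  # TRUE: internal reads counted

print(c(old = old_check, new = new_check))
```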

I don't know how many reads one should use for testing, but given that the ratio of UMI to internal reads can vary in Ss3 data, I'm probably not in a unique situation.

I don't think there is any side effect to changing the condition above, but another fix could perhaps be to create the readcount_internal structure for all Ss3 chunks, regardless of their composition.
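A hypothetical sketch of that alternative (the function and column names are illustrative, not the actual zUMIs internals): initialize the column for every chunk, so the downstream counting step always finds it.

```r
library(data.table)

# Always create readcount_internal, even for chunks with no internal reads.
count_internal_reads <- function(reads) {
  out <- data.table(gene = unique(reads$gene), readcount_internal = 0L)
  internal <- reads[is_umi == FALSE]   # non-UMI (internal) reads
  if (nrow(internal) > 0) {
    counts <- internal[, .N, by = gene]
    out[counts, readcount_internal := i.N, on = "gene"]  # update join
  }
  out
}

# An all-UMI chunk no longer drops the column; it is simply all zeros.
chunk <- data.table(gene = c("A", "A", "B"), is_umi = c(TRUE, TRUE, TRUE))
print(count_internal_reads(chunk))
```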

What do you think?

cziegenhain commented 1 year ago

Hi,

Sorry for the slow reply, but thank you for hunting down the issue so systematically! I agree, this is probably just a very unlucky case I have not encountered before because I typically test with ~10M reads at least. I don't see any issues with the suggested change and will push an update shortly!

Again, many thanks, Christoph