sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

some warning report and question about barcodes #344

Closed hulilalia closed 1 year ago

hulilalia commented 1 year ago

I used zUMIs to preprocess Smart-seq3 data, but I ran into an error during the counting stage.

[1] "Coordinate sorting intermediate bam file..."
[bam_sort_core] merging from 0 files and 15 in-memory blocks...
[1] "2022-12-27 11:17:27 CST"
[1] "Hamming distance collapse in barcode chunk 1 out of 1"
[1] "Splitting data for multicore hamming distance collapse..."
[1] "Setting up multicore cluster & generating molecule mapping tables ..."
[1] "Finished multi-threaded hamming distances"
[1] "Correcting UMI barcode tags..."
Loading molecule correction dictionary...
Correcting UB tags...
[1] "4.5e+08 Reads per chunk"
[1] "2022-12-27 11:23:35 CST"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 768 barcodes in this chunk..."
Warning message:
In parallel::mclapply(mapList, function(tt) { :
  all scheduled cores encountered errors in user code
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'strsplit': object 'GE' not found
Calls: convert2countM ... .makewide -> unlist -> strsplit -> .handleSimpleError -> h
Execution halted

Here is my error log. Sorry that some of the original messages were in Chinese; the log tells me that something fails while 'zUMIs-dge2.R' is running.
The warning message is the same as the one hcph reported in #187. These messages suggest that the tag "GE" cannot be found in my sorted.bam file, but the GE tag is in fact present there.
If my understanding is right, can anyone tell me what is going wrong?

hulilalia commented 1 year ago

I have now found that the error comes from the function .sampleReads4collapsing. This function processes the reads data.table, and the error is raised by the sample call inside .sampleReads4collapsing. What confuses me is that the function works on my laptop but not on my workstation. In R 4.0.5, sample works with two vectors, but in R 4.2.2 it does not. That is why the run failed on my workstation but succeeded on my laptop. So what is the purpose of sampling row numbers? I am really confused.

cziegenhain commented 1 year ago

Hi hulilalia,

Sorry for the slow reply over the holiday period. I am fairly certain about the origin of this error: I suspect that the default behavior of one of the base functions has changed with a new R version. Anyway, I pushed a small update that hopefully fixes this issue, let me know!

As to the use of this function: we sample the reads that count towards gene expression values. This is helpful for the downsampling functionality that is unique to zUMIs (and of course, in the default case of considering all reads, all reads get sampled).
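To illustrate the idea of sampling reads per cell, here is a minimal Python sketch. It is not the zUMIs implementation (zUMIs does this in R on a data.table); the function name `downsample_reads`, the `max_reads_per_cell` parameter, and the toy barcodes are hypothetical, and keeping all reads of cells below the target depth is a simplifying assumption.

```python
import random
from collections import defaultdict


def downsample_reads(reads, max_reads_per_cell=None, seed=0):
    """Sample reads per cell barcode (illustrative sketch, not zUMIs code).

    reads: iterable of (barcode, gene) tuples.
    max_reads_per_cell: target depth; None means "use all reads".
    Returns a dict mapping each barcode to the list of sampled genes.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    by_cell = defaultdict(list)
    for barcode, gene in reads:
        by_cell[barcode].append(gene)

    kept = {}
    for barcode, genes in by_cell.items():
        if max_reads_per_cell is None or len(genes) <= max_reads_per_cell:
            # Default case: every read is "sampled", nothing is dropped.
            kept[barcode] = list(genes)
        else:
            # Downsampling case: draw reads without replacement.
            kept[barcode] = rng.sample(genes, max_reads_per_cell)
    return kept


# Toy input: one cell with 5 reads, one cell with 2 reads.
reads = [("AAAC", "Actb")] * 5 + [("GGTT", "Gapdh")] * 2
all_reads = downsample_reads(reads)
capped = downsample_reads(reads, max_reads_per_cell=3)
print({bc: len(g) for bc, g in all_reads.items()})
print({bc: len(g) for bc, g in capped.items()})
```

With no depth limit the sampling step is a no-op, which matches the remark that in the default case all reads get sampled.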

Best, Christoph

hulilalia commented 1 year ago

Thank you for your reply.

I have confirmed that your corrected version works well. However, I still get some warnings in my run:

[W::bam_hdr_read] bgzf_check_EOF: Invalid argument
[E::bam_hdr_read] Invalid BAM binary header
[bam_cat] ERROR: couldn't read header for '/home/XXX/smartseqv3/test_AN_20221011/outs/zUMIs_output/.tmpMap //tmp.Smartseq3_AN_20221011.2.Aligned.toTranscriptome.out.bam'.
[W::bam_hdr_read] bgzf_check_EOF: Invalid argument
[E::bam_hdr_read] Invalid BAM binary header
[bam_cat] ERROR: couldn't read header for '/home/XXX/smartseqv3/test_AN_20221011/outs/zUMIs_output/.tmpMap //tmp.Smartseq3_AN_20221011.2.Aligned.out.bam'.

I also have another question: all of my cell barcodes are listed in 'xx_kept_barcodes.txt' in the zUMIs_output directory, but the reads of some cells are missing from 'xx_kept_barcodes.txt.BCUMIstats.txt' in the same directory. Does this mean there is a mistake in my configuration file?
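One way to quantify the barcode mismatch described above is to compare the two files directly. The sketch below is only an illustration: it assumes that both files have a header row and carry the barcode in their first column, and that the first file is comma-separated and the second tab-separated. These layout details, the helper names, and the toy file contents are assumptions that should be checked against the real output.

```python
import csv
import os
import tempfile


def read_barcodes(path, delimiter, column=0, has_header=True):
    """Return the set of barcodes found in one column of a delimited file."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter=delimiter)
        if has_header:
            next(rows, None)  # skip the (assumed) header row
        return {row[column] for row in rows if row}


def missing_barcodes(kept_path, stats_path, kept_delim=",", stats_delim="\t"):
    """Barcodes listed in the kept-barcodes file but absent from the stats file."""
    kept = read_barcodes(kept_path, kept_delim)
    seen = read_barcodes(stats_path, stats_delim)
    return sorted(kept - seen)


# Toy demonstration with made-up file contents (not real zUMIs output).
with tempfile.TemporaryDirectory() as tmp:
    kept_file = os.path.join(tmp, "kept_barcodes.txt")
    stats_file = os.path.join(tmp, "BCUMIstats.txt")
    with open(kept_file, "w") as fh:
        fh.write("XC,n\nAAAC,100\nGGTT,50\nCCGA,10\n")
    with open(stats_file, "w") as fh:
        fh.write("XC\tnreads\nAAAC\t90\nGGTT\t45\n")
    demo_missing = missing_barcodes(kept_file, stats_file)

print(demo_missing)
```

Reporting the number of missing barcodes together with their read counts from the kept-barcodes file makes it easier to tell whether whole cells are dropped or only a few low-depth ones.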

Again, thank you for your help!

hulilalia commented 1 year ago

I found that the pipeline only works in my environment if I remove the code in zUMIs.sh that deletes the temporary file folder.

cziegenhain commented 1 year ago

Hi, regarding your questions:

  • [W::bam_hdr_read] bgzf_check_EOF: Invalid argument [E::bam_hdr_read] Invalid BAM binary header [bam_cat]: I would assume that there was an error/problem during one of the STAR jobs, so please double check the log carefully.
  • missing barcodes between kept_barcodes.txt and kept_barcodes.txt.BCUMIstats.txt: very odd, could you provide more details here? How many barcodes are missing, and how many reads should they have according to kept_barcodes?
  • regarding your last comment about changing the code in zUMIs.sh: We cannot support any custom-modified versions of zUMIs; the pipeline runs as posted on GitHub and modifications should not be needed.

Best, Christoph

hulilalia commented 1 year ago

Hi, Thank you for your quick reply!

In my last failed run, zUMIs mapped all of the reads and produced three separate 'Aligned.out.bam' files. Because some errors were reported for them, I changed the code to retain the .tmp_map folder, and I found that one of the files was empty! Maybe that is why my barcodes were lost in the counting phase.
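For anyone seeing the same symptom, a quick sanity check on the intermediate BAM chunks can reveal an empty or truncated file before they are concatenated. The sketch below is not part of zUMIs; it relies only on two facts from the SAM/BAM specification (a BAM file starts with the magic bytes `BAM\1` once decompressed, and ends with a fixed 28-byte BGZF EOF block). The function name `check_bam` and the toy files are hypothetical. `samtools quickcheck` offers a comparable check on the command line.

```python
import gzip
import os
import tempfile
import zlib

# The 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200" "1b000300" "00000000" "00000000"
)


def check_bam(path):
    """Return a short status string for a BAM file (illustrative sketch)."""
    size = os.path.getsize(path)
    if size == 0:
        return "empty file"
    tail = b""
    if size >= len(BGZF_EOF):
        with open(path, "rb") as fh:
            fh.seek(-len(BGZF_EOF), os.SEEK_END)
            tail = fh.read()
    try:
        # BGZF is a series of gzip members, so the gzip module can read it.
        with gzip.open(path, "rb") as gz:
            magic = gz.read(4)
    except (OSError, EOFError, zlib.error):
        return "not readable as BGZF/gzip"
    if magic != b"BAM\x01":
        return "missing BAM magic bytes"
    if tail != BGZF_EOF:
        return "missing BGZF EOF marker (possibly truncated)"
    return "ok"


# Toy files: just enough bytes to exercise the checks, not real alignments.
with tempfile.TemporaryDirectory() as tmp:
    payload = gzip.compress(b"BAM\x01" + b"\x00" * 8)
    contents = {
        "good.bam": payload + BGZF_EOF,
        "empty.bam": b"",
        "truncated.bam": payload,
    }
    results = {}
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(data)
        results[name] = check_bam(path)

for name, status in results.items():
    print(name, "->", status)
```

Running such a check over every file in the temporary mapping folder would point to the chunk whose mapping job failed, which is the log to inspect first.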

After I changed the 'mem_limit' parameter in the configuration file from null to 60, the error I reported last time disappeared.

My problem has been solved. Thank you for your help!

I hope that my report will help you improve zUMIs.