sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

double free or corruption in R #276

Closed YOU-k closed 3 years ago

YOU-k commented 3 years ago

Hi, to whom it may concern: I am currently using zUMIs on a 10xv3 dataset and got an error. Here is my yaml file: `
###########################################
#Welcome to zUMIs
#below, please fill the mandatory inputs
#We expect full paths for all files.
###########################################

#define a project name that will be used to name output files
project: pbmc10k

#Sequencing File Inputs:
#For each input file, make one list object & define path and barcode ranges
#base definition vocabulary: BC(n) UMI(n) cDNA(n).
#Barcode range definition needs to account for all ranges. You can give several comma-separated ranges for BC & UMI sequences, eg. BC(1-6,20-26)
#you can specify between 1 and 4 input files
sequence_files:
  file1:
    name: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/pbmc_data/raw_data/10k_pbmc/com10kpbmc_S1_L001_R1.fastq.gz
    base_definition:

#reference genome setup
reference:
  STAR_index: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess/zumis/homo_star_idx
  GTF_file: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess/data/homo_ercc.gtf
  exon_extension: no #extend exons by a certain width?
  extension_length: 0 #number of bp to extend exons by
  scaffold_length_min: 0 #minimal scaffold/chromosome length to consider (0 = all)
  additional_files: null
  additional_STAR_params: null

#output directory
out_dir: /stornext/HPCScratch/home/you.y/preprocess_update/raw_results/zumis_old/pbmc10k

###########################################
#below, you may optionally change default parameters
###########################################

#number of processors to use
num_threads: 128
mem_limit: 400 #Memory limit in Gigabytes, null meaning unlimited RAM usage.

#barcode & UMI filtering options
#number of bases under the base quality cutoff that should be filtered out.
#Phred score base-cutoff for quality control.
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20

#Options for Barcode handling
#You can give either number of top barcodes to use or give an annotation of cell barcodes.
#If you leave both barcode_num and barcode_file empty, zUMIs will perform automatic cell barcode selection for you!
barcodes:
  barcode_num: null
  barcode_file: null
  barcode_sharing: null #Optional for combining several barcode sequences per cell (see github wiki)
  automatic: yes #Give yes/no to this option. If the cell barcodes should be detected automatically. If the barcode file is given in combination with automatic barcode detection, the list of given barcodes will be used as whitelist.
  BarcodeBinning: 1 #Hamming distance binning of close cell barcode sequences.
  nReadsperCell: 0 #Keep only the cell barcodes with atleast n number of reads.
  demultiplex: no #produce per-cell demultiplexed bam files.

#Options related to counting of reads towards expression profiles
counting_opts:
  introns: yes #can be set to no for exon-only counting.
  intronProb: no #perform an estimation of how likely intronic reads are to be derived from mRNA by comparing to intergenic counts.
  downsampling: 0 #Number of reads to downsample to. This value can be a fixed number of reads (e.g. 10000) or a desired range (e.g. 10000-20000) Barcodes with less than will not be reported. 0 means adaptive downsampling. Default: 0.
  strand: 0 #Is the library stranded? 0 = unstranded, 1 = positively stranded, 2 = negatively stranded
  Ham_Dist: 1 #Hamming distance collapsing of UMI sequences.
  velocyto: no #Would you like velocyto to do counting of intron-exon spanning reads
  primaryHit: yes #Do you want to count the primary Hits of multimapping reads towards gene expression levels?
  multi_overlap: no #Do you want to assign reads overlapping to multiple features?
  fraction_overlap: 0 #minimum required fraction of the read overlapping with the gene for read assignment to genes
  twoPass: yes #perform basic STAR twoPass mapping

#produce stats files and plots?
make_stats: yes

#Start zUMIs from stage. Possible TEXT(Filtering, Mapping, Counting, Summarising). Default: Filtering.
which_Stage: Filtering

#define dependencies program paths
samtools_exec: samtools #samtools executable
Rscript_exec: Rscript #Rscript executable
STAR_exec: STAR #STAR executable
pigz_exec: pigz #pigz executable

#below, fqfilter will add a read_layout flag defining SE or PE
zUMIs_directory: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs
`

But I get an error like the following (the output is very long, so I only copied the head and the end of it): `

You provided these parameters:
 YAML file: zUMIs.yaml
 zUMIs directory: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs
 STAR executable STAR
 samtools executable samtools
 pigz executable pigz
 Rscript executable Rscript
 RAM limit: 400
 zUMIs version 2.9.7

Wed Aug 11 19:56:32 AEST 2021 WARNING: The STAR version used for mapping is 2.6.1c and the STAR index was created using the version 20201. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.6.1c. Filtering... Wed Aug 11 20:24:54 AEST 2021 [1] "83609 barcodes detected." [1] "18487064 reads were assigned to barcodes that do not correspond to intact cells." Error in `/stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R': double free or corruption (!prev): 0x000000000a72fbc0 ======= Backtrace: ========= /lib64/libc.so.6(+0x81329)[0x2b27ee194329] /stornext/Home/data/allstaff/y/you.y/R/x86_64-pc-linux-gnu-library/4.0/data.table/libs/datatable.so(+0x216ae)[0x2b27fa0756ae] /stornext/Home/data/allstaff/y/you.y/R/x86_64-pc-linux-gnu-library/4.0/data.table/libs/datatable.so(forder+0x5e9)[0x2b27fa078ad9] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0xf95d4)[0x2b27ed4ea5d4] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x13a2ae)[0x2b27ed52b2ae] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x70)[0x2b27ed53c170] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x14cf5f)[0x2b27ed53df5f] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_applyClosure+0x1a2)[0x2b27ed53ee22] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x13ce0e)[0x2b27ed52de0e] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x70)[0x2b27ed53c170] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x14cf5f)[0x2b27ed53df5f] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_applyClosure+0x1a2)[0x2b27ed53ee22] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x13ce0e)[0x2b27ed52de0e] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x70)[0x2b27ed53c170] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x14cf5f)[0x2b27ed53df5f] 
/stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_applyClosure+0x1a2)[0x2b27ed53ee22] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x18fdb4)[0x2b27ed580db4] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x1901f0)[0x2b27ed5811f0] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x12d612)[0x2b27ed51e612] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x1326b5)[0x2b27ed5236b5] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x70)[0x2b27ed53c170] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x14cf5f)[0x2b27ed53df5f] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_applyClosure+0x1a2)[0x2b27ed53ee22] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x27f)[0x2b27ed53c37f] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x150172)[0x2b27ed541172] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_eval+0x54b)[0x2b27ed53c64b] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(Rf_ReplIteration+0x252)[0x2b27ed56f002] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(+0x17e380)[0x2b27ed56f380] /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so(run_Rmainloop+0x48)[0x2b27ed56f418] /stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R(main+0x1b)[0x40075b] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b27ee135555] /stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R[0x40078b] ======= Memory map: ======== 00400000-00401000 r-xp 00000000 00:28 838638208 /stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R 00600000-00601000 r--p 00000000 00:28 838638208 /stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R 00601000-00602000 rw-p 00001000 00:28 838638208 /stornext/System/data/apps/R/R-4.0.3/lib64/R/bin/exec/R 0068f000-4bdfa000 rw-p 00000000 00:00 0 [heap] 2b27ed1cd000-2b27ed1ef000 r-xp 00000000 fd:02 37007691 /usr/lib64/ld-2.17.so 2b27ed1ef000-2b27ed1f3000 rw-p 00000000 00:00 0 2b27ed1f3000-2b27ed1fa000 r--s 00000000 fd:02 
33560086 /usr/lib64/gconv/gconv-modules.cache 2b27ed1fa000-2b27ed1fb000 r--p 00000000 00:28 839390733 /stornext/System/data/apps/R/R-4.0.3/lib64/R/library/translations/en/LC_MESSAGES/R.mo 2b27ed1fb000-2b27ed1fe000 rw-p 00000000 00:00 0 2b27ed200000-2b27ed205000 r--s 00000000 fd:02 68078929 /usr/lib/fontconfig/cache/a29c3d10-168a-4dbd-aa37-ca8e3fc46e77-le64.cache-7 2b27ed205000-2b27ed207000 r--s 00000000 fd:02 69415865 /usr/lib/fontconfig/cache/a1a65367-b968-4a84-abfc-533e283c233c-le64.cache-7 2b27ed207000-2b27ed277000 rw-p 00000000 00:00 0 2b27ed277000-2b27ed287000 r--s 00000000 fd:02 69010359 /usr/lib/fontconfig/cache/4f05e3a7-bb5d-4ec4-a077-5c513b1e1790-le64.cache-7 2b27ed287000-2b27ed28f000 r--s 00000000 fd:02 69350775 /usr/lib/fontconfig/cache/51a39caf-4ec0-4b29-9e9a-d850a04ad85e-le64.cache-7 2b27ed28f000-2b27ed2a4000 r--p 00000000 fd:02 101078123 /usr/share/fonts/urw-base35/NimbusSans-Regular.otf 2b27ed2a4000-2b27ed2b9000 r--p 00000000 fd:02 101078114 /usr/share/fonts/urw-base35/NimbusSans-Bold.otf 2b27ed2b9000-2b27ed2cf000 r--p 00000000 fd:02 101078120 /usr/share/fonts/urw-base35/NimbusSans-Italic.otf 2b27ed2cf000-2b27ed2e5000 r--p 00000000 fd:02 101078117 /usr/share/fonts/urw-base35/NimbusSans-BoldItalic.otf 2b27ed2e5000-2b27ed2ed000 r--p 00000000 fd:02 101176089 /usr/share/fonts/urw-base35/StandardSymbolsPS.t1 2b27ed2f8000-2b27ed359000 rw-p 00000000 00:00 0 2b27ed3ee000-2b27ed3ef000 r--p 00021000 fd:02 37007691 /usr/lib64/ld-2.17.so 2b27ed3ef000-2b27ed3f0000 rw-p 00022000 fd:02 37007691 /usr/lib64/ld-2.17.so 2b27ed3f0000-2b27ed3f1000 rw-p 00000000 00:00 0 2b27ed3f1000-2b27ed71c000 r-xp 00000000 00:28 838638272 /stornext/System/data/apps/R/R-4.0.3/lib64/R/lib/libR.so .............

2b2a40000000-2b2a40249000 rw-p 00000000 00:00 0 2b2a40249000-2b2a44000000 ---p 00000000 00:00 0 7ffdf386f000-7ffdf38ce000 rw-p 00000000 00:00 0 [stack] 7ffdf39bc000-7ffdf39be000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs/zUMIs.sh: line 285: 30271 Aborted (core dumped) ${Rexc} ${zumisdir}/zUMIs-BCdetection.R ${yaml} Mapping... [1] "2021-08-11 20:30:18 AEST" [E::hts_open_format] Failed to open file "NA" : No such file or directory samtools view: failed to open "NA" for reading: No such file or directory [E::hts_open_format] [E::hts_open_format] Failed to open file "NA" : No such file or directoryFailed to open file "NA" : No such file or directory `

Do you have any suggestion as to what the issue might be? Many thanks! Chloe

cziegenhain commented 3 years ago

Hi,

This one is a bit cryptic; it seems like an error in one of the R libraries. The only thing that comes to mind: I have had some mysterious crashes when using many threads in R data.table. Can you try the analysis with e.g. 48 CPUs instead of 128?
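In the yaml file, the thread count is the `num_threads` entry, so the suggestion above amounts to a one-line change. A minimal sketch, reusing the `mem_limit` value from the original yaml; whether 48 threads actually avoids the data.table crash is an assumption to be tested, not a confirmed fix:

```yaml
#number of processors to use
num_threads: 48 #lowered from 128, as suggested above (assumption: fewer threads avoids the crash)
mem_limit: 400 #unchanged from the original yaml
```

All other entries can stay as they are.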

Unrelated to the crash, I have some suggestions for your yaml file: the barcode and UMI filtering is very strict, allowing only 1 base under phred 20, considering the very long BC & UMI of 10x data. Also consider adding the known 10x barcode whitelist to make sure your automatically selected cells are true droplets.

Best, Christoph

YOU-k commented 3 years ago

Thanks Christoph, I managed to make zUMIs work, but I found it is quite slow.

Here is the new yaml file: `
###########################################
#Welcome to zUMIs
#below, please fill the mandatory inputs
#We expect full paths for all files.
###########################################

#define a project name that will be used to name output files
project: pbmc5k

#Sequencing File Inputs:
#For each input file, make one list object & define path and barcode ranges
#base definition vocabulary: BC(n) UMI(n) cDNA(n).
#Barcode range definition needs to account for all ranges. You can give several comma-separated ranges for BC & UMI sequences, eg. BC(1-6,20-26)
#you can specify between 1 and 4 input files
sequence_files:
  file1:
    name: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/pbmc_data/raw_data/5k_pbmc/com5kpbmc_S1_L001_R1.fastq.gz
    base_definition:

#reference genome setup
reference:
  STAR_index: /stornext/HPCScratch/home/you.y/preprocess_update/raw_results/zumis/human_star_idx/
  GTF_file: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess/data/Homo_sapiens.GRCh38.98.gtf
  exon_extension: no #extend exons by a certain width?
  extension_length: 0 #number of bp to extend exons by
  scaffold_length_min: 0 #minimal scaffold/chromosome length to consider (0 = all)
  additional_files: null
  additional_STAR_params: null

#output directory
out_dir: /stornext/HPCScratch/home/you.y/preprocess_update/raw_results/zumis/pbmc5k

###########################################
#below, you may optionally change default parameters
###########################################

#number of processors to use
num_threads: 32
mem_limit: null #Memory limit in Gigabytes, null meaning unlimited RAM usage.

#barcode & UMI filtering options
#number of bases under the base quality cutoff that should be filtered out.
#Phred score base-cutoff for quality control.
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20

#Options for Barcode handling
#You can give either number of top barcodes to use or give an annotation of cell barcodes.
#If you leave both barcode_num and barcode_file empty, zUMIs will perform automatic cell barcode selection for you!
barcodes:
  barcode_num: null
  barcode_file: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess/data/10xv3_whitelist.txt
  barcode_sharing: null #Optional for combining several barcode sequences per cell (see github wiki)
  automatic: no #Give yes/no to this option. If the cell barcodes should be detected automatically. If the barcode file is given in combination with automatic barcode detection, the list of given barcodes will be used as whitelist.
  BarcodeBinning: 1 #Hamming distance binning of close cell barcode sequences.
  nReadsperCell: 0 #Keep only the cell barcodes with atleast n number of reads.
  demultiplex: no #produce per-cell demultiplexed bam files.

#Options related to counting of reads towards expression profiles
counting_opts:
  introns: yes #can be set to no for exon-only counting.
  intronProb: no #perform an estimation of how likely intronic reads are to be derived from mRNA by comparing to intergenic counts.
  downsampling: 0 #Number of reads to downsample to. This value can be a fixed number of reads (e.g. 10000) or a desired range (e.g. 10000-20000) Barcodes with less than will not be reported. 0 means adaptive downsampling. Default: 0.
  strand: 0 #Is the library stranded? 0 = unstranded, 1 = positively stranded, 2 = negatively stranded
  Ham_Dist: 1 #Hamming distance collapsing of UMI sequences.
  velocyto: no #Would you like velocyto to do counting of intron-exon spanning reads
  primaryHit: yes #Do you want to count the primary Hits of multimapping reads towards gene expression levels?
  multi_overlap: no #Do you want to assign reads overlapping to multiple features?
  fraction_overlap: 0 #minimum required fraction of the read overlapping with the gene for read assignment to genes
  twoPass: yes #perform basic STAR twoPass mapping

#produce stats files and plots?
make_stats: no

#Start zUMIs from stage. Possible TEXT(Filtering, Mapping, Counting, Summarising). Default: Filtering.
which_Stage: Filtering

#define dependencies program paths
samtools_exec: samtools #samtools executable
Rscript_exec: Rscript #Rscript executable
STAR_exec: STAR #STAR executable
pigz_exec: pigz #pigz executable

#below, fqfilter will add a read_layout flag defining SE or PE
zUMIs_directory: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs
`

And here is the new shell script: `#!/bin/bash
#SBATCH --time=90:00:00
#SBATCH --cpus-per-task=32
#SBATCH --mem=600G
#SBATCH --mail-type=BEGIN,END,FAIL

project="pbmc5k"
tool_p="/stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs"
work_p="/stornext/HPCScratch/home/you.y/preprocess_update/raw_results/zumis"
cd $work_p
cd $project
module load samtools
module load R/4.0.3
module load STAR/2.6.1c
module load pigz

$tool_p/zUMIs.sh -y zUMIs.yaml`

And here are the messages I have so far: `

You provided these parameters:
 YAML file: zUMIs.yaml
 zUMIs directory: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess_update/tools/zUMIs
 STAR executable STAR
 samtools executable samtools
 pigz executable pigz
 Rscript executable Rscript
 RAM limit: null
 zUMIs version 2.9.7

Mon Aug 16 15:26:29 AEST 2021
WARNING: The STAR version used for mapping is 2.6.1c and the STAR index was created using the version 20201. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.6.1c.
Filtering...
Mon Aug 16 16:01:37 AEST 2021
[1] "8626434 reads were assigned to barcodes that do not correspond to intact cells."`

Do you have any idea why the cell barcode binning takes so long, and how I can speed it up? Many thanks! Chloe

cziegenhain commented 3 years ago

Hi Chloe,

> I managed to make zUMIs work.

Good! Was there anything else you needed to change, other than reducing the number of threads, to get rid of this particular error? Do you know if the error was reproducible at the same stage when running several times? Thanks for this information; I will collect issues like this so that I can eventually contact the developers of the data.table package.

> But found it is quite slow.

Unfortunately, the barcode error correction step can bog down if there is a huge number of cell barcodes. I suggest that you combine the 10x barcode whitelist with automatic barcode detection (to restrict yourself to the real cells as opposed to the millions of annotated 10x barcodes) by setting `automatic: yes`. I'm sure that will help speed things up a lot! An alternative is to increase the `nReadsperCell:` parameter.
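As a sketch of what this looks like in the `barcodes:` section of the yaml, using the whitelist path already given in the yaml above and leaving the other entries unchanged (the `nReadsperCell` value shown is the existing one; any higher threshold would be a choice to tune per dataset):

```yaml
barcodes:
  barcode_num: null
  barcode_file: /stornext/General/data/user_managed/grpu_mritchie_1/Yue/preprocess/data/10xv3_whitelist.txt
  barcode_sharing: null
  automatic: yes #with a barcode file given, the file acts as whitelist for automatic detection
  BarcodeBinning: 1
  nReadsperCell: 0 #alternative speed-up: raise this to drop barcodes with very few reads
  demultiplex: no
```

With `automatic: no`, every barcode in the whitelist is a candidate for binning, which is the slow case described above.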

Some other comments on your yaml file:

This depends on the sequencing quality but I typically apply the following settings:

filter_cutoffs:
  BC_filter:
    num_bases: 5
    phred: 20
  UMI_filter:
    num_bases: 4
    phred: 20

This will help to retain a good number of reads per cell.

I noticed that you are the first author on the recent bioRxiv preprint comparing preprocessing tools including zUMIs. Good to see that so far zUMIs was doing decently, and feel free to get in touch if you come across issues or surprising results that you would like to discuss. Since there is such a wide variety of parameters for the gene assignment & counting (e.g. STAR 2-pass, feature overlap, multimapping, etc.), I'm sure many results are very dependent on the exact combination of settings and data at hand.

Best, Christoph

YOU-k commented 3 years ago

Hi Christoph, Thanks for all these suggestions. It works if I set nReadsperCell to 10. It runs much faster. Cheers, Chloe

cziegenhain commented 3 years ago

Feel free to reopen the issue if you still need assistance.