haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Seemingly empty files trigger segmentation fault error #5

Closed tzeitim closed 2 years ago

tzeitim commented 3 years ago

Using the pre-compiled version, I get the following error:

./chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --SAM
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in SAM format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 4.05s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 7.85s.
Mapped all reads in 0.53s.
Number of reads: 0.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 0.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

The fastq files are not empty, as the error message would suggest. However, they are not canonical Hi-C but actually 4C, and the read lengths of the mates are rather unbalanced (r1 = 28 bp vs r2 = 120 bp). Could this be violating some assumptions of the software?

head r1.fq
@NS500648:585:HK55NAFX2:1:11101:9001:1025 1:N:0:ACAAGGTA
GCCACNACCTGTTCCTGTGTATCTGCTA
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE
@NS500648:585:HK55NAFX2:1:11101:13450:1033 1:N:0:ACAAGGTA
GCCACNACCTGTTCCTGTTGAGGATTTG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE
@NS500648:585:HK55NAFX2:1:11101:12961:1039 1:N:0:CGTCCCGT
GCCACNACCTGTTCCTGTTTTCTCTCTG
head r2.fq
@NS500648:585:HK55NAFX2:1:11101:9001:1025 2:N:0:ACAAGGTA
NNNNNGCATGCAAAGTCAAAAAACACTNTCATGGTCTTATAATCTGCATTTATTTTTACCTAATNNNCCNNNNGACTCNNNTANNNNTCNTNCANTGATTCNNNTGNTTCCAANNCCNNN
+
#####EEEEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE###EE####E<EEE###A/####EE#E#AE#EEE/AA###AA#AAAAA/##<E###
@NS500648:585:HK55NAFX2:1:11101:13450:1033 2:N:0:ACAAGGTA
NNNTGTTATTGTTACATTTTTCTTACAGCCCTTTAACAATCTCATTTCATTTTCATCCGCTTTCGGGGCTGGGTCATGAAGGCCGCAGTCTTAGGTAAGAACTCCACATACATAGTCCCG
+
###AAAEEEEEEE/EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEAEEEAEEEEEEE/EEEAAEEEEAEEEEEEEEEEAEEEAEE<AEE<AAA<AAEA6<6AA<EEEAEEEE/
@NS500648:585:HK55NAFX2:1:11101:12961:1039 2:N:0:CGTCCCGT
NNNCAATGGAAGTAAGGGTCCAAACGCAGTTTATTAACAGCTAAGCCAAGCAGGCAAGGGTGAAACTTTGGCAAACGGAATTATGTGGGTAATCCAATAGCACAGTTGTAGTCACAGACT

sed -n '2~4p' r1.fq | wc -l
664948
sed -n '2~4p' r2.fq | wc -l
664948
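
The length imbalance itself can be checked with a quick tally (not part of the session above, just a sketch):

# summarize the read-length distribution of each mate file
awk 'NR % 4 == 2 {print length($0)}' r1.fq | sort -n | uniq -c   # expect a single peak at 28
awk 'NR % 4 == 2 {print length($0)}' r2.fq | sort -n | uniq -c   # expect a single peak at 120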

Edit: The software runs well using the provided fastqs and test ref.

mourisl commented 3 years ago

Just in case this is a binary incompatibility issue, could you please install Chromap through conda/bioconda (https://anaconda.org/bioconda/chromap)? Thank you.
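
For reference, an installation along these lines should work (the environment name is arbitrary):

conda create -n chromap -c bioconda -c conda-forge chromap
conda activate chromap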

tzeitim commented 3 years ago

Will do. I had missed conda as an installation option. Thanks for the pointer.

tzeitim commented 3 years ago

After installing chromap (build h9a82719_0) via conda and regenerating the index, the error persists when invoking chromap with the Hi-C preset:

which chromap
/local/users/polivar/miniconda3/envs/chromap/bin/chromap

chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --SAM
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in SAM format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 3.36s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 7.93s.
Mapped all reads in 0.53s.
Number of reads: 0.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 0.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

mourisl commented 3 years ago

We just took a deeper look at your data. It seems that since one of the ends is only 28 bp, it falls below the --min-read-length threshold (30), so all the reads were filtered out and the output file ended up empty. Even though seeing 0 reads in the summary is expected in that case, it should not segfault, and we will fix it.
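
As a quick check on your end, the number of read1s below the default threshold can be counted directly (a sketch, using the file names from your command):

# count read1 entries shorter than the default --min-read-length of 30
awk 'NR % 4 == 2 && length($0) < 30' r1.fq | wc -l   # with 28 bp read1s this should equal the total read count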

Just out of curiosity, why is the read1 length only 28 bp? Thank you.

tzeitim commented 3 years ago

Thanks for the quick response.

    Just out of curiosity, why is the read1 length only 28 bp? Thank you.

Sorry that this non-canonical format is raising these issues. The data comes from a 4C library spiked into a 10x run; that's why R1 is so short.

After adjusting --min-read-length to 28, the reads were loaded but unfortunately not mapped, and there is still a segmentation fault. I wonder whether this unconventional run breaks other assumptions in the code. A possible hint is that the progress messages report mapped read pairs, yet the summary shows zero mapped reads.

chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --SAM --min-read-length 28
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 28, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in SAM format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 3.69s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 6.83s.
Mapped 500000 read pairs in 28.69s.
Mapped 164948 read pairs in 10.75s.
Mapped all reads in 40.59s.
Number of reads: 1329896.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 4808026.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

mourisl commented 3 years ago

We have fixed the segmentation fault that occurred when no read can be aligned or the read files are empty. I think it is hard to find seeds for a 28 bp read given the minimizer settings. One possible workaround is to set --min-num-seeds to 1 (the default is 2). Since Chromap should be able to rescue some of those read1s with mate-pair information from the 120 bp read2, it is very strange to still observe 0 mapped reads.
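
Concretely, that would be something along these lines (reusing the files and options from your earlier command):

chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq \
    -o out.sam --SAM --min-read-length 28 --min-num-seeds 1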

The 28 bp read1s all seem to start with the same nucleotides. Is that a barcode/primer, or is it genomic sequence with a coincidentally shared prefix? Thank you.

tzeitim commented 3 years ago

Thanks for the suggestion to change --min-num-seeds to 1. Unfortunately it also segfaults. Out of consideration for your time, perhaps it is better to stop here for now. I will re-sequence the same libraries with a longer read1 to see whether the length is the issue, or rather the fact that it is not really Hi-C.

Since these data are 4C, read1 is always the same: 4C is a "one-vs-all" approach in which one profiles the contacts of a given viewpoint in the genome. The enrichment is mediated by PCR of the locus of interest (R1), and R2 provides the corresponding contact from the proximity ligation step.
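
For illustration, the shared viewpoint prefix can be confirmed with something along these lines (the 18 bp cut is arbitrary):

# tally the most common read1 prefixes; nearly all reads should share the viewpoint primer
sed -n '2~4p' r1.fq | cut -c1-18 | sort | uniq -c | sort -rn | head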

I will update you when I get the new data.

I played a bit with other arguments. Here is a report:

chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --min-read-length 28 --split-alignment --MAPQ-threshold 0 -e 100 --max-num-best-mappings 10 --min-num-seeds 1
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 100, min-num-seeds: 1, max-seed-frequency: 500,1000, max-num-best-mappings: 10, max-insert-size: 1000, MAPQ-threshold: 0, min-read-length: 28, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in pairs format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 3.33s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 8.21s.
Mapped 500000 read pairs in 232.65s.
Mapped 164948 read pairs in 71.12s.
Mapped all reads in 304.79s.
Number of reads: 1329896.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 118718663.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

This is still using the conda build h9a82719_0, since I don't think the version without the segfault bug has been released there yet.
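
For the record, the installed build can be confirmed with:

conda list chromap   # shows the version and build string of the installed package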

yuzhenpeng commented 2 years ago

I met the same problem. How should I deal with it?

haowenz commented 2 years ago

I met the same problem. How should I deal with it?

Can you provide the log output so that we might be able to understand what is happening?

yuzhenpeng commented 2 years ago

Hello, thanks for your response. I got the warning shown in the attached screenshot.

haowenz commented 2 years ago

Hello, thanks for your response. I got the warning shown in the attached screenshot.

Thanks for providing this. Can you also show us the command line you used?

yuzhenpeng commented 2 years ago

This is my command:

chromap --preset hic -x index -r ../../../../guatoujing.contig.purged.fa -1 ../../fastq/hic_R1.fastq -2 ../../fastq/hic_R2.fastq --SAM -o aln.sam -t 20

haowenz commented 2 years ago

This is my command:

chromap --preset hic -x index -r ../../../../guatoujing.contig.purged.fa -1 ../../fastq/hic_R1.fastq -2 ../../fastq/hic_R2.fastq --SAM -o aln.sam -t 20

Sorry I missed this, as it is in an issue started by someone else rather than by you. In the future it would be helpful if you start a separate issue for your problem, so that I receive a notification for your post.

You are generating SAM output from 1.7 billion reads, which is a lot, so most likely Chromap ran out of memory while generating the SAM output. Can you tell us how much memory the machine has? Although Chromap supports SAM output at the moment, it was not initially designed or optimized for it; outputting SAM can currently lead to higher memory usage, which may cause the bad_alloc error you got. We are fixing this and will try to release a new version soon. For now, there are several ways to resolve this: generate output in pairs format, which uses less memory; split the reads into several files, map them separately, and then merge the results; or use a machine with a larger amount of memory if one is available.
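
For the splitting approach, a rough sketch (placeholder paths and chunk size; both mates must be split with the same line count so read pairing is preserved, and the per-chunk alignments are merged here with samtools, which is not part of Chromap):

# split both mates into chunks of 100 million reads (400 million lines) each
split -l 400000000 -d hic_R1.fastq r1_chunk_
split -l 400000000 -d hic_R2.fastq r2_chunk_
# map each chunk pair with the same options, sort, then merge
for i in 00 01 02; do   # adjust to however many chunks split produced
    chromap --preset hic -x index -r guatoujing.contig.purged.fa \
        -1 r1_chunk_$i -2 r2_chunk_$i --SAM -o aln_$i.sam -t 20
    samtools sort -o aln_$i.bam aln_$i.sam
done
samtools merge aln.bam aln_??.bam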

yuzhenpeng commented 2 years ago

Thanks for your suggestion. My machine has 1024 GB of memory. By the way, how do I convert the pairs format to SAM format? And how do I split the reads into several files, map them separately, and then merge the results? Could you give more details?

Thank you. @haowenz

haowenz commented 2 years ago

Thanks for your suggestion. My machine has 1024 GB of memory. By the way, how do I convert the pairs format to SAM format? And how do I split the reads into several files, map them separately, and then merge the results? Could you give more details?

Thank you. @haowenz

Sorry I missed your reply under this old resolved issue. I am about to close it; you may open another one if you still run into problems with Chromap. For your question, you can use runHiC (https://github.com/XiaoTaoWang/HiC_pipeline) developed by Xiaotao. It should help you make the conversion or split the reads.