jhhung / PEAT

An ultra fast and accurate paired-end adapter trimmer that needs no a priori adapter sequences.

segmentation fault on some SRA samples #33

Open wwood opened 7 years ago

wwood commented 7 years ago

Hi again.. I noticed that the master branch of PEAT (currently 2bb4a509) segfaults on two samples I've tried. Here's an excerpt from a GNU parallel run:

/bin/bash: line 1: 129774 Segmentation fault      /gnu/store/bjjz2gc0w6xj9hily8v5x68af8w75lma-peat-1.2.4-1.2bb4a509/bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output_1 SRR5240636_1.fastq.gz --output_2 SRR5240636_2.fastq.gz --out_gzip -n 24 &>logs/SRR5240636.1.log

This doesn't appear to relate to --out_gzip or -n because removing these flags gives the same error:

$ /gnu/store/bjjz2gc0w6xj9hily8v5x68af8w75lma-peat-1.2.4-1.2bb4a509/bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output_1 /tmp/SRR5240636_1.fastq.gz --output_2 /tmp/SRR5240636_2.fastq.gz
gz decompress call
gz decompress call
Segmentation fault

Any ideas? Ta.

wwood commented 7 years ago

I should also note that this error seems quite rare: more than 95% of the samples I've tried PEAT on go through without issue.

karta9812137 commented 7 years ago

I tried this data and it was OK, with no errors. First I ran fastq-dump on SRR5240636.sra, then split the fastq into pairs, then ran the command line. Maybe you can send me your data and I will try it.

(https://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP099/SRP099122/SRR5240636/)

wwood commented 7 years ago

Thanks for the quick response. I re-downloaded the data and got the same error. My suspicion is some difference in the way we are running the binary. Above, I was using a binary built from a Guix recipe. Unfortunately, running the pre-built binary gives me this error:

~/bioinfo/PEAT/bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output_1 /tmp/SRR5240636_1.fastq --output_2 /tmp/SRR5240636_2.fastq
/srv/whitlam/home/users/uqbwoodc/bioinfo/PEAT/bin/PEAT: error while loading shared libraries: libboost_filesystem.so.1.61.0: cannot open shared object file: No such file or directory

I could attempt to update boost (outside of Guix the libraries on the machine are becoming dated), but first can you confirm that you are getting the same md5's of the files so we can rule that out please?

> md5sum fastq_flat/SRR5240636_1.fastq.gz
4cecaba1d0b6ef04ae41788789d03a0c  fastq_flat/SRR5240636_1.fastq.gz
> md5sum fastq_flat/SRR5240636_2.fastq.gz
112eacad6e1144fda956894b40022e4a  fastq_flat/SRR5240636_2.fastq.gz

Thanks. Both of these files were generated by this kind of method, using the newest sra-tools (2.8.2-1):

prefetch -X 500G --transport fasp SRR5240636
fastq-dump --gzip --split-3 SRR5240636.sra

I also tried running on uncompressed input data, same deal.

/gnu/store/bjjz2gc0w6xj9hily8v5x68af8w75lma-peat-1.2.4-1.2bb4a509/bin/PEAT paired -1 /tmp/SRR5240636_1.fastq -2 /tmp/SRR5240636_2.fastq --output_1 /tmp/SRR5240636_1.peat.fastq --output_2 /tmp/SRR5240636_2.peat.fastq
Segmentation fault

The md5sums of the uncompressed files:

$ md5sum /tmp/SRR5240636_1.fastq /tmp/SRR5240636_2.fastq
fd426dbac252006e6257d7e2b4f8dda5  /tmp/SRR5240636_1.fastq
4e215ddaaa4987856009c5acea0dbe1a  /tmp/SRR5240636_2.fastq

tseemann commented 7 years ago

@wwood do you mean that ~5% of samples fail when run through PEAT ?

wwood commented 7 years ago

@tseemann: yes, of the 50 somewhat randomly chosen metagenomes, 2 segfault. These two are from the same study though, IIRC. Are you seeing anything different?

jhhung commented 7 years ago

I think PEAT does not expect read 1 and read 2 to have very different lengths, or many reads that are very short before trimming. That is not what we typically see in our experience, though we have not done metagenomics. We are looking into this issue and should be able to fix it soon.
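To check whether a given pair of files hits those conditions, one could scan both mates and flag pairs that are very short or very different in length before trimming. A minimal sketch (the helper, its name, and the thresholds are hypothetical, not part of PEAT):

```python
def read_lengths(fastq_lines):
    """Yield sequence lengths from FASTQ text given as an iterable of lines."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each 4-line record
            yield len(line.strip())

def suspicious_pairs(r1_lines, r2_lines, min_len=30, max_ratio=2.0):
    """Return 0-based indices of pairs that are very short or very uneven in length."""
    flagged = []
    pairs = zip(read_lengths(r1_lines), read_lengths(r2_lines))
    for idx, (a, b) in enumerate(pairs):
        shorter, longer = min(a, b), max(a, b)
        if shorter < min_len or longer > max_ratio * max(shorter, 1):
            flagged.append(idx)
    return flagged
```

Running something like this over the two failing samples would confirm whether they contain the uneven pairs described above.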

karta9812137 commented 7 years ago

How do you split the fastq? Our md5sums are different:

md5sum                            file_name              file_size
34e47f70ffbc2e31596e245fa7aad1c3  SRR5240636.sra         638244687
77cfc03a0475147a3b2d6679cb379baf  SRR5240636.fastq       3651033992
9c8e1f5f12896778f9f0824e495417ee  SRR5240636_1.fastq     2181672952
2c681c00998b26cf43a7acc18a25539e  SRR5240636_2.fastq     2192218408
a61396b4b1bb5bed157cdef2f8ebbce0  SRR5240636_1.fastq.gz  499328004
1c6a05e83c2f0286b6bb0c0442d1e92c  SRR5240636_2.fastq.gz  520014509

I only split the sequence & quality lines, like this:

head SRR5240636.fastq
@SRR5240636.1 1 length=246
CGGTATCGACGCTCAGGCGATAGCGGCCAGGTGCGGGCACGTCGAAGCGCAACGCAGTGTCTTTCTCCAGCGTCTGCTTGCGGTATGCCGCCCGCGGCTGCGACCGTGCGGCGCCGGCGGTAGCACCGCCGGCGCCGCACGGTCGCAGCCGCTGGCGGCATACCGCCAGCAGACGCTGGAGAAAGACACTGCGTTGCGCTTCGCCGTTCCCGCACCTGGCCGCTATCGCCTGATCGTCGATACCGA
+SRR5240636.1 1 length=246
BBBBBBFF/F</BFBBBFF/F/7FBFFFFBFFFFFFF//<BFFF/<FFFB/7B/B/BFFFFFFFFFFF//FFFF/<FFF/B77B<BFFF/FF/77/7BF<B//7BFFFFFFFFFFBFFFF<B/B/<BBBFFFFBFFFFFFFFFFFF/<FFFFFFBFFFF/B7FFBF/7<B/B7FFBBBB/B///7/<7BFFBBFFFFBFFFFF/FBBFFFFFF/7BFFFBFBBBFFFBFFFBFFFBBFF/7/BB//

head SRR5240636_1.fastq
@SRR5240636.1 1 length=246
CGGTATCGACGCTCAGGCGATAGCGGCCAGGTGCGGGCACGTCGAAGCGCAACGCAGTGTCTTTCTCCAGCGTCTGCTTGCGGTATGCCGCCCGCGGCTGCGACCGTGCGGCGCCGGCGGTAG
+SRR5240636.1 1 length=246
BBBBBBFF/F</BFBBBFF/F/7FBFFFFBFFFFFFF//<BFFF/<FFFB/7B/B/BFFFFFFFFFFF//FFFF/<FFF/B77B<BFFF/FF/77/7BF<B//7BFFFFFFFFFFBFFFF<B/

head SRR5240636_2.fastq
@SRR5240636.1 1 length=246
CACCGCCGGCGCCGCACGGTCGCAGCCGCTGGCGGCATACCGCCAGCAGACGCTGGAGAAAGACACTGCGTTGCGCTTCGCCGTTCCCGCACCTGGCCGCTATCGCCTGATCGTCGATACCGA
+SRR5240636.1 1 length=246
B/<BBBFFFFBFFFFFFFFFFFF/<FFFFFFBFFFF/B7FFBF/7<B/B7FFBBBB/B///7/<7BFFBBFFFFBFFFFF/FBBFFFFFF/7BFFFBFBBBFFFBFFFBFFFBBFF/7/BB//

Maybe you can give me your data and let me try.

wwood commented 7 years ago

I split using fastq-dump:

fastq-dump --gzip --split-3 SRR5240636.sra

This output 2 files, SRR5240636_1.fastq.gz and SRR5240636_2.fastq.gz. If you like, I can send them directly - is there an email address I can send a link to? You can reach me on email via http://ecogenomic.org/personnel/dr-ben-woodcroft if you like.

karta9812137 commented 7 years ago

I ran your command and got the same error.

The fastq-dump split rule is confusing, like this:

line 38:
head -38 SRR5240636_1.fastq | tail -n1
CGGTTCAGCAGGAATGCCGA
head -38 SRR5240636_2.fastq | tail -n1
CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA

line 9871218:
head -9871218 SRR5240636_1.fastq | tail -n1
ATCGGAAGAGCGGTTCAGAGGAATGCGAGATCGGAAAGCGGTTCAGCAGGAATGCCGAGACCGTGCTGCAATCTCGTATGCCGTCTTCTGCTTG
head -9871218 SRR5240636_2.fastq | tail -n1
CGCATTCCTCTGAACCGCTCTTC

Other people have also had this question. We will fix our tool. Thanks for your feedback. http://seqanswers.com/forums/showthread.php?t=25489

wwood commented 7 years ago

Hmm, interesting. How are you extracting and splitting the fastq file?

karta9812137 commented 7 years ago

If you use Linux, you can use awk:

awk '{ if (NR % 2 == 0) print substr( $0 , 1 , length($0)/2 ); else print $0 ;}' SRR5240636.fastq > SRR5240636_1.fastq

awk '{ if (NR % 2 == 0) print substr( $0 , (length($0)/2 )+1, length($0) ); else print $0 ;}' SRR5240636.fastq > SRR5240636_2.fastq
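The two awk one-liners cut every even-numbered line (the sequence and quality lines of each 4-line FASTQ record) in half and write each half to the corresponding output, while headers go to both files unchanged. A rough Python equivalent, assuming every record is exactly 4 lines (a sketch for illustration, not part of any tool):

```python
def split_concatenated_fastq(lines):
    """Split a FASTQ whose seq/qual lines hold R1 and R2 concatenated.

    Mirrors the awk one-liners: every even-numbered line (sequence and
    quality) is cut in half; odd-numbered lines (headers) are duplicated.
    """
    r1, r2 = [], []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if (i + 1) % 2 == 0:        # awk's NR % 2 == 0: seq and qual lines
            half = len(line) // 2
            r1.append(line[:half])
            r2.append(line[half:])
        else:                       # header lines go to both outputs
            r1.append(line)
            r2.append(line)
    return r1, r2
```

Note that this halving assumes R1 and R2 were simply concatenated with equal lengths, which is not what fastq-dump --split-3 produced for this sample.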

HeXY0515 commented 5 years ago

I also encountered this problem, and I found it is caused by the parameters --output_1 and --output_2. You may try running with:

./bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output SRR5240636 --out_gzip -n 24 &>logs/SRR5240636.1.log

It will generate SRR5240636_paired1.fq.gz, SRR5240636_paired2.fq.gz, and SRR5240636_report.txt.

tseemann commented 5 years ago

So outputting separate R1 and R2 files is broken, but outputting a single interleaved/paired file works?

This should help @jhhung narrow the bug down. Nice!

To deinterleave:

seqtk seq -1 paired.fq.gz | pigz > R1.fq.gz
seqtk seq -2 paired.fq.gz | pigz > R2.fq.gz
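Where seqtk is unavailable, the same deinterleave could be sketched in plain Python, assuming the paired output strictly alternates R1 and R2 records (a hypothetical helper, not validated against PEAT's actual output format):

```python
def deinterleave_lines(lines):
    """Split interleaved FASTQ lines (R1/R2 records alternating) into two lists."""
    r1, r2 = [], []
    out = (r1, r2)
    which = 0                    # 0 -> next record is R1, 1 -> R2
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:     # one full 4-line FASTQ record collected
            out[which].extend(record)
            record = []
            which ^= 1           # alternate between the two outputs
    return r1, r2
```

For gzipped files one would wrap this with gzip.open(path, "rt") for reading and write each list back out line by line.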

PhoebeWangintw commented 5 years ago

@HeXY0515 @wwood Thank you all for the feedback and the bug reports. We're currently working on a newer version of PEAT, which fixes this bug. It'll be released soon.