Open wwood opened 7 years ago
I should also note that this error seems quite rare - >95% of samples I've tried PEAT on go through without issue.
I tried this data and it ran fine, without any error. First I ran fastq-dump on SRR5240636.sra, then split the fastq into pairs, then ran the command line. Maybe you can send me your data so I can try it.
Thanks for the quick response. I re-downloaded the data and got the same error. My suspicion is that we are running the binary in different ways. Above, I was using one built from a Guix recipe. Unfortunately, running the pre-built binary gives me this error:
~/bioinfo/PEAT/bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output_1 /tmp/SRR5240636_1.fastq --output_2 /tmp/SRR5240636_2.fastq
/srv/whitlam/home/users/uqbwoodc/bioinfo/PEAT/bin/PEAT: error while loading shared libraries: libboost_filesystem.so.1.61.0: cannot open shared object file: No such file or directory
I could attempt to update boost (outside of Guix the libraries on the machine are becoming dated), but first can you confirm that you are getting the same md5's of the files so we can rule that out please?
> md5sum fastq_flat/SRR5240636_1.fastq.gz
4cecaba1d0b6ef04ae41788789d03a0c fastq_flat/SRR5240636_1.fastq.gz
> md5sum fastq_flat/SRR5240636_2.fastq.gz
112eacad6e1144fda956894b40022e4a fastq_flat/SRR5240636_2.fastq.gz
Thanks. Both of these files were generated by this kind of method, using the newest sra-tools (2.8.2-1):
prefetch -X 500G --transport fasp SRR5240636
fastq-dump --gzip --split-3 SRR5240636.sra
I also tried running on uncompressed input data, same deal.
/gnu/store/bjjz2gc0w6xj9hily8v5x68af8w75lma-peat-1.2.4-1.2bb4a509/bin/PEAT paired -1 /tmp/SRR5240636_1.fastq -2 /tmp/SRR5240636_2.fastq --output_1 /tmp/SRR5240636_1.peat.fastq --output_2 /tmp/SRR5240636_2.peat.fastq
Segmentation fault
The md5sums of the uncompressed files:
$ md5sum /tmp/SRR5240636_1.fastq /tmp/SRR5240636_2.fastq
fd426dbac252006e6257d7e2b4f8dda5 /tmp/SRR5240636_1.fastq
4e215ddaaa4987856009c5acea0dbe1a /tmp/SRR5240636_2.fastq
@wwood do you mean that ~5% of samples fail when run through PEAT ?
@tseemann: yes, of the 50 somewhat randomly chosen metagenomes, 2 segfault. These two are from the same study though, IIRC. Are you seeing different, unexpected results?
I think PEAT does not expect p1 and p2 to have very different lengths, or many reads to be very short before trimming. Neither is what we typically see in our experience, although we have not worked with metagenomics. We are looking into this issue and should be able to fix it soon.
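One way to check whether a sample has the properties described above (mates of very different lengths, or many very short reads) is to compare the two files record by record. This is only a minimal sketch, not part of PEAT: the function names and the `short` cutoff are hypothetical, and it assumes plain 4-line FASTQ records in the same order in both files.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_lengths(path):
    """Yield the sequence length of each record (assumes 4-line records)."""
    with open_fastq(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of every record is the sequence
                yield len(line.rstrip("\r\n"))


def summarize_pairs(r1_path, r2_path, short=30):
    """Compare mate lengths pair by pair.

    `short` is a hypothetical cutoff chosen for illustration only.
    zip() stops at the shorter file, so unequal record counts are
    not detected here.
    """
    stats = {"pairs": 0, "unequal": 0, "short": 0, "max_diff": 0}
    for len1, len2 in zip(read_lengths(r1_path), read_lengths(r2_path)):
        stats["pairs"] += 1
        if len1 != len2:
            stats["unequal"] += 1
        if min(len1, len2) < short:
            stats["short"] += 1
        stats["max_diff"] = max(stats["max_diff"], abs(len1 - len2))
    return stats
```

Calling summarize_pairs on the two SRR5240636 files and on a sample that runs cleanly would show whether the failing sample really differs in these respects.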
How do you split the fastq? Our md5sums are different.
md5sum                            file_name              file_size
34e47f70ffbc2e31596e245fa7aad1c3  SRR5240636.sra         638244687
77cfc03a0475147a3b2d6679cb379baf  SRR5240636.fastq       3651033992
9c8e1f5f12896778f9f0824e495417ee  SRR5240636_1.fastq     2181672952
2c681c00998b26cf43a7acc18a25539e  SRR5240636_2.fastq     2192218408
a61396b4b1bb5bed157cdef2f8ebbce0  SRR5240636_1.fastq.gz  499328004
1c6a05e83c2f0286b6bb0c0442d1e92c  SRR5240636_2.fastq.gz  520014509
I only split seq & quality, like this:
head SRR5240636.fastq
@SRR5240636.1 1 length=246
CGGTATCGACGCTCAGGCGATAGCGGCCAGGTGCGGGCACGTCGAAGCGCAACGCAGTGTCTTTCTCCAGCGTCTGCTTGCGGTATGCCGCCCGCGGCTGCGACCGTGCGGCGCCGGCGGTAGCACCGCCGGCGCCGCACGGTCGCAGCCGCTGGCGGCATACCGCCAGCAGACGCTGGAGAAAGACACTGCGTTGCGCTTCGCCGTTCCCGCACCTGGCCGCTATCGCCTGATCGTCGATACCGA
+SRR5240636.1 1 length=246
BBBBBBFF/F</BFBBBFF/F/7FBFFFFBFFFFFFF//<BFFF/<FFFB/7B/B/BFFFFFFFFFFF//FFFF/<FFF/B77B<BFFF/FF/77/7BF<B//7BFFFFFFFFFFBFFFF<B/B/<BBBFFFFBFFFFFFFFFFFF/<FFFFFFBFFFF/B7FFBF/7<B/B7FFBBBB/B///7/<7BFFBBFFFFBFFFFF/FBBFFFFFF/7BFFFBFBBBFFFBFFFBFFFBBFF/7/BB//

head SRR5240636_1.fastq
@SRR5240636.1 1 length=246
CGGTATCGACGCTCAGGCGATAGCGGCCAGGTGCGGGCACGTCGAAGCGCAACGCAGTGTCTTTCTCCAGCGTCTGCTTGCGGTATGCCGCCCGCGGCTGCGACCGTGCGGCGCCGGCGGTAG
+SRR5240636.1 1 length=246
BBBBBBFF/F</BFBBBFF/F/7FBFFFFBFFFFFFF//<BFFF/<FFFB/7B/B/BFFFFFFFFFFF//FFFF/<FFF/B77B<BFFF/FF/77/7BF<B//7BFFFFFFFFFFBFFFF<B/

head SRR5240636_2.fastq
@SRR5240636.1 1 length=246
CACCGCCGGCGCCGCACGGTCGCAGCCGCTGGCGGCATACCGCCAGCAGACGCTGGAGAAAGACACTGCGTTGCGCTTCGCCGTTCCCGCACCTGGCCGCTATCGCCTGATCGTCGATACCGA
+SRR5240636.1 1 length=246
B/<BBBFFFFBFFFFFFFFFFFF/<FFFFFFBFFFF/B7FFBF/7<B/B7FFBBBB/B///7/<7BFFBBFFFFBFFFFF/FBBFFFFFF/7BFFFBFBBBFFFBFFFBFFFBBFF/7/BB//
Maybe you can give me your data so I can try it.
I split using fastq-dump..
fastq-dump --gzip --split-3 SRR5240636.sra
This outputs 2 files, SRR5240636_1.fastq.gz and SRR5240636_2.fastq.gz. If you like, I can send them directly - is there an email address I can send a link to? You can reach me on email via http://ecogenomic.org/personnel/dr-ben-woodcroft if you like.
I ran your command and got the same error.
The fastq-dump split rule is confusing. For example:

line 38
head -38 SRR5240636_1.fastq | tail -n1
CGGTTCAGCAGGAATGCCGA
head -38 SRR5240636_2.fastq | tail -n1
CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA

line 9871218
head -9871218 SRR5240636_1.fastq | tail -n1
ATCGGAAGAGCGGTTCAGAGGAATGCGAGATCGGAAAGCGGTTCAGCAGGAATGCCGAGACCGTGCTGCAATCTCGTATGCCGTCTTCTGCTTG
head -9871218 SRR5240636_2.fastq | tail -n1
CGCATTCCTCTGAACCGCTCTTC

Other people have had the same question: http://seqanswers.com/forums/showthread.php?t=25489
We will fix our tool. Thanks for your feedback.
Hmm, interesting. How are you extracting and splitting the fastq file?
If you use Linux, you can use "awk".
awk '{ if (NR % 2 == 0) print substr( $0 , 1 , length($0)/2 ); else print $0 ;}' SRR5240636.fastq > SRR5240636_1.fastq
awk '{ if (NR % 2 == 0) print substr( $0 , (length($0)/2 )+1, length($0) ); else print $0 ;}' SRR5240636.fastq > SRR5240636_2.fastq
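For readers less familiar with awk, here is a rough Python equivalent of the two commands above. It is a sketch only: the function names are made up for illustration, it assumes uncompressed 4-line FASTQ records, and for odd-length reads it puts the extra base in the second mate, which may not match what awk does. The halving always produces mates of equal length (123 + 123 for the 246-base record shown earlier), whereas the fastq-dump --split-3 files quoted in this thread contain mates of different lengths, which may be why the md5sums do not match.

```python
def halve_record(record):
    """Split one (header, seq, plus, qual) record into two half-length mates."""
    header, seq, plus, qual = record
    mid = len(seq) // 2
    mate1 = (header, seq[:mid], plus, qual[:mid])
    mate2 = (header, seq[mid:], plus, qual[mid:])
    return mate1, mate2


def halve_fastq(in_path, out1_path, out2_path):
    """Write the first halves to out1_path and the second halves to out2_path.

    Returns the number of records processed.
    """
    count = 0
    with open(in_path) as src, \
            open(out1_path, "w") as out1, \
            open(out2_path, "w") as out2:
        while True:
            record = [src.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # empty header line means end of file
                break
            mate1, mate2 = halve_record(record)
            out1.write("\n".join(mate1) + "\n")
            out2.write("\n".join(mate2) + "\n")
            count += 1
    return count
```

Like the awk version, this leaves the header and "+" lines unchanged, so both output files keep the original length=246 annotation.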
I also encountered this problem and found that it is caused by the parameters --output_1 and --output_2. You may try running with:
./bin/PEAT paired -1 ../fastq/SRR5240636_1.fastq.gz -2 ../fastq/SRR5240636_2.fastq.gz --output SRR5240636 --out_gzip -n 24 &>logs/SRR5240636.1.log
It will generate SRR5240636_paired2.fq.gz and SRR5240636_paired2.fq.gz and SRR5240636_report.txt
So outputting separate R1 and R2 is broken, but outputting a single interleaved/paired works?
This should help @jhhung narrow the bug down. Nice!
To deinterleave:
seqtk seq -1 paired.fq.gz | pigz > R1.fq.gz
seqtk seq -2 paired.fq.gz | pigz > R2.fq.gz
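If seqtk and pigz are not installed, the same deinterleaving can be sketched in Python. This is an illustration rather than a tested replacement: the function name is hypothetical, and it assumes the input really is an interleaved, gzip-compressed FASTQ with 4-line records (whether PEAT's --output mode writes such a file is not confirmed in this thread).

```python
import gzip


def deinterleave(paired_path, r1_path, r2_path):
    """Send odd-numbered records to r1_path and even-numbered ones to r2_path.

    Returns a (r1_records, r2_records) tuple so the caller can check
    that both mates were written the same number of times.
    """
    counts = [0, 0]
    with gzip.open(paired_path, "rt") as src, \
            gzip.open(r1_path, "wt") as r1, \
            gzip.open(r2_path, "wt") as r2:
        outputs = (r1, r2)
        for index, line in enumerate(src):
            mate = (index // 4) % 2  # record 0 -> R1, record 1 -> R2, ...
            outputs[mate].write(line)
            if index % 4 == 0:  # count each record once, at its header
                counts[mate] += 1
    return tuple(counts)
```

If the two returned counts differ, the input was probably not a cleanly interleaved file.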
@HeXY0515 @wwood Thank you all for the feedback and the bug reports. We're currently working on a newer version of PEAT, which fixes this bug. It'll be released soon.
Hi again.. I noticed that the master branch of PEAT (currently 2bb4a509) segfaults on two samples I've tried. Here's an excerpt from a GNU parallel run:
This doesn't appear to relate to --out_gzip or -n because removing these flags gives the same error:
Any ideas? Ta.