ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de
Other
142 stars 24 forks source link

Broken pipe with samtools sort and multiple selects #92

Closed unode closed 5 years ago

unode commented 5 years ago

Current master dc769d7 fails with:

[Mon 29-10-2018 02:23:33]: Script OK. Starting interpretation...
[Mon 29-10-2018 02:23:33] Line 6: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [interpretIO]: input = samfile("input.bam")
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [assignment]: samfile("input.bam")
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [samfile]: NGOString "input.bam"
[Mon 29-10-2018 02:23:33] Line 8: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 8: Interpreting [interpretIO]: input = select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 8: Interpreting [assignment]: select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 9: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 9: Interpreting [interpretIO]: input = select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 9: Interpreting [assignment]: select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 11: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [interpretIO]: input = samtools_sort(Lookup 'input' as NGLMappedReadSet; __output_bam=True; by={name})
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [assignment]: samtools_sort(Lookup 'input' as NGLMappedReadSet; __output_bam=True; by={name})
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [executing module function: 'samtools_sort']: NGOMappedReadSet {nglgroupName = "input.bam", nglSamFile = <STREAM>, nglReference = Nothing}
[Mon 29-10-2018 02:23:33] Line 11: Created & opened temporary file /tmp/sorted_selected_selected_input.23787-0.bam
[Mon 29-10-2018 02:23:33] Line 11: Calling binary /home/u/system/apps/ngless/bin/../share/ngless/bin/ngless-0.9.1-samtools with args: sort -n -@ 1 -O bam -T /tmp/samtools_sort_temp.tmp23787/samruntmp
[Mon 29-10-2018 02:23:33] Line 11: Starting samtools view of input.bam
Exiting after internal error. If you can reproduce this issue, please run your script with the --trace flag and report a bug at http://github.com/ngless-toolkit/ngless/issues
fd:13: hPutBuf: resource vanished (Broken pipe)

when using:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

input = samfile("input.bam")

input = select(input, keep_if=[{mapped}])
input = select(input, keep_if=[{mapped}])

input = samtools_sort(input, by={name})
write(input, ofile='namesorted.bam')

and a sufficiently large input.bam file.

I tried reproducing this with the .bam files we have in the repository but none was big enough to trigger the broken pipe.

I managed to reproduce it locally with a 13MB bam file generated with:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

samples = readlines(ARGV[3])
sample = lock1([ARGV[1]])
input = load_mocat_sample(ARGV[2] + '/' + sample)

mapped = map(input, fafile=ARGV[4], mode_all=True)
mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=97, action={unmatch})

write(mapped, ofile='outputs/' + sample + '.test.bam')
unode commented 5 years ago

While trying to workaround the problem above I ran into a different issue but the error above may be related to the same problem.

Using:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

samples = readlines(ARGV[3])
sample = lock1([ARGV[1]])
input = load_mocat_sample(ARGV[2] + '/' + sample)

mapped = map(input, fafile=ARGV[4], mode_all=True)
mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=97, action={unmatch})

mapped_non_unique = select(mapped, keep_if=[{mapped}])

# workaround
write(mapped_non_unique, ofile='outputs/' + sample + '.test.bam')
mapped_non_unique = samfile('outputs/' + sample + '.test.bam')
# workaround

mapped_non_unique = samtools_sort(mapped_non_unique, by={name})
write(mapped_non_unique, ofile='outputs/' + sample + '.namesorted.bam')

collect(qcstats({fastq}), ofile='outputs/fqstats.txt', current=sample, allneeded=samples)
collect(qcstats({mapping}), ofile='outputs/mapstats.txt', current=sample, allneeded=samples)

I got:

...
[Mon 29-10-2018 03:03:45] Line 10: Finished mapping to db.fna
[Mon 29-10-2018 03:03:45] Line 10: Finished mapping to db.fna
[Mon 29-10-2018 03:03:45] Line 10: Total reads: 95047
[Mon 29-10-2018 03:03:45] Line 10: Total reads aligned: 54160 [56.98%]
[Mon 29-10-2018 03:03:45] Line 10: Total reads Unique map: 38849 [40.87%]
[Mon 29-10-2018 03:03:45] Line 10: Total reads Non-Unique map: 15311 [16.11%]
[Mon 29-10-2018 03:03:45] Line 11: Running garbage collection.
[Mon 29-10-2018 03:03:45] Line 11: Interpreting [interpretIO]: mapped = select(Lookup 'mapped' as NGLMappedReadSet)using {Block {blockVariable = [Variable "mr"], blockBody = Sequence [mr = (Lookup 'mr' as NGLMappedRead).MethodName {unwrapMethodName = "filter"}( Nothing; min_match_size=45; min_identity_pc=97; action={unmatch} )]}}
[Mon 29-10-2018 03:03:45] Line 11: Interpreting [assignment]: select(Lookup 'mapped' as NGLMappedReadSet)using {Block {blockVariable = [Variable "mr"], blockBody = Sequence [mr = (Lookup 'mr' as NGLMappedRead).MethodName {unwrapMethodName = "filter"}( Nothing; min_match_size=45; min_identity_pc=97; action={unmatch} )]}}
[Mon 29-10-2018 03:03:45] Line 11: Executing blocked select on file /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:45] Line 11: Created & opened temporary file /tmp/23095980/block_selected_mapped_db27976-1.sam
[Mon 29-10-2018 03:03:46] Line 14: Running garbage collection.
[Mon 29-10-2018 03:03:46] Line 14: Removing temporary file: /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:46] Line 14: Interpreting [interpretIO]: mapped_non_unique = select(Lookup 'mapped' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 03:03:46] Line 14: Interpreting [assignment]: select(Lookup 'mapped' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 03:03:46] Line 17: Running garbage collection.
[Mon 29-10-2018 03:03:46] Line 17: Removing temporary file: /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:46] Line 17: Interpreting [interpretIO]: write(Lookup 'mapped_non_unique' as NGLMappedReadSet; __hash="96241298d14e84a92cabbadc8bc7d00a"; ofile="outputs/"BOpAddLookup 'sample' as NGLStringBOpAdd".test.bam")
[Mon 29-10-2018 03:03:46] Line 17: Interpreting [write]: NGOMappedReadSet {nglgroupName = "data/sample", nglSamFile = <STREAM>, nglReference = Nothing}
[Mon 29-10-2018 03:03:46] Line 17: Created & opened temporary file /tmp/23095980/selected_block_selected_mapped_db27976-1streamed_.27976-2.sam
[Mon 29-10-2018 03:03:46] Line 17: Finished mapping to select.lno14
[Mon 29-10-2018 03:03:46] Line 17: Total reads: 35647
[Mon 29-10-2018 03:03:46] Line 17: Total reads aligned: 35647 [100.00%]
[Mon 29-10-2018 03:03:46] Line 17: Total reads Unique map: 27792 [77.96%]
[Mon 29-10-2018 03:03:46] Line 17: Total reads Non-Unique map: 7855 [22.04%]
Exiting after fatal error while loading and running script
System Error
Failure on converting sam to bam1

[Mon 29-10-2018 03:03:46] Line 17: Created & opened temporary file /tmp/23095980/converted_selected_block_selected_mapped_db27976-1streamed_27976-3.bam
[Mon 29-10-2018 03:03:46] Line 17: SAM->BAM Conversion start ('/tmp/23095980/selected_block_selected_mapped_db27976-1streamed_.27976-2.sam' -> '/tmp/23095980/converted_selected_block_selected_mapped_db27976-1streamed_27976-3.bam')
[Mon 29-10-2018 03:03:46] Line 17: Message from samtools: [E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 43804
[main_samview] truncated file.
luispedro commented 5 years ago

It's actually not related to large BAM files at all, it's just a corner case that doesn't happen that frequently (something like 1 out of 50,000 sequences)

unode commented 5 years ago

In the case of the broken pipe above, is there any way that ngless can still capture and display the error? It seems like the actual message is lost and the only diagnostic is the broken pipe.

Thanks for the fix.