gt1 / biobambam2

Tools for early stage alignment file processing

bammerge: Accept wildcard regexp when parsing I= input #46

Open mmokrejs opened 7 years ago

mmokrejs commented 7 years ago

It is confusing that bammerge wants to write to STDOUT. I think it is a very bad idea because the shell redirect will buffer the data and, for example, if the target disk fills up during writing, the buffer will keep growing until the kernel runs out of memory.

Please introduce O= flag.

Further, it would be nice if, say, I=dir/file_prefix.[0-9].bam were possible.

The error message seems funny.

$ bammerge level=9 index=1 I=ee_16AUT1C3/HM2YTCCXX.?.ee_16AUT1C3.bwa.sorted.bam IL=ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted. indexfilename=ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.bam.bai > ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.bam
PosixFdInput(ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.,0): No such file or directory

/usr/lib64/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x5f)[0x7fffceaed4df]
bammerge(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x4128c0]
/usr/lib64/libmaus2.so.2(libmaus2::aio::PosixFdInput::PosixFdInput(std::string const&, int)+0x19d)[0x7fffcead629d]
/usr/lib64/libmaus2.so.2(libmaus2::aio::PosixFdInputStreamFactory::constructUnique(std::string const&)+0x272)[0x7fffcead6ce2]
bammerge(libmaus2::aio::InputStreamFactoryContainer::constructUnique(std::string const&)+0x51)[0x4197d1]
bammerge()[0x40fdd1]
bammerge()[0x40ce06]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7fffcd8ba280]
bammerge()[0x40d4da]

gt1 commented 7 years ago

What would make you think writing to stdout will buffer in case of a full filesystem? Anyway, consider a call like

strace -f bash -c "src/bammerge in.bam >out.bam"

which produces something like

...
[pid  8763] open("out.bam", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid  8763] dup2(3, 1)                  = 1
[pid  8763] close(3)                    = 0
[pid  8763] execve("src/bammerge", ["src/bammerge", "/home/tischler/data/bam/HG00096."...], [/* 78 vars */]) = 0
...

out.bam is opened as a regular file by the shell and the file descriptor is passed to bammerge as stdout. There is no difference in behaviour compared with bammerge opening the file itself (except that libmaus2 does not do any extra buffering for the file).
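
The behaviour described above can be checked from the shell. The following is a minimal sketch, assuming Linux with GNU coreutils: it relies on /proc/self/fd and /dev/full, neither of which is portable, and it does not involve bammerge at all. It first shows that a redirected child process has its stdout pointing directly at the target file, then that a writer whose stdout is a device that always reports "no space left" fails with an error rather than buffering.

```shell
#!/bin/bash
# Assumption: Linux. /proc/self/fd and /dev/full do not exist everywhere.
out=$(mktemp)

# The shell opens "$out" and hands it to the child as fd 1.
# readlink then reports where its own stdout leads, writing the answer into that file.
readlink /proc/self/fd/1 > "$out"
stdout_target=$(cat "$out")
echo "stdout of the child pointed at: $stdout_target"

# /dev/full rejects every write with ENOSPC, imitating a full filesystem.
# The writer sees the error directly and exits with a non-zero status.
full_status=0
head -c 1048576 /dev/zero > /dev/full 2>/dev/null || full_status=$?
echo "writer exit status on full device: $full_status"

rm -f "$out"
```

A non-zero status in the second step means the write error reached the program, which is the same thing that would happen if the program had opened the output file itself.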

As for wildcards, this is already possible if you drop the I=, i.e. run

bammerge dir/file_prefix.[0-9].bam

and your shell will expand this. The bammerge man page states that each non key=value argument will be considered as an input file name.
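
Why the bare pattern works while the I= form does not can be illustrated without bammerge. This is a small sketch assuming bash with default globbing options (nullglob and failglob off); the file names and the count_args helper are hypothetical stand-ins. The shell matches the whole word against the filesystem, so a pattern glued to I= finds nothing and is passed through literally, which is consistent with the "No such file or directory" error in the report.

```shell
#!/bin/bash
# Hypothetical file names, created in a scratch directory for illustration.
dir=$(mktemp -d)
touch "$dir/file_prefix.1.bam" "$dir/file_prefix.2.bam" "$dir/file_prefix.10.bam"

# count_args stands in for any program; it prints how many arguments it received.
count_args() { echo "$#"; }

# Bare pattern: the shell replaces it with the two matching single-digit files
# (file_prefix.10.bam does not match, because [0-9] is exactly one character).
bare=$(count_args "$dir"/file_prefix.[0-9].bam)

# Prefixed pattern: the whole word "I=<dir>/..." is treated as a path pattern,
# nothing matches, and bash passes it through unchanged as a single argument.
prefixed=$(count_args I="$dir"/file_prefix.[0-9].bam)

echo "bare pattern: $bare arguments"
echo "I= pattern:   $prefixed argument"

rm -r "$dir"
```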

The O key should be working in the next release (not released yet).

mmokrejs commented 7 years ago

> What would make you think writing to stdout will buffer in case of a full filesystem?

I somehow kept this in my mind; I think I experienced issues like this in the past (with some other tools). Maybe there was a pipe in between, like src/bammerge in.bam | foo >out.bam? There are tricks with pipefail, and back then, "years" ago, I didn't know about pipefail.

#!/bin/bash
# pipefail is a bash option; a plain POSIX sh (e.g. dash) may not support it
set -o pipefail
# 'set -o pipefail' turns it on
# 'set +o pipefail' turns it off
... # some code
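
To make the effect of pipefail concrete, here is a minimal sketch assuming bash; failing_writer is a hypothetical stand-in for a producer that emits some output and then fails, not a model of bammerge. Without pipefail the pipeline reports the status of the last command only, so the producer's failure goes unnoticed.

```shell
#!/bin/bash
# Hypothetical producer: writes something, then reports failure.
failing_writer() { echo "partial output"; return 1; }

# Default behaviour: the pipeline status is that of the last command (cat).
set +o pipefail
without=0
failing_writer | cat > /dev/null || without=$?

# With pipefail: the pipeline status is the last non-zero status of any stage.
set -o pipefail
with=0
failing_writer | cat > /dev/null || with=$?

echo "pipefail off: $without, pipefail on: $with"
```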

I do not have a concrete example right now of a shell redirect buffering the output. But weren't there also ways to increase the size of the buffer used for a shell redirect or pipe?

Anyway, I am happy that O= will be available, thanks.

       I=<[stdin]>: input filename, standard input if unset.

       O=<[stdout]>: output filename, standard output if unset.

At the least, I would suggest improving the I= doc string to emphasize that multiple input files can be given, as in bammerge dir/file_prefix.[0-9].bam. Even better would be an EXAMPLES section in the man page.