jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)
Other
120 stars 15 forks source link

report ouput file is overwritten #73

Closed frederic-mahe closed 5 years ago

frederic-mahe commented 5 years ago

Hi, working with atropos v1.1.19, I am trying to append the trimming report to a pre-existing log file:

echo -e "test\n" > tmp.log
cmmd tmp.fastq.gz | \
    atropos trim \
            -g CCAGCASCYGCGGTAATTCC \
            --report-file /dev/stderr \
            --report-formats txt \
            --log-file /dev/null \
            --single-input - \
            --output /dev/null 2>> tmp.log
head tmp.log

returns:

=======
Atropos
=======
...   

where I was expecting:

test

=======
Atropos
=======
...   

Also, I got an error message when I try to write to non-seekable files (process substitutions), maybe the two problems are linked? (atropos seeks to the beginning of output files?)

jdidion commented 5 years ago

Thank you for the error report. Can you give an example of a command where you write to a process substitution?

frederic-mahe commented 5 years ago

Hum, I was not able to get the error message again when writing to a process substitution. I can get one, when reading from a process substitution though:

atropos trim -g "CCAGCASCYGCGGTAATTCC" --single-input <(cat tmp.fastq) --output /dev/null
Traceback (most recent call last):
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/commands/__init__.py", line 217, in execute_cli
    retcode, summary = command.execute(args)
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/commands/__init__.py", line 70, in execute
    retcode, summary = self.run_command(options)
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/commands/__init__.py", line 134, in run_command
    runner = runner_class(options)
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/commands/trim/__init__.py", line 282, in __init__
    super().__init__(options, TrimSummary)
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/commands/base.py", line 205, in __init__
    input_read=options.input_read, alphabet=options.alphabet)
  File "/home/mahe/.local/lib/python3.6/site-packages/atropos/io/seqio.py", line 919, in open_reader
    "colorspace), 'fasta', 'fastq', 'sam', or 'bam').".format(file_format))
atropos.io.seqio.UnknownFileType: File format None is unknown (expected 'sra-fastq' (only for colorspace), 'fasta', 'fastq', 'sam', or 'bam').

That's not a big deal, since this can be replaced with a simple pipe. The clobbering of log files I've reported above is a bit more annoying.

jdidion commented 5 years ago

Ok, this makes sense. If you do not specify the --format option, Atropos tries to guess the format from the file extension. This fails on a process substitution since the actual name is something like like '/dev/fd/63'. So the fix is just to add '--format fastq'.

The log file clobbering is interesting - I'm still trying to figure it out. It doesn't seem like an application should be able to do that, since you're using a unix-level redirect to concatenate the output to a file.

jdidion commented 5 years ago

Sorry, I cannot replicate your issue. When I run the same series of commands, I get tmp.log with "test" as the first line.

What OS are you using?

frederic-mahe commented 5 years ago

Hi, I made a toy example:

echo -e "test\n" > tmp.log
printf "@s\nCCAGCASCYGCGGTAATTCCAA\n+\nIIIIIIIIIIIIIIIIIIIIII\n" | \
    atropos trim \
            -g "CCAGCASCYGCGGTAATTCC" \
            --format fastq \
            --report-file /dev/stderr \
            --report-formats txt \
            --log-file /dev/null \
            --single-input - \
            --output - 2>> tmp.log | \
    atropos trim \
            -g "CCAGCASCYGCGGTAATTCC" \
            --format fastq \
            --report-file /dev/stderr \
            --report-formats txt \
            --log-file /dev/null \
            --single-input - \
            --output /dev/null 2>> tmp.log
head tmp.log

The log file is clobbered by the first atropos command, and then by the second. The final log file only contains the report from the last atropos command. That's puzzling.

I am running on a x86_64 GNU/Linux box (a node in a large cluster), I think the distribution is CentOS, the kernel is 3.10.0-693.21.1.el7.

frederic-mahe commented 5 years ago

Just to make sure, here is a toy example with another command:

echo "pass1" > tmp.log
(>&2 echo "pass2") 2>> tmp.log
(>&2 echo "pass3") 2>> tmp.log
cat tmp.log                                                                           

returns:

pass1
pass2
pass3

as expected.

frederic-mahe commented 5 years ago

Maybe something like this is happening inside atropos?

echo "pass1" > tmp.log
(>&2 echo "pass2" > /dev/stderr) 2>> tmp.log
cat tmp.log 

returns:

pass2

whereas:

echo "pass1" > tmp.log
(>&2 echo "pass2" >> /dev/stderr) 2>> tmp.log
cat tmp.log 

returns:

pass1
pass2
frederic-mahe commented 5 years ago

I've tested the issue on another cluster, with python 3.6.3 instead of 3.6.6 and I obtain the same results (i.e. atropos clobbers).

frederic-mahe commented 5 years ago

hi @jdidion were you able to test my toy-examples?

jdidion commented 5 years ago

Hi frederic, I just tried your toy example and again I cannot reproduce it. The log file I get when I run your example from Oct 17 is attached. tmp.log

The only think I can think is that this is a Mac vs Linux difference, or perhaps a shell difference. I am using bash, not sure what you're using.

frederic-mahe commented 5 years ago

I am using bash too (v4.4.12).

My hypothesis is that there is a nuance between MacOS's and Linux's fopen() function. I've been trying to change atropos's writing parameter: if writing to stdout and stderr were always set to a rather than to w, then the problem would be solved on Linux, without any side effect on MacOS I assume.

I was not able to spot the exact place where to do that change (atropos/io/__init__.py? I tried several changes but without any succes). Would it be possible for you to help me? If I knew where to apply the change, I could run some tests to check wether or not the issue is solved.

jdidion commented 5 years ago

stdout/stderr are not opened explicitly - python's sys.stdout/stderr are used (line 109).

On Nov 23, 2018, at 1:55 PM, Frédéric Mahé notifications@github.com wrote:

I am using bash too (v4.4.12).

My hypothesis is that there is a nuance between MacOS's and Linux's fopen() function. I've been trying to change atropos's writing parameter: if writing to stdout and stderr were always set to a rather than to w, then the problem would be solved on Linux, without any side effect on MacOS I assume.

I was not able to spot the exact place where to do that change (atropos/io/init.py? I tried several changes but without any succes). Would it be possible for you to help me? If I knew where to apply the change, I could run some tests to check wether or not the issue is solved.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

frederic-mahe commented 5 years ago

Hum, after much tinkering I was not able to change the behavior I observe on Linux. I think this is linked to the fact that stdout and stderr streams are block-buffered when not-interactive (see https://docs.python.org/3/library/sys.html#sys.stderr), but using /usr/bin/python -u does not change anything.

I give up for now, and I will use temporary files to store atropos' log results.