ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] PGAP fills up /tmp #132

Closed MrTomRod closed 3 years ago

MrTomRod commented 3 years ago

We have encountered a problem with NCBI PGAP (2021-01-11.build5132).

After annotating about 20 bacterial genomes, our temporary directory (/tmp) filled up with about 30 GB of folders with names like 5kod846l, i.e. 8 random characters.

Until /tmp is full, the pipeline works well.

Running PGAP on the test genome (./pgap.py -r --no-internet -o mg37_results test_genomes/MG37/input.yaml) leads to 327 MB of temporary files.
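To put a number on the leftovers, a small script can list the suspicious directories and total their size. This is only a sketch: the name pattern (exactly 8 characters from lowercase letters, digits, and underscores) is an assumption read off the example names in this report, and `leftover_dirs` and `dir_size_bytes` are hypothetical helper names, not part of PGAP or cwltool.

```python
import os
import re
from pathlib import Path

# Assumption: leftover directories have names of exactly 8 characters made of
# lowercase letters, digits, and underscores (e.g. 5kod846l, 10f3q_77).
NAME_PATTERN = re.compile(r"^[a-z0-9_]{8}$")


def dir_size_bytes(path):
    """Sum the sizes of all non-directory entries below path, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # entry vanished or is not accessible; ignore it
    return total


def leftover_dirs(tmp_root="/tmp"):
    """Return a list of (path, size_in_bytes) for directories matching NAME_PATTERN."""
    found = []
    for entry in sorted(Path(tmp_root).iterdir()):
        if entry.is_dir() and not entry.is_symlink() and NAME_PATTERN.match(entry.name):
            found.append((entry, dir_size_bytes(entry)))
    return found
```

Calling `leftover_dirs()` before and after a run and comparing the results shows how much a single run leaves behind. Other programs may also create 8-character directories in /tmp, so the output is a list of candidates, not proof of origin.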

Expected behavior: PGAP removes its temporary files after exiting.

Software versions

Example /tmp content:

# after running ~ 20 genomes
$ ls /tmp | wc -l
4446
# after running test (mg37_results)
$ du -h /tmp
327M    /tmp

$ ls /tmp | wc -l
236

$ ls *
0fa6zpex:
cmsearch.asn
md5checksums.txt
ncbiapp.log
ncbiapp.perf

0nb5zlxw:
jobsaa.xml

10f3q_77:
cat.out

1fgzhhu_:
contam_in_prok_blastdb_dir
ncbiapp.log
ncbiapp.perf
output

1jb6ud5t:
blast.1.asn
ncbiapp.log
ncbiapp.perf

1jc9tcuc:
GMS2.mod
GeneMark_hmm_combined.mod
alignments.mft
annotation.mft
fasta-orig.fna
fasta.fna
fasta.fna.lst
genemark-input.gff
genemark-non-input.gff
log
ncbiapp.log
ncbiapp.perf
preliminary-models.asn
sequences.mft

1kew95pi:
ncbiapp.log
ncbiapp.perf
output
sequence_cache

1w21za3_:
ncbiapp.log
ncbiapp.perf
var_proc_annot_details.xml
var_proc_annot_stats.xml

1wx_qnkh:

1y48gl8w:
ncbiapp.log
ncbiapp.perf
sequences.text.asn

1zzhmyez:
cache
compartments.asn
ncbiapp.log
ncbiapp.perf
sequence_cache

24g11jnx:
A.mft
B.mft
ncbiapp.log
ncbiapp.perf
result.lst

MrTomRod commented 3 years ago

No, I ran exactly the command above (./pgap.py -r --no-internet -o mg37_results test_genomes/MG37/input.yaml). I did not edit the pgap.py script.

azat-badretdin commented 3 years ago

Right. I realized that after posting my comment and deleted it before seeing yours :-)

I am afraid this might be one of the peculiarities of cwltool behavior. If we do not pass storage info for tmp files to the cwltool call (as we do in --debug mode), it creates them in /tmp/ inside the docker container, which is mapped from either $TMPDIR or /tmp/ on the host. It is not clear right now why cwltool does not delete these directories upon completing execution in your case.

I would suggest the following banal workarounds:

MrTomRod commented 3 years ago

Haha, that's fine.

I think I will edit pgap.py as follows:

That should do the trick, right?

And it makes it easy to connect the input to the docker container name as well as the temp files.

I think that would be a slight improvement over the current script. Would you like to have it?

azat-badretdin commented 3 years ago

Thanks, that's a good idea.

That would work. We probably do not want to bother users with additional parameters just for the sake of deleting it. We can generate them internally and then delete them.
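The "generate internally, then delete" idea could look roughly like the sketch below. It is not the actual pgap.py code: `scratch_tmpdir` and `run_with_private_tmp` are hypothetical names, and the sketch relies only on the TMPDIR environment variable being honored by the child process, as discussed in this thread.

```python
import os
import shutil
import subprocess
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_tmpdir(parent=None, prefix="pgap-"):
    """Create a private temporary directory and try to remove it on exit."""
    path = tempfile.mkdtemp(prefix=prefix, dir=parent)
    try:
        yield path
    finally:
        # ignore_errors=True: files written by a container running as another
        # user may not be removable; warn about them instead of raising.
        shutil.rmtree(path, ignore_errors=True)
        if os.path.exists(path):
            print(f"warning: could not fully remove {path}")


def run_with_private_tmp(command, parent=None):
    """Run command with TMPDIR set to a directory that is deleted afterwards."""
    with scratch_tmpdir(parent=parent) as tmp:
        env = dict(os.environ, TMPDIR=tmp)  # copy of the environment with TMPDIR overridden
        return subprocess.run(command, env=env, check=False).returncode
```

The user never sees an extra parameter: the directory is created, handed to the child process through the environment, and cleaned up in a `finally` block even if the run fails. Whether the plain delete succeeds depends on who owns the files inside the directory.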

Again, running pgap.py --debug has its benefits. If it crashes, you do not have to rerun it to generate a report to send here: the files are already there, not deleted.

mdphan commented 3 years ago

I am having the exact same problem. My /tmp fills up very quickly. @MrTomRod could I please have your edited pgap.py? Many thanks.

MrTomRod commented 3 years ago

@mdphan: I ended up with an easier solution that does not require me to change pgap.py. We can simply set the env var TMPDIR. So this is how I run pgap:

export TMPDIR=/tmp/strain-123  # now all PGAP temporary data will end up there
mkdir -p "$TMPDIR"
./pgap.py ...  # run pgap as you would normally
sudo rm -rf "$TMPDIR"  # brute-force removal of the temporary data

Note: if you don't want to run sudo, you can do this to remove the folder at the end:

docker run -itv /tmp:/faketmp alpine:latest rm -rf /faketmp/strain-123

With docker, permissions mean nothing and everyone is admin. :rofl:
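For scripted runs, the two cleanup options above can be combined: try a normal delete first and fall back to the docker trick only if something is left behind. The sketch below is an illustration under assumptions: `docker_cleanup_argv` and `remove_tmp` are hypothetical helpers, the /faketmp mount point and alpine:latest image are taken from the command above, and `--rm` replaces `-it` because the cleanup is non-interactive and the container itself should not linger.

```python
import os
import shutil
import subprocess


def docker_cleanup_argv(tmp_dir, image="alpine:latest"):
    """Build a docker command that deletes tmp_dir from inside a container."""
    tmp_dir = os.path.abspath(tmp_dir)
    parent, name = os.path.split(tmp_dir)
    if not name:
        # abspath("/") has no final component; never mount and wipe a root.
        raise ValueError(f"refusing to remove {tmp_dir!r}")
    # Mount the parent directory at /faketmp and delete only the named child.
    return ["docker", "run", "--rm", "-v", f"{parent}:/faketmp",
            image, "rm", "-rf", f"/faketmp/{name}"]


def remove_tmp(tmp_dir):
    """Try a plain delete first; fall back to docker if files are left behind."""
    shutil.rmtree(tmp_dir, ignore_errors=True)
    if os.path.exists(tmp_dir):
        subprocess.run(docker_cleanup_argv(tmp_dir), check=True)
```

For example, `remove_tmp("/tmp/strain-123")` only invokes docker when the plain delete could not remove everything.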

azat-badretdin commented 3 years ago

Thomas, although we generally agree that we should not be cleaning up after cwltool, your solution seems alright.

Would you like to submit a Pull Request?

Thanks for your contribution, Thomas!

azat-badretdin commented 3 years ago

ended up with an easier solution that does not require me to change pgap.py. We can simply set the env var TMPDIR

Yes, we started honoring the TMPDIR setting earlier, following a request in a different issue.

MrTomRod commented 3 years ago

I ended up not changing pgap.py, so there is nothing to pull, unfortunately.

It's a simple-enough workaround until the cwltool issue is fixed. :)

azat-badretdin commented 3 years ago

Sounds good as well. Thank you Thomas!