ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] When two input genomes have same basename, the first genome is mistakenly used for the second run. #273

Closed pvstodghill closed 3 weeks ago

pvstodghill commented 9 months ago

Describe the bug

When two input genomes have the same basename, the genome from the first run is mistakenly used for the second run. The cause is that on the first run the input genome (e.g., genome.fasta) is copied to the PGAP working directory, but on the second run the second input genome is not copied, so the stale copy left by the first run is reused.

A related bug (untested): the use of fixed file names (input.yaml, submol.yaml) and user-supplied basenames for intermediate files in the PGAP working directory precludes concurrent execution of PGAP.

Another related bug (also untested, and now I'm just being silly): What happens if the path for my genome is "path/pgap.py"? :-)

To Reproduce

$ head -n1 a_in/genome.fasta b_in/genome.fasta
==> a_in/genome.fasta <==
>a
==> b_in/genome.fasta <==
>b
$ ./pgap.py -o a_out -g a_in/genome.fasta -s 'Genus species'
$ head -n1 genome.fasta a_out/annot.fna
==> genome.fasta <==
>a
==> a_out/annot.fna <==
>lcl|a
$ ./pgap.py -o b_out -g b_in/genome.fasta -s 'Genus species'
$ head -n1 genome.fasta b_out/annot.fna
==> genome.fasta <==
>a
==> b_out/annot.fna <==
>lcl|a
$ rm -f genome.fasta input.yaml submol.yaml
$ ./pgap.py -o b_out -g b_in/genome.fasta -s 'Genus species'
$ head -n1 genome.fasta b_out/annot.fna
==> genome.fasta <==
>b
==> b_out/annot.fna <==
>lcl|b

Expected behavior

The second input genome should be used as input to the second PGAP run. This might be achieved by deleting the working files (genome.fasta, input.yaml, submol.yaml) from the PGAP working directory before each run, or by mktemp'ing a new directory within the PGAP working directory to contain the working files.
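To make the mktemp idea concrete, here is a minimal sketch in Python. It is not PGAP's actual code: the function name stage_input and the "pgap_run_" prefix are hypothetical, and it only illustrates that staging each input into its own freshly created directory keeps two genomes with the same basename from colliding.

```python
import shutil
import tempfile
from pathlib import Path


def stage_input(genome_path, work_root="."):
    """Copy genome_path into a fresh, uniquely named directory under work_root.

    Hypothetical helper, not part of pgap.py. Because every call gets its own
    directory, a leftover genome.fasta from an earlier run is never picked up.
    """
    # mkdtemp creates a new directory with a unique name on every call.
    run_dir = Path(tempfile.mkdtemp(prefix="pgap_run_", dir=work_root))
    # The basename is preserved, but it now lives in a per-run directory.
    staged = run_dir / Path(genome_path).name
    shutil.copyfile(genome_path, staged)
    return staged
```

Staging a_in/genome.fasta and b_in/genome.fasta this way yields two distinct paths that both end in genome.fasta, each holding its own sequence.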

azat-badretdin commented 9 months ago

Thank you for your report, Paul! That seems like a bug to me. We will look at it promptly.

azat-badretdin commented 9 months ago

> What happens if the path for my genome is "path/pgap.py"? :-)

I love how your mind works, Paul! :-) We definitely need this attitude in our testing.

george-coulouris commented 9 months ago

Hey Paul! We have an open internal ticket to address the concurrency implications of pgap's use of fixed filenames. In the meantime, you can work around this by creating a temp dir and cd'ing to it before invoking pgap.
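For anyone scripting that workaround, a rough sketch of a wrapper follows. The helper name run_in_fresh_dir and the "pgap_cwd_" prefix are made up for illustration; only the -o, -g, and -s flags come from the report above. The paths in the usage comment are placeholders.

```python
import subprocess
import tempfile


def run_in_fresh_dir(cmd):
    """Run cmd with a newly created temporary directory as its working directory.

    Hypothetical helper: it mimics "mkdir a temp dir and cd into it" without
    changing the caller's own working directory.
    """
    workdir = tempfile.mkdtemp(prefix="pgap_cwd_")
    # check=True raises CalledProcessError if the command exits non-zero.
    result = subprocess.run(cmd, cwd=workdir, check=True)
    return workdir, result.returncode


# Example call (placeholder paths, so it is left commented out):
# run_in_fresh_dir(["/abs/path/to/pgap.py", "-o", "/abs/path/to/b_out",
#                   "-g", "/abs/path/to/b_in/genome.fasta", "-s", "Genus species"])
```

Use absolute paths for the script, the input genome, and the output directory, since relative paths would be resolved against the temporary directory rather than the directory you started in.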

Say hi to Dave L. for me.

azat-badretdin commented 3 weeks ago

This has been fixed in our code and the fix will be available in the next release.