ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] PGAP doesn't pull latest version of docker image (ncbi/pgap-utils:2021-05-19.build5429) #152

Closed rmzelle closed 3 years ago

rmzelle commented 3 years ago

Describe the bug

I'd like to use the latest Docker image, 2021-05-19.build5429, that was posted yesterday (https://hub.docker.com/r/ncbi/pgap-utils/tags?page=1&ordering=last_updated), but when I run PGAP it doesn't seem to pull down this image.

To Reproduce

$ ./pgap.py -V
PGAP version 2021-01-11.build5132 is up to date.
$ ./pgap.py -u
PGAP version 2021-01-11.build5132 is up to date.

Look in cwltool.log after running ./pgap.py -r -o mg37_results test_genomes/MG37/input.yaml --taxcheck-only:

--- Start Runtime Report ---
{
    ...
    "Docker image": "ncbi/pgap:2021-01-11.build5132",
    ...
}
--- End Runtime Report ---
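
For anyone who wants to check this without opening the log by hand, here is a minimal sketch that pulls the "Docker image" value out of the runtime report. It assumes the text between the two marker lines is a single valid JSON object, as the excerpt above suggests; the function name `docker_image_from_log` is made up for illustration and is not part of pgap.py.

```python
import json
import re

# Marker lines as they appear in the cwltool.log excerpt above.
START = "--- Start Runtime Report ---"
END = "--- End Runtime Report ---"


def docker_image_from_log(log_text):
    """Return the "Docker image" value from the runtime report, or None."""
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    match = re.search(pattern, log_text, flags=re.DOTALL)
    if match is None:
        return None
    # Assumption: the text between the markers is one valid JSON object.
    report = json.loads(match.group(1))
    return report.get("Docker image")


sample = """--- Start Runtime Report ---
{
    "Docker image": "ncbi/pgap:2021-01-11.build5132"
}
--- End Runtime Report ---
"""
print(docker_image_from_log(sample))
```

To use it on a real run, pass the contents of cwltool.log to the function instead of `sample`.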

Expected behavior

I expect pgap.py to download and use the new "2021-05-19.build5429" Docker image.


azat-badretdin commented 3 years ago

> I'd like to use the latest docker image 2021-05-19.build5429 that was posted yesterday (https://hub.docker.com/r/ncbi/pgap-utils/tags?page=1&ordering=last_updated)

Right. Alas, this is a part of the release process where we got a bit ahead of ourselves: the image was posted, but the version has not yet been officially released. That can happen any hour now. Let's wait until the release is complete so that we don't muddy your problem with additional release issues.

azat-badretdin commented 3 years ago

We just released it now. See https://github.com/ncbi/pgap/releases.

Could you please try again?

azat-badretdin commented 3 years ago

Just checked from a fresh directory:


[badrazat@ip-172-16-120-227 May-2021-release]$ curl -OL https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   130    0   130    0     0   1547      0 --:--:-- --:--:-- --:--:--  1547
100 35972  100 35972    0     0   210k      0 --:--:-- --:--:-- --:--:--  210k
[badrazat@ip-172-16-120-227 May-2021-release]$ chmod +x pgap.py
[badrazat@ip-172-16-120-227 May-2021-release]$ ./pgap.py -V
The latest version of PGAP is 2021-05-19.build5429, you have nothing installed locally.

rmzelle commented 3 years ago

Ah, my bad. The update is indeed working now. I had a brief look at pgap.py before and it looked at first glance like it would always try to pull the most recent version of the Docker image.
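
As an aside, the version strings in this thread all have the form `YYYY-MM-DD.buildNNNN`, so deciding whether a local version is behind the latest one can be sketched as below. This only illustrates comparing two such tags; it is not taken from pgap.py, and the helper names `parse_version` and `is_outdated` are hypothetical.

```python
import re
from datetime import date

# Tag format as seen in this thread, e.g. "2021-05-19.build5429".
VERSION_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.build(\d+)$")


def parse_version(tag):
    """Split a version tag into a sortable (release date, build number) pair."""
    m = VERSION_RE.match(tag)
    if m is None:
        raise ValueError(f"unexpected version tag: {tag!r}")
    year, month, day, build = (int(g) for g in m.groups())
    return (date(year, month, day), build)


def is_outdated(local, latest):
    """True if the local tag sorts strictly before the latest tag."""
    return parse_version(local) < parse_version(latest)


print(is_outdated("2021-01-11.build5132", "2021-05-19.build5429"))  # True
```

Comparing parsed tuples rather than raw strings means the build number is compared numerically, which matters once build numbers differ in length.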

Is PGAP on a regular release cycle, by the way? 2021-01-11.build5132 didn't seem to contain any genomes for Enterococcus lactis (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=357441), and I have been F5-ing the PGAP release page for a few weeks to see if there was an update available 😄. 2021-05-19.build5429 now has two genomes for this (tentative) species (GCA_015904215.1 and GCA_015751065.1, although surprisingly not the representative genome GCA_015767715.1).

There are a lot of recent changes in bacterial taxonomy, and in the availability of reference genomes, so these Docker images can rather quickly get out of date when used for taxonomic assignment. That said, awesome tool, much appreciated.

azat-badretdin commented 3 years ago

> 2021-05-19.build5429 now has two genomes for this (tentative) species (GCA_015904215.1 and GCA_015751065.1, although surprisingly not the representative genome GCA_015767715.1).

Could you please specify the particular resource you are focusing on?

rmzelle commented 3 years ago

> Could you please specify the particular resource you are focusing on?

Sorry, I'm not sure what type of resource you're referring to here. Can you please clarify?

Below is the output of a PGAP taxonomy check of a particular input genome sequence. It shows that 2021-01-11.build5132 doesn't seem to have any E. lactis reference genomes, whereas 2021-05-19.build5429 now has two:

2021-01-11.build5132 (ani-tax-report.txt):

ANI report for assembly: my_gc_assm_name
Submitted organism: Mycoplasma genitalium (taxid = 2097, rank = species, lineage = Bacteria; Tenericutes; Mollicutes; Mycoplasmataceae; Mycoplasma)
Predicted organism: Enterococcus faecium (taxid = 1352, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus)
Submitted organism has type: Yes
Status: INCONCLUSIVE
Confidence: LOW
95.014 (82.4 86.6) 6994788 assembly Enterococcus faecium (GCA_900447735.1, 43941_G01)
94.906 (79.5 89.7) 2855408 assembly Enterococcus faecium NBRC 100486 (GCA_001544255.1, ASM154425v1)
...

2021-05-19.build5429:

ANI report for assembly: my_gc_assm_name
Submitted organism: Enterococcus faecium (taxid = 1352, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus)
Predicted organism: Enterococcus lactis (taxid = 357441, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus)
Submitted organism has type: Yes
Status: MISASSIGNED
Confidence: HIGH
98.199 (85.3 82.0) 23730198 assembly 1799085 Enterococcus lactis (GCA_015904215.1, ASM1590421v1)
97.994 (83.4 90.8) 23550258 assembly 75249 Enterococcus lactis (GCA_015751065.1, ASM1575106v1)
95.014 (82.4 86.6) 6994788 assembly 567864 Enterococcus faecium (GCA_900447735.1, 43941_G01)
94.906 (79.5 89.7) 2855408 assembly 3982 Enterococcus faecium NBRC 100486 (GCA_001544255.1, ASM154425v1)
...
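
To compare two such reports without reading them side by side, a rough sketch like the following could extract the status, confidence, predicted organism and hit accessions. The regular expressions are inferred only from the two excerpts above (including the extra numeric column that appears after "assembly" in the newer report), so they are assumptions about the ani-tax-report.txt format rather than a documented specification. The first number on each hit line is taken to be the ANI percentage, and the function name `parse_ani_report` is hypothetical.

```python
import re

# One hit line, e.g.
# "98.199 (85.3 82.0) 23730198 assembly 1799085 Enterococcus lactis (GCA_015904215.1, ASM1590421v1)"
# The numeric column right after "assembly" is optional (absent in the older report).
HIT_RE = re.compile(
    r"(\d+\.\d+) \((\d+\.\d+) (\d+\.\d+)\) (\d+) assembly (?:\d+ )?"
    r"(.+?) \((GC[AF]_\d+\.\d+), ([^)]+)\)"
)


def parse_ani_report(text):
    """Extract a few fields from ANI report text (format assumed from the excerpts)."""
    status = re.search(r"Status:\s*(\w+)", text)
    confidence = re.search(r"Confidence:\s*(\w+)", text)
    predicted = re.search(r"Predicted organism:\s*(.+?) \(taxid = (\d+)", text)
    hits = [
        {"ani": float(m.group(1)), "organism": m.group(5), "accession": m.group(6)}
        for m in HIT_RE.finditer(text)
    ]
    return {
        "status": status.group(1) if status else None,
        "confidence": confidence.group(1) if confidence else None,
        "predicted": predicted.group(1) if predicted else None,
        "predicted_taxid": int(predicted.group(2)) if predicted else None,
        "hits": hits,
    }


sample = (
    "Predicted organism: Enterococcus lactis (taxid = 357441, rank = species, "
    "lineage = Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus)\n"
    "Status: MISASSIGNED\n"
    "Confidence: HIGH\n"
    "98.199 (85.3 82.0) 23730198 assembly 1799085 Enterococcus lactis (GCA_015904215.1, ASM1590421v1)\n"
    "95.014 (82.4 86.6) 6994788 assembly Enterococcus faecium (GCA_900447735.1, 43941_G01)\n"
)
report = parse_ani_report(sample)
print(report["status"], report["predicted"], [h["accession"] for h in report["hits"]])
```

Because the patterns do not rely on line breaks, the same function should also work on report text that has been pasted as a single line.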

When I visit https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/enterococcus%20lactis and filter by "RefSeq category" = "representative", the only search result is GCA_015767715.1.

azat-badretdin commented 3 years ago

Thanks, that's helpful!

thibaudnis commented 3 years ago

The taxonomy check is done with the Average Nucleotide Identity (ANI) tool and uses the type material assemblies that are in GenBank (see our wiki page, or this publication for more details). The species representative assembly set is selected independently of type material assemblies. Representative assemblies are chosen based on their quality (contiguity, number of genes, number of pseudogenes, size, etc.). You can read more about representatives here. In the case of Enterococcus lactis, GCA_015904215.1 and GCA_015751065.1 are type material assemblies, but GCA_015767715.1 was chosen as the representative because its chromosome is fully assembled, whereas those of GCA_015904215.1 and GCA_015751065.1 are not. I hope this explanation helps!

rmzelle commented 3 years ago

> The taxonomy check is done with the Average Nucleotide Identity (ANI) tool and uses the type material assemblies that are in GenBank

Ah, thanks for the clarification.