ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

PGAP (2022-12-13) permanentFail #257

Closed adriangeerre closed 1 year ago

adriangeerre commented 1 year ago

Describe the bug I am facing the same issue as in #245. I have a pipeline that used to work perfectly (implemented 6-8 months ago), but it now stops the annotation with the error "WARNING Final process status is permanentFail".

To Reproduce I have tested the installation on two different systems, an HPC cluster and a laptop, using Singularity and Docker with the MG37 test input and it would:

Expected behavior I would expect the normal behavior of the test run.

Software versions (please complete the following information):

Log Files The cwltool.log failed with "permanentFail". cwltool.log cwltool.failed_step.log

Additional context The folder ".pgap" is a link to another folder which contains the data required by PGAP.

azat-badretdin commented 1 year ago

The format of your input YAML file is incorrect; see this

line 1: "taxon": unexpected member, should be one of: "strain" "genus_species"  ( at JsonValue.organism)

in cwltool.failed_step.log
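The check behind that message can be approximated before launching a run. The following is a minimal sketch, not PGAP's own validator: it assumes the submol file has a top-level `organism:` mapping with indented child keys, and it takes the allowed key names only from the error message quoted above. The function name `unexpected_organism_keys` is hypothetical.

```python
# Allowed keys are taken from the quoted error message only (assumption:
# the real schema is defined by PGAP, not by this sketch).
ALLOWED_ORGANISM_KEYS = {"strain", "genus_species"}


def unexpected_organism_keys(yaml_text):
    """Return direct child keys of 'organism:' that are not in the allowed set.

    This is a line-based scan for simple block-style YAML, not a YAML parser.
    """
    unexpected = []
    organism_indent = None  # indent of the 'organism:' line while inside it
    child_indent = None     # indent of the first child key of 'organism:'
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing space
        if not line.strip() or ":" not in line:
            continue
        indent = len(line) - len(line.lstrip())
        key = line.strip().split(":", 1)[0].strip().strip("'\"")
        if organism_indent is not None and indent <= organism_indent:
            organism_indent = None  # we have left the organism mapping
            child_indent = None
        if organism_indent is None:
            if key == "organism":
                organism_indent = indent
            continue
        if child_indent is None:
            child_indent = indent
        if indent == child_indent and key not in ALLOWED_ORGANISM_KEYS:
            unexpected.append(key)
    return unexpected
```

Calling it on the text of a `submol.yaml` returns an empty list when only `strain` and `genus_species` appear under `organism:`; any other key name, such as `taxon`, is returned so it can be removed or replaced.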

adriangeerre commented 1 year ago

Sorry, I forgot to add my execution line: python ~/programas/PGAP/pgap.py --debug -r -o mg37_results test_genomes/MG37/input.yaml

I used the MG37 input.yaml from the test genomes, which I downloaded yesterday. input.zip

I also tried the file "test_genomes/GCA_000009765/input.yaml" and got the same result: UnexpectedMember() --- line 1: "taxon": unexpected member, should be one of: "strain" "genus_species"

What can I do? Thanks for the help!

azat-badretdin commented 1 year ago

I also tried the file "test_genomes/GCA_000009765/input.yaml" and got the same result.

See https://github.com/ncbi/pgap/wiki/Input-Files#metadata-yaml-file-submol

adriangeerre commented 1 year ago

I see, and I think I got your point. I thought the input and submol files were ready to use. Instead of adapting those files for MG37, I have switched to the genome and files from a previous successful run (I will call it Bact). I am currently testing this. Thank you again. I hope it works!

azat-badretdin commented 1 year ago

They are ready to use. I am not sure where you got the file with "taxon:"

azat-badretdin commented 1 year ago

It could be old files from previous installations.

adriangeerre commented 1 year ago

I got the link from the installation instructions in the wiki and I ran: wget https://s3.amazonaws.com/pgap-data/test_genomes.tgz

azat-badretdin commented 1 year ago

I just tested this tarball, it does not have any files with the word "taxon" either.

adriangeerre commented 1 year ago

That's weird, I can see the word taxon in the submol of all the test genomes that I just downloaded (again). Here are the steps I just did:

$ wget https://s3.amazonaws.com/pgap-data/test_genomes.tgz
--2023-05-17 22:42:02--  https://s3.amazonaws.com/pgap-data/test_genomes.tgz
Resolving proxy-default (proxy-default)... 10.220.0.1
Connecting to proxy-default (proxy-default)|10.220.0.1|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 19691644 (19M) [binary/octet-stream]
Saving to: ‘test_genomes.tgz’

100%[===========================================>] 19,691,644  13.9MB/s   in 1.4s   

2023-05-17 22:42:04 (13.9 MB/s) - ‘test_genomes.tgz’ saved [19691644/19691644]

$ ls -l
total 19231
-rw-rw-r-- 1 agomez CCRP_Data 19691644 Mar  8  2019 test_genomes.tgz

$ tar -xzf test_genomes.tgz

$ grep -i taxon test_genomes/*/submol.yaml 
test_genomes/GCA_000009765/submol.yaml:    taxon:  227882
test_genomes/GCA_000166555/submol.yaml:    taxon:  913090
test_genomes/GCA_000167475/submol.yaml:    taxon:  307502
test_genomes/GCA_000181555/submol.yaml:    taxon:  445983
test_genomes/GCA_000186345/submol.yaml:    taxon:  575540
test_genomes/GCA_000710235/submol.yaml:    taxon:  623
test_genomes/MG37/submol.yaml:    taxon:  243273
test_genomes/SAMN07633424/submol.yaml:    taxon:  197
test_genomes/SAMN09729021/submol.yaml:    taxon:  283734
test_genomes/SAMN09768125/submol.yaml:    taxon:  630
test_genomes/SAMN09783348/submol.yaml:    taxon:  197
test_genomes/SAMN09828454/submol.yaml:    taxon:  562
test_genomes/SAMN09831750/submol.yaml:    taxon:  1354
test_genomes/SAMN09831988/submol.yaml:    taxon:  623
test_genomes/SAMN09837224/submol.yaml:    taxon:  1639
test_genomes/SAMN09838637/submol.yaml:    taxon:  670
test_genomes/SAMN09839044/submol.yaml:    taxon:  28901
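The same check can be done without unpacking the archive, and the member timestamps help judge how old the contents are. This is a minimal sketch using Python's standard `tarfile` module; the function name `stale_submol_members` and the choice of `taxon:` as the marker of the outdated format are assumptions drawn from this thread.

```python
import io
import tarfile
from datetime import datetime, timezone


def stale_submol_members(archive_bytes, needle=b"taxon:"):
    """List (member name, UTC modification date) for submol.yaml files
    inside a tar archive whose contents include `needle`."""
    hits = []
    # "r:*" lets tarfile detect the compression (gzip for a .tgz) by itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith("submol.yaml")):
                continue
            data = archive.extractfile(member).read()
            if needle in data:
                modified = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
                hits.append((member.name, modified.date().isoformat()))
    return hits
```

For the download above, the bytes would come from reading `test_genomes.tgz` in binary mode, for example `stale_submol_members(open("test_genomes.tgz", "rb").read())`.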

adriangeerre commented 1 year ago

Nonetheless, using the genome that I previously annotated, I was able to make it run inside an HPC using an srun session (it did not finish because I had to cut the live session short). However, when I submit a job to the SLURM queue in the same HPC environment, the job crashes within seconds and reports the message taskset: failed to set pid 0's affinity: Invalid argument (which is not the issue I reported but a step before it). I found that issue #202 already discusses this, and I tend to agree that Singularity behaves oddly, which could be causing multiple strange errors.
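The thread does not establish why `taskset` fails under the batch queue. One possibility, stated here only as an assumption, is that the CPU set requested for the process has no overlap with the CPUs the scheduler allows the job to use, which is a documented reason for `sched_setaffinity` to return "Invalid argument". The sketch below shows how that overlap could be inspected from inside a job; the helper names are hypothetical.

```python
import os


def allowed_cpus():
    """CPUs the current process may run on (falls back to all CPUs
    on platforms without os.sched_getaffinity, such as macOS)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def mask_is_usable(requested, allowed):
    """True if at least one requested CPU is among the allowed CPUs."""
    return bool(set(requested) & set(allowed))
```

Printing `allowed_cpus()` in both the interactive session and the batch job would show whether the two environments grant different CPU sets.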

Thanks for the help, again, and sorry for the chaotic feedback.

azat-badretdin commented 1 year ago

That's weird, I can see the word taxon in the submol of all the test genomes that I just downloaded (again). Here are the steps I just did:

You are right and I was wrong (I made a typo). Indeed, that tarball contains submol examples with "taxon:", which is an outdated format.

That tarball is obsolete and we need to fix our documentation. Meanwhile, the test genomes are part of the installation, which goes to the dedicated PGAP installation directory; see https://github.com/ncbi/pgap/wiki/Quick-Start#quick-start

Install the pipeline. By default it will install in $HOME/.pgap, but this location can be changed by setting an environmental variable PGAP_INPUT_DIR

That's where you will find up-to-date test genomes.
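To locate those files programmatically, the directory rule quoted above can be sketched as follows. The layout of the test genomes inside the installation directory is not stated in this thread, so the sketch searches recursively instead of assuming a subfolder; the function names are hypothetical.

```python
import os
from pathlib import Path


def pgap_install_dir(environ=None):
    """Return PGAP_INPUT_DIR if set, otherwise $HOME/.pgap."""
    env = os.environ if environ is None else environ
    override = env.get("PGAP_INPUT_DIR")
    return Path(override) if override else Path.home() / ".pgap"


def find_submol_files(root):
    """Recursively list submol.yaml files under root (assumption: the
    test genomes keep the submol.yaml file name seen in the tarball)."""
    return sorted(Path(root).rglob("submol.yaml"))
```

`find_submol_files(pgap_install_dir())` then lists the candidate files to compare against the ones from the old tarball.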

Thanks for patiently pushing this issue, @adriangeerre !

azat-badretdin commented 1 year ago

I got the link from the installation instructions in the wiki and I ran: wget https://s3.amazonaws.com/pgap-data/test_genomes.tgz

I am having trouble finding the reference to the tarball in the installation instructions. Could you please post a URL?

adriangeerre commented 1 year ago

I found the link in the installation section of the wiki https://github.com/ncbi/pgap/wiki/Installation. Right at the bottom, in the section Running the pipeline on a test genome, there is a link (our test genome archive). That is the link I used to download the data.

Thanks for the help and the patience @azat-badretdin

azat-badretdin commented 1 year ago

I found the link in the installation section of the wiki https://github.com/ncbi/pgap/wiki/Installation

Thank you! I was looking for the URL and apparently GitHub does not index URLs inside links. :-(

azat-badretdin commented 1 year ago

I fixed the text of the documentation at the URL you pointed to. Please let me know what else we can do for you here.