ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] Genus species requirement #309

Closed rsieskind closed 1 month ago

rsieskind commented 1 month ago

Describe the bug A clear and concise description of what the bug is.

When I try to use pgap to structurally and functionally annotate the new bacteriophage genome that I recently reconstructed, I am forced to provide the -s organism option. When I enter -s "Cobetia marina Bacteriophage 5", I get the error message "Unknown organism Cobetia marina Bacteriophage 5".

To Reproduce If you are having trouble with your genome, please ensure that you can run the pipeline with one of our test genomes first. If your installation works fine with the sample input, please tell us if you are willing and able to share your genome with us, if asked.

The mycoplasmoides-based example from the quickstart guide runs properly. I can share my genome if needed.

Expected behavior A clear and concise description of what you expected to happen.

I wanted the pipeline to detect genes and to search for their potential functions from scratch. Does pgap really need a reference organism to work?

Software versions (please complete the following information):

Log Files Please rerun pgap.py with the --debug flag and attach an archive (e.g. zip or tarball) of the logs in the directory: debug/tmp-outdir/*/*.log.

Carin5_results5_debug.log

Additional context Add any other context about the problem here.

NA

azat-badretdin commented 1 month ago

Thank you, @rsieskind, for your report and for following the proposed format! Truly appreciated!

As for the essence of the error, "Cobetia marina Bacteriophage 5" does not look like a proper prokaryotic species name (it says "phage"). It is not even present in our taxonomy database (I checked using our NCBI gettax application).

Also: I would recommend upgrading your pgap version to the most recent one (you are using a version from last May, and we have had two releases since then), because we support only the latest release.

rsieskind commented 1 month ago

Thank you @azat-badretdin for your rapid answer.

Updating PGAP on the cluster I used will take time, so I installed the latest version locally (2024-04-27.build7426) and launched a run with the inputs I previously sent by mail (command: './pgap.py -r -o genus_results1 genus.yaml > genus_results1.log'). The logs are attached: genus_results1.log cwltool.log. I have Docker version 1.13.1, build 7d71120/1.13.1.

This time, I get the same error with the mycoplasmoides-based quick start example.

Apparently the --platform flag is not supported. I tried to comment out the line calling it in the .py, but an update changes the file back every time.

If we can fix this new bug, we may come back to the previous problem and I have an important question: Is PGAP able to annotate a completely new genus species?

azat-badretdin commented 1 month ago

Thank you, Rémi. The first log reads:

PGAP version 2024-04-27.build7426 is up to date.
Output will be placed in: /home/rsieskin/pgap-master/scripts/C5/Carin5_results1
PGAP failed, docker exited with rc = 125
Unable to find error in log file.

Also, cwltool.log shows that execution apparently does not even reach the CWL workflows.

This brings the focus to what is happening with your container runner. From this

apptainer version 1.3.1

I conclude that, under the guise of docker, you are actually running singularity, which was recently renamed to apptainer. Please try specifying --docker /your/path/to/singularity
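As a rough illustration of the check described above, the sketch below classifies a container runner from the text it prints for `--version`. The function names are hypothetical and not part of pgap.py; the two sample strings are the ones quoted in this thread. It assumes the runner reports its own name at the start of the version line.

```python
import subprocess


def identify_container_runtime(version_output: str) -> str:
    """Guess the runtime from the text printed by `<runner> --version`.

    Assumption: the version line starts with the runtime's own name,
    as in the two strings quoted in this thread.
    """
    text = version_output.strip().lower()
    for name in ("apptainer", "singularity", "docker"):
        if text.startswith(name):
            return name
    return "unknown"


def runtime_behind(command: str = "docker") -> str:
    """Run `<command> --version` and classify whatever answers."""
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=False
    )
    return identify_container_runtime(result.stdout)


# Strings taken from the logs discussed in this ticket.
print(identify_container_runtime("apptainer version 1.3.1"))
print(identify_container_runtime("Docker version 1.13.1, build 7d71120/1.13.1"))
```

If `runtime_behind("docker")` reports something other than docker, the `docker` command on that machine is probably an alias or wrapper for another runner.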

Also, as you are aware, we have simplified user input: instead of a YAML file, the user can now supply two simple parameters, -s for the taxonomy item and -g for path/to/fasta/file. In this particular ticket, that removes the need to post the YAML file here in case something goes wrong in the middle of the actual execution.
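For readers who script their runs, here is a minimal sketch of how such a command line could be assembled. It uses only flags that appear in this thread (-r, -o, -s, -g, --docker, --debug); the helper name, the file name genome.fasta and the output directory genus_results3 are made up for illustration.

```python
import shlex
from typing import List, Optional


def build_pgap_command(
    organism: str,
    genome_fasta: str,
    output_dir: str,
    container_runner: Optional[str] = None,
    debug: bool = False,
) -> List[str]:
    """Assemble a pgap.py invocation as an argument list.

    Only flags mentioned in this ticket are used; -r is simply copied
    from the commands posted above.
    """
    cmd = ["./pgap.py", "-r", "-o", output_dir]
    if container_runner:
        # For example "singularity" or /your/path/to/singularity.
        cmd += ["--docker", container_runner]
    if debug:
        cmd.append("--debug")
    # Simplified input: FASTA path and taxonomy item instead of a YAML file.
    cmd += ["-g", genome_fasta, "-s", organism]
    return cmd


# Hypothetical file and directory names, for illustration only.
command = build_pgap_command(
    "Cobetia marina", "genome.fasta", "genus_results3",
    container_runner="singularity",
)
print(shlex.join(command))
```

Building the command as a list keeps the organism name as a single argument even though it contains a space; `shlex.join` is used only to print a copy that can be pasted into a shell.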

rsieskind commented 1 month ago

Dear Azat,

I installed singularity on my machine and relaunched a job with the command ./pgap.py -r -o genus_results2 --docker singularity genus.yaml > genus_results2.log.

genus_results2.log cwltool.log

It is now working for the mycoplasmoides-based quick start example, but still not for my data. We are thus back to the previous problem.

I understand that the -s flag makes pgap.py easier to use, but my problem remains: my genome is brand new and has no closely related organism. So, which genus should I give so that the annotation starts?

cristoyerenahs commented 1 month ago

Hello. Describe the bug: my bug is as follows.

I want to use pgap for structural annotation. I am working with the genus Actinomyces. I am forced to provide the -s organism option. When I enter only -s "Actinomyces", I get the error message "Fall to complete".

The documentation says it is possible to give only the genus, but that does not work. To complete the job, you need both the genus and the species.

Could you help me?

azat-badretdin commented 1 month ago

User @cristoyerenahs could you please open a separate ticket? Thanks!

azat-badretdin commented 1 month ago

"So, which genus should I give so that the annotation starts?"

@rsieskind

You seem to be trying to annotate a bacteriophage, no?

koneill54 commented 1 month ago

PGAP (Prokaryotic Genome Annotation Pipeline) is designed and optimized to annotate bacteria and archaea, not viruses or phages, which is why your bacteriophage does not exist in the organism database. Prophages that are incorporated into a bacterial genome (like the well-studied prophages of Salmonella) are annotated using the bacterial genus species designation. If you choose, you can use Cobetia marina as the organism, but be aware that the results may be questionable.
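A small pre-flight check along these lines could catch the problem before a run is launched. This is only a crude string heuristic with hypothetical helper names; it does not consult the taxonomy database, and it assumes the host name occupies the first two words of the label.

```python
# Substrings that suggest the label names a virus rather than a prokaryote.
VIRAL_HINTS = ("phage", "virus")


def looks_like_phage_or_virus(organism: str) -> bool:
    """Return True if the label contains an obviously viral word."""
    lowered = organism.lower()
    return any(hint in lowered for hint in VIRAL_HINTS)


def host_binomial(organism: str) -> str:
    """Return the first two words, assumed to be the host genus and species."""
    return " ".join(organism.split()[:2])


name = "Cobetia marina Bacteriophage 5"
if looks_like_phage_or_virus(name):
    print(f'"{name}" looks viral; the host designation would be "{host_binomial(name)}"')
```

Passing the host designation only gets the run started; as noted above, the annotation of a phage genome produced this way may be questionable.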