ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Final process status is permanentFail #177

Closed marieleoz closed 2 years ago

marieleoz commented 2 years ago

Hello,

I'm trying to use PGAP on a cluster as described here: http://bioinfo.genotoul.fr/index.php/how-to-use/?software=How_to_use_SLURM_PGAP

My submission script is based on theirs (I just renamed it as .txt because .sh files can't be attached): pgap-2021-07-01.build5508_MLE.txt

It looks like the run is successful because I get my .faa, .fna, .gbk, .gff and .sqn annotation files, but the cwltool.log file contains lots of warnings and ends with "Final process status is permanentFail" (attached: cwltool.zip).

Can I trust my results and use the files for further analyses? Thanks a lot.

Best, Marie

azat-badretdin commented 2 years ago

Can I trust my results and use the files for further analyses?

At first glance, the immediate answer would be "no" because of the "permanentFail" status. But the presence of final results (flat files, for example) makes me curious about what is going on in your case. Let me see...

azat-badretdin commented 2 years ago

And thank you for your report, Marie!

azat-badretdin commented 2 years ago

When looking at a cwltool.log with the status permanentFail, the most informative message is the FIRST mention of permanentFail, which in your case is:


[2021-12-07 19:34:31] DEBUG [job Final_Bacterial_Package_asndisc_evaluate] initial work dir {}
[2021-12-07 19:34:31] INFO [job Final_Bacterial_Package_asndisc_evaluate] /pgap/output/debug/tmp-outdir/dp8ym6_i$ xml_evaluate \
    -input \
    /pgap/output/debug/tmpdir/_zh81aht/stg21a8a5ce-02c2-40e0-8522-d49d88b5661a/annot.disc \
    -xpath-fail \
    '//*[@severity="FATAL"]' > /pgap/output/debug/tmp-outdir/dp8ym6_i/final_asndisc_diag.xml
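To find that first mention without scrolling through the whole log, something like the following minimal Python sketch could be used. It is only an illustration: the function name `first_permanent_fail`, the `context` parameter and the assumed log location are not part of PGAP or cwltool.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def first_permanent_fail(
    lines: List[str], context: int = 5
) -> Optional[Tuple[int, List[str]]]:
    """Return (1-based line number, preceding context plus the matching line)
    for the first line mentioning permanentFail, or None if there is none."""
    for index, line in enumerate(lines):
        if "permanentFail" in line:
            start = max(0, index - context)
            return index + 1, lines[start:index + 1]
    return None


if __name__ == "__main__":
    # Assumed location; adjust to wherever your run wrote cwltool.log.
    log_path = Path("cwltool.log")
    if log_path.exists():
        log_lines = log_path.read_text(errors="replace").splitlines()
        found = first_permanent_fail(log_lines)
        if found is None:
            print("No permanentFail in", log_path)
        else:
            number, excerpt = found
            print(f"First permanentFail at line {number}:")
            print("\n".join(excerpt))
```

The lines just before the first match usually name the job that failed, which is why the sketch prints some preceding context rather than only the matching line.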

The good news is that, as far as your SLURM environment is concerned, PGAP "worked" in the sense that you got a meaningful result.

The bad news is that the result has a problem diagnosed by our NCBI asndisc tool. The key is the file final_asndisc_diag.xml: it should be either in the /output/ directory or, if not, somewhere in the /debug-extra/ directory. If you do not have this directory, try rerunning pgap.py with the --debug option. If you are running your own customized SLURM script, I am guessing (since we are not responsible for this script) that the call to pgap.py is inside that script.

azat-badretdin commented 2 years ago

The key is the file final_asndisc_diag.xml

Feel free to post that file, or examine the messages under the XML elements with severity="FATAL" to see whether they help you understand what is going on.
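If it is not obvious where the file ended up, a recursive search of the run's output directory is one way to locate it. This is a rough sketch only: `find_diag_files` is a hypothetical helper, and the `output` path is a placeholder for whatever directory your own run wrote to.

```python
from pathlib import Path
from typing import List


def find_diag_files(root: Path, name: str = "final_asndisc_diag.xml") -> List[Path]:
    """Recursively collect every file called `name` below `root`."""
    if not root.is_dir():
        return []
    return sorted(path for path in root.rglob(name) if path.is_file())


if __name__ == "__main__":
    # Hypothetical path; use the directory your run wrote its results to.
    output_dir = Path("output")
    for path in find_diag_files(output_dir):
        print(path)
```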

marieleoz commented 2 years ago

Hi Azat!

Thanks for your answer :) Here's what's in the final_asndisc_diag.xml file:

Failer nodes: <?xml version="1.0" encoding="UTF-8"?>

<?xml version="1.0" encoding="UTF-8"?>

Hopefully it makes more sense to you than to me :) Thanks!

marieleoz commented 2 years ago

Not sure you can actually see what I pasted so here's the content in a .txt file:

final_asndisc_diag.txt

azat-badretdin commented 2 years ago

Thanks, Marie, this is useful:


Failer nodes:
<?xml version="1.0" encoding="UTF-8"?>
<test name="SHOW_HYPOTHETICAL_CDS_HAVING_GENE_NAME" description="Hypothetical CDS with gene names" severity="FATAL" cardinality="1">
    <details message="## hypothetical coding regions have a gene name" severity="FATAL" cardinality="1" unit="hypothetical coding region" autofix="true">
      <object type="feature" file="/pgap/output/debug/tmpdir/28968fcr/stgea061514-d09a-47fc-b556-40252de6b670/annot-wo-checksum.sqn" feature_type="CDS" product="IS66 family insertion sequence hypothetical protein" location="lcl|contig_00092:695-1084" locus_tag="pgaptmp_004933" label="CDS&#9;IS66 family insertion sequence hypothetical protein&#9;lcl|contig_00092:695-1084&#9;pgaptmp_004933"/>
    </details>
  </test>

<?xml version="1.0" encoding="UTF-8"?>
<details message="## hypothetical coding regions have a gene name" severity="FATAL" cardinality="1" unit="hypothetical coding region" autofix="true">
      <object type="feature" file="/pgap/output/debug/tmpdir/28968fcr/stgea061514-d09a-47fc-b556-40252de6b670/annot-wo-checksum.sqn" feature_type="CDS" product="IS66 family insertion sequence hypothetical protein" location="lcl|contig_00092:695-1084" locus_tag="pgaptmp_004933" label="CDS&#9;IS66 family insertion sequence hypothetical protein&#9;lcl|contig_00092:695-1084&#9;pgaptmp_004933"/>
    </details>

It looks to me that we already have enough material to start looking into this.

azat-badretdin commented 2 years ago

But before doing this, I noticed that you are still using the July version of PGAPx. It is very possible that this particular evidence is gone now (we are double-checking it ourselves as well).

Feel free to switch to a newer version in your script and try again. Besides this particular issue, which has a chance of being resolved, there are other improvements that you might want. Also, using the latest version is generally recommended.

marieleoz commented 2 years ago

Thanks Azat, I'll ask for the update, try again and let you know.

azat-badretdin commented 2 years ago

More evidence in support of the update: I just got a response from one of the curators of biological data indicating that this particular insertion sequence family name has been corrected in newer evidence sources.

marieleoz commented 2 years ago

Dear Azat,

Sorry it took me a little while to get this done, but I fear I got something similar with pgap_2021-11-29.build5742. I attach the cwltool.log (zipped) and the final_asndisc_diag.xml file (renamed as .txt), but please let me know if there's anything else that could be helpful: cwltool.zip, final_asndisc_diag.txt

Thanks a lot! Marie

azat-badretdin commented 2 years ago

Thanks, Marie. We will have a look at this.

azat-badretdin commented 2 years ago

Apologies for the long gap, Marie.

The message looks like this:


<?xml version="1.0" encoding="UTF-8"?>
<test name="SHOW_HYPOTHETICAL_CDS_HAVING_GENE_NAME" description="Hypothetical CDS with gene names" severity="FATAL" cardinality="1">
    <details message="## hypothetical coding regions have a gene name" severity="FATAL" cardinality="1" unit="hypothetical coding region" autofix="true">
      <object type="feature" file="/pgap/output/debug/tmpdir/y1xfp__8/stg03104f5c-13f0-48dc-a275-1070d0f8eed2/annot-wo-checksum.sqn" feature_type="CDS" product="IS66 family insertion sequence hypothetical protein" location="lcl|contig_00144:6485-6865" locus_tag="pgaptmp_004276" label="CDS&#9;IS66 family insertion sequence hypothetical protein&#9;lcl|contig_00144:6485-6865&#9;pgaptmp_004276"/>
    </details>
  </test>

This indicates that our extensive validators are failing the output because of the error shown, which concerns the product "IS66 family insertion sequence hypothetical protein".

This issue has been resolved in the new release of PGAPx.

Please feel free to install it and try, Marie.