ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Error: Final process status is permanentFail #305

Closed sekhwal closed 2 months ago

sekhwal commented 2 months ago

Hi, I am following up on my previous issue #304, which has been closed.

I am already using 'salmonella' in submol.yaml, but I am not able to get any results. When I change genus_species to 'Escherichia coli', pgap keeps running for a very long time without generating the results.

topology: 'circular'
location: 'chromosome'
organism:
    genus_species: 'salmonella'
    strain: 'P1620800_chr'

azat-badretdin commented 2 months ago

1/ Salmonella is not a "species", it's a "genus".
2/ We lost the functionality of supporting the "genus" option in this release, and we are working on restoring it soon.
3/ Case might be important (biologists usually capitalize the genus in binomials, so I am not familiar with this use case).

Please try

genus_species: 'Salmonella enterica'

or other legitimate Salmonella species.

sekhwal commented 2 months ago

Thank you for the information. It works, but when I run it with location: 'plasmid' it generates the same error "Final process status is permanentFail".

Please let me know what change I should make in the submol.yaml file. Here is the current submol.yaml file that I am trying to run for the plasmid genome.

topology: 'circular'
location: 'plasmids'
organism:
    genus_species: 'Salmonella enterica'
    strain: 'P1122481'

azat-badretdin commented 2 months ago

location: 'plasmids'

Should be strictly 'plasmid' or 'chromosome'
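The two submol.yaml constraints raised so far (a full binomial for genus_species, and a location of exactly 'plasmid' or 'chromosome') can be checked before launching a run. The following is a minimal sketch, not part of PGAP: `check_submol` is a hypothetical helper that takes the already-parsed YAML as a plain dict, and the binomial pattern (capitalized genus, lowercase epithet) is an assumption based on the comments above.

```python
import re

# Allowed values as stated in the maintainer's reply above.
ALLOWED_LOCATIONS = {"chromosome", "plasmid"}

# Assumption: a binomial starts with a capitalized genus followed by a
# lowercase species epithet, e.g. 'Salmonella enterica'.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+")


def check_submol(submol):
    """Return a list of problems found in a parsed submol mapping (hypothetical helper)."""
    problems = []

    location = submol.get("location")
    if location not in ALLOWED_LOCATIONS:
        problems.append(
            f"location must be 'plasmid' or 'chromosome', got {location!r}"
        )

    organism = submol.get("organism") or {}
    name = str(organism.get("genus_species") or "")
    if not BINOMIAL.match(name):
        problems.append(
            f"genus_species should be a binomial such as 'Salmonella enterica', got {name!r}"
        )

    return problems


if __name__ == "__main__":
    submol = {
        "topology": "circular",
        "location": "plasmids",
        "organism": {"genus_species": "salmonella", "strain": "P1122481"},
    }
    for problem in check_submol(submol):
        print(problem)
```

An empty list means neither of these two problems was found; it does not guarantee that the run will succeed.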

azat-badretdin commented 2 months ago

You can also try our relatively new way of running pgap.py, described in the quick notes, where all the information is in the FASTA file and the species qualification:

./pgap.py .... -s 'My species' -g My.fasta

In this case you can specify plasmid molecules by appending [location=plasmid] to your FASTA definition lines for corresponding sequences
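Appending the modifier by hand is error-prone for assemblies with many contigs. As a rough sketch, assuming the plasmid SeqIDs are already known, the hypothetical helper `tag_plasmids` below adds ` [location=plasmid]` (with the leading space) to the matching definition lines and leaves everything else untouched.

```python
def tag_plasmids(fasta_text, plasmid_ids):
    """Append [location=plasmid] to the definition lines of the given SeqIDs.

    Hypothetical helper: the SeqID is taken to be everything between '>' and
    the first whitespace character on a definition line.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            # Skip lines that already carry a location modifier.
            if seq_id in plasmid_ids and "[location=" not in line:
                line = f"{line.rstrip()} [location=plasmid]"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    fasta = ">chr1\nACGT\n>contig001 example\nGG\n"
    print(tag_plasmids(fasta, {"contig001"}), end="")
```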

sekhwal commented 2 months ago

I tried the following way:

python3 /scripts/pgap.py -r -o P1122481_results -s 'Salmonella enterica' -g P1122481.fasta

I am using the fasta file with the header

1_length=4998493_depth=1.00xcircular=true[location=chromosome]

But it is still generating the error "Final process status is permanentFail".

sekhwal commented 2 months ago

Alternatively, I used the correct location: 'plasmid' in the submol.yaml, but it is still unable to run.

topology: 'circular'
location: 'plasmid'
organism:
    genus_species: 'Salmonella enterica'
    strain: 'P1122481'

thibaudnis commented 2 months ago

1_length=4998493_depth=1.00xcircular=true[location=chromosome]

Please review https://github.com/ncbi/pgap/wiki/Input-Files#Genome-assembly-sequence-file. There are several characters that are not allowed in this SeqID (the SeqID is everything before the first space). You can try SeqID of 1 and add modifiers:

1 [topology=circular] [location=chromosome]

Length and depth are not supported modifiers according to: https://www.ncbi.nlm.nih.gov/genbank/mods_fastadefline/
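The suggested definition line can be assembled programmatically so that the SeqID and the bracketed modifiers are always separated by spaces. This is only a sketch: `make_defline` is a hypothetical helper, and the allowed character set and 50-character limit in `SEQID_OK` are assumptions that should be confirmed against the wiki page linked above.

```python
import re

# Assumption: SeqIDs may contain only letters, digits, and the characters
# _ . : * # - and are at most 50 characters long. Verify against the wiki.
SEQID_OK = re.compile(r"[A-Za-z0-9_.:*#-]{1,50}")


def make_defline(seq_id, **modifiers):
    """Build a FASTA definition line such as '>1 [topology=circular]'.

    Underscores in keyword names are written as hyphens, so plasmid_name
    becomes the plasmid-name modifier.
    """
    if not SEQID_OK.fullmatch(seq_id):
        raise ValueError(f"SeqID contains unsupported characters: {seq_id!r}")
    parts = [f">{seq_id}"]
    for key, value in modifiers.items():
        parts.append(f"[{key.replace('_', '-')}={value}]")
    return " ".join(parts)


if __name__ == "__main__":
    print(make_defline("1", topology="circular", location="chromosome"))
```

Under these assumptions, the original header `1_length=4998493_depth=1.00xcircular=true[location=chromosome]` is rejected as a SeqID because it contains `=`, `[`, and `]`.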

azat-badretdin commented 2 months ago

But it is still generating the error "Final process status is permanentFail".

Could you please post the resulting cwltool.log file? Thanks!

sekhwal commented 2 months ago

It seems the header line is correct, but it is still showing the error "WARNING Final process status is permanentFail" with the plasmid sequence. However, it works with 'chromosome' even though I did not change anything in the header ">1 length=4998493 depth=1.00x circular=true".

used command

python3 /scripts/pgap.py -r -o P1122481_plasmid input_P1122481_plasmid.yaml

plasmid fasta file header

contig001 [location = plasmid] [plasmid-name = pPSU1122481] [topology=circular]

Here is the .yaml file

fasta:
    class: File
    location: P1122481_plasmid.fasta
submol:
    class: File
    location: P1122481_plasmid1_submol.yaml

cwltool.log

topology: 'circular'
location: 'plasmid'
organism:
    genus_species: 'Salmonella enterica'
    strain: 'pPSU1122481'

sekhwal commented 2 months ago

It seems it does not work with small genomes like plasmids. I used pgap earlier and it worked perfectly without any concern about specific headers or special characters. Should I download an old version and try?

azat-badretdin commented 2 months ago

Try ./pgap.py --ignore-errors ....
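For scripted runs, the switch can be added to the same command used earlier in this thread. The sketch below only builds the argument list; `pgap_command` is a hypothetical helper, the script path is the one from the commands above, and placing `--ignore-errors` before the input YAML is an assumption.

```python
import shlex


def pgap_command(output_dir, input_yaml, ignore_errors=False,
                 script="/scripts/pgap.py"):
    """Return the argument list for a pgap.py run (hypothetical helper).

    Only flags that appear in this thread are used: -r, -o, --ignore-errors.
    """
    args = ["python3", script, "-r", "-o", output_dir]
    if ignore_errors:
        args.append("--ignore-errors")
    args.append(input_yaml)
    return args


if __name__ == "__main__":
    cmd = pgap_command("P1122481_plasmid", "input_P1122481_plasmid.yaml",
                       ignore_errors=True)
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```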

sekhwal commented 2 months ago

It works when I use both chromosome and plasmid in one fasta file. I think the latest pgap version has an issue with small genomes like plasmids.

command

python3 /scripts/pgap.py -r -o P2226300_results input_P2226300.yaml

Thank you for your help!

azat-badretdin commented 2 months ago

It works when I use both chromosome and plasmid in one fasta file.

Because with chromosome, the total size of the genome matches the expectation for this particular species.

It does not reject plasmids per se: you can try replacing the keyword plasmid with chromosome in that small plasmid FASTA file and see for yourself. The result will be the same, because it rejects by size, not by molecule type.

Have you tried inserting --ignore-errors into the list of command line switches?
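Since the rejection is by total size, it can help to compare the summed length of a FASTA file with the size expected for the species before running. The sketch below is illustrative only: `assembly_length` and `size_in_range` are hypothetical helpers, and the expected range is supplied by the user because the thresholds PGAP applies are not stated in this thread.

```python
def assembly_length(fasta_text):
    """Sum the lengths of all sequence lines in FASTA-formatted text."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        # Definition lines start with '>' and do not count toward the size.
        if line and not line.startswith(">"):
            total += len(line)
    return total


def size_in_range(total, expected_min, expected_max):
    """Check a total against a user-supplied expected range (hypothetical limits)."""
    return expected_min <= total <= expected_max


if __name__ == "__main__":
    fasta = ">contig001 [location=plasmid]\nACGTACGT\nACGT\n"
    total = assembly_length(fasta)
    # The range below is a placeholder, not a value taken from PGAP.
    print(total, size_in_range(total, 1_000_000, 10_000_000))
```

A plasmid-only file will fall far below any chromosome-scale range, which matches the behavior described above.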

vappiah commented 2 months ago

@azat-badretdin I have a similar issue. Please find my cwltool.log file attached.

azat-badretdin commented 2 months ago

User @vappiah I am not so sure. It says

'contig001[location=chromosome]' is not a valid local ID (m_Pos = 1)

which most likely means that you omitted the crucial space delimiter separating the seq-id from the rest of the FASTA definition line.

It's a different error from the same ballpark "things that users do in FASTA definition line"
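A missing space of this kind is easy to spot automatically, because the bracket ends up inside the first token of the definition line. A minimal sketch, using a hypothetical helper `find_unspaced_modifiers` that reports the line number and offending token:

```python
def find_unspaced_modifiers(fasta_text):
    """Return (line_number, seq_id) pairs for headers whose first token contains '['.

    Hypothetical helper: a '[' inside the first whitespace-delimited token
    suggests the space between the SeqID and the modifiers is missing.
    """
    hits = []
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.startswith(">"):
            continue
        fields = line[1:].split(None, 1)
        seq_id = fields[0] if fields else ""
        if "[" in seq_id:
            hits.append((number, seq_id))
    return hits


if __name__ == "__main__":
    fasta = (
        ">contig001[location=chromosome]\nACGT\n"
        ">contig002 [location=plasmid]\nGG\n"
    )
    for number, seq_id in find_unspaced_modifiers(fasta):
        print(f"line {number}: add a space before '[' in {seq_id!r}")
```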

vappiah commented 2 months ago

Thanks @azat-badretdin . I made the necessary correction and it works now.

azat-badretdin commented 2 months ago

Glad to hear that, user @vappiah !