ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[FEATURE REQUEST] Additional check: if sequence is circular, it must also be either chromosome, plasmid, or extrachromosomal #251

Closed by MrTomRod 1 year ago

MrTomRod commented 1 year ago

Is your feature request related to a problem? Please describe.

I wanted to submit some genomes I annotated using PGAP to GenBank. I got this error:

ERROR: valid [SEQ_INST.CircBactGenomeProblem] Circular Bacteria or Archaea should be chromosome, or plasmid, or extrachromosomal BIOSEQ: lcl|STRAIN_scf1: raw, dna len= 2345678

Describe the solution you'd like

PGAP should have aborted with the same error message.

Describe alternatives you've considered

I have simply added the following line to the appropriate place in the .sqn/ASN.1 file:

...
      seq-set {
        set {
          class nuc-prot,
          descr {
            source {
              genome chromosome, <<< this is the line I added
...
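For anyone wanting to script the same manual workaround, here is a minimal sketch. It assumes the text ASN.1 is laid out as in the snippet above (a line containing only `source {`, with fields indented two further spaces) and that every sequence in the file really is a chromosome, as in this single-contig case. The function name `add_genome_chromosome` is made up for illustration and is not part of PGAP or any NCBI tool.

```python
import re

def add_genome_chromosome(asn_text: str) -> str:
    """Insert 'genome chromosome,' as the first field of each 'source {' block
    whose first field is not already a 'genome' entry.

    Assumption: text ASN.1 with one 'source {' per line, as in the snippet above.
    """
    lines = asn_text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        m = re.match(r"^(\s*)source \{\s*$", line)
        if m:
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            # Skip blocks that already declare a genome location.
            if not nxt.strip().startswith("genome "):
                out.append(f"{m.group(1)}  genome chromosome,")
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if asn_text.endswith("\n") else "")
```

This edits every `source` block it finds, so it is not suitable for files that also contain plasmids. Keep a copy of the original .sqn before overwriting it.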

Thanks for your nice pipeline!

azat-badretdin commented 1 year ago

Thank you for your report, Thomas! Glad to see you back!

I have simply added the following line to the appropriate place in the .sqn/ASN.1 file

Current input to PGAPX is a FASTA file. It would be nice to reproduce this from FASTA input.

MrTomRod commented 1 year ago

You remember me! :smiley: Happy to work with you again, too, Azat.

Not sure what PGAPX is. The FASTA for PGAP looked like this; it had only one circular contig:

>STRAIN_scf1 [topology=circular][completeness=complete]
[...DNA...]
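A rough user-side pre-check for deflines like the one above could look like the following sketch. It is not PGAP's own logic. It only parses the bracketed `[key=value]` modifiers and flags a circular sequence that names no location. Treating `location` and `plasmid-name` as the modifiers that declare chromosome or plasmid is an assumption here, and the function names are hypothetical.

```python
import re

# Matches bracketed modifiers such as [topology=circular]
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def defline_modifiers(defline: str) -> dict:
    """Return the [key=value] modifiers of a FASTA defline as a dict."""
    return {k.strip().lower(): v.strip() for k, v in MODIFIER.findall(defline)}

def circular_without_location(defline: str) -> bool:
    """True if the defline says circular but declares no location.

    Assumption: 'location' or 'plasmid-name' is how a location would be given.
    """
    mods = defline_modifiers(defline)
    is_circular = mods.get("topology", "").lower() == "circular"
    has_location = "location" in mods or "plasmid-name" in mods
    return is_circular and not has_location
```

With the defline shown above, `circular_without_location(">STRAIN_scf1 [topology=circular][completeness=complete]")` is true, which is the situation the GenBank validator complained about.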

The genome was annotated using PGAP 2020-02-06.build4373.

Actually, you can download the data here: assembly, gbk, sqn. (For submission, I had to remove all -p1-1.1 and -p1-1 from the file.)
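The suffix removal mentioned in the parenthesis can be sketched as a plain text substitution. This is an illustration only: the function name is made up, and it blindly removes the two strings wherever they occur, so an identifier that merely contains `-p1-1` (for example one ending in `-p1-10`) would be mangled.

```python
def strip_pgap_suffixes(text: str) -> str:
    """Remove every '-p1-1.1' and '-p1-1' occurrence from the text."""
    # Remove the longer suffix first so '-p1-1.1' does not leave a stray '.1'.
    for suffix in ("-p1-1.1", "-p1-1"):
        text = text.replace(suffix, "")
    return text
```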

azat-badretdin commented 1 year ago

Not sure what PGAPX is.

PGAPX is a flavor of PGAP that is the product of this github repo.

Actually, you can download the data here: assembly, gbk, sqn

Thanks for the data. So I presume this does not fail when you run ./pgap.py on the "assembly" file, but essentially the same genome fails genome submission when you submit the ASN.1 file, i.e. the output of running ./pgap.py?

MrTomRod commented 1 year ago

I presume this does not fail when you run ./pgap.py on the "assembly" file, but essentially the same genome fails genome submission when you submit the ASN.1 file, i.e. the output of running ./pgap.py?

Exactly.

azat-badretdin commented 1 year ago

Thanks, Thomas. I opened an internal ticket for this, and we will address the issue.

azat-badretdin commented 1 year ago

Thomas, we actually need the input FASTA file as well.

MrTomRod commented 1 year ago

I already posted it:

Actually, you can download the data here: assembly (...)

Am I missing something?

azat-badretdin commented 1 year ago

My apologies, Thomas, internally we use the term "assembly" for a separate object that is a collection of sequence references and associated metadata, so I skipped over your link.

azat-badretdin commented 1 year ago

Would you mind posting your submol.yaml file as well?

azat-badretdin commented 1 year ago

I just noticed that you are using a three-year-old release version: 2020-02-06.build4373.

Have you tried using a newer version to see if this is reproducible?

MrTomRod commented 1 year ago

No worries.

Here's the submol.yaml, and here is the cwltool.log.

How should I reproduce it? Annotate it with the new pipeline and check if the genome chromosome is present in ASN.1?

(I don't have the time to install the old pipeline or make a fake submission of an already-submitted genome.)

azat-badretdin commented 1 year ago

Annotate it with the new pipeline and check if the genome chromosome is present in ASN.1?

Feel free to post the ASN.1 here and we will run our standard validation tools on it.

I don't have the time to install the old pipeline

I am not sure I understand. You posted that you were using "old pipeline" of 2020, no?

azat-badretdin commented 1 year ago

I examined your FASTA file and YAML file - with this input, it should be treated as chromosome.

Suggesting the most recent version is our current practice, since we do not support older versions.

MrTomRod commented 1 year ago

Thanks for fixing the issue (I think)!

Sorry, I forgot to reply:

Feel free to post the ASN.1 here and we will run our standard validation tools on it.

I already linked this file above: .sqn

I am not sure I understand. You posted that you were using "old pipeline" of 2020, no?

Yes, I annotated this genome in 2020 with the then-current pipeline. Now I'm using the current one.