ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] annot.fna contents same as input FASTA file instead of CDS #229

Closed aboffin closed 1 year ago

aboffin commented 1 year ago

Describe the bug I am sorry if I am missing something obvious, but it looks like annot.fna contains the same input genome sequence and not the expected CDS from the genome. Is there a way to get the nucleotide gene sequences, similar to the predicted/annotated protein sequences (annot.faa)?

To Reproduce I am trying to annotate the GCA_900683725.1 genome. annot.faa contains 4249 protein sequences (as expected for a genome of ~5 Mb); however, annot.fna contains a single sequence that is exactly the same as the input sequence. Only the header line differs:

diff annot.fna GCA_900683725.1.fasta 
1c1
< >lcl|NZ_LR215978.1 Parabacteroides distasonis strain Unknown chromosome, whole genome shotgun sequence
---
> >NZ_LR215978.1 Parabacteroides distasonis ATCC 8503 isolate Parabacteroides distasonis 82G9 chromosome 1
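Because `diff` also reports header and line-wrapping differences, a header-insensitive comparison makes the check more explicit. Below is a minimal sketch in plain Python; the function names `read_fasta` and `same_sequences` are hypothetical and not part of PGAP, and the comparison assumes both files list their sequences in the same order.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def same_sequences(fasta_a, fasta_b):
    """True if both FASTA texts hold the same sequences in the same order.

    Headers, letter case, and line wrapping are ignored.
    """
    seqs_a = [seq.upper() for _, seq in read_fasta(fasta_a)]
    seqs_b = [seq.upper() for _, seq in read_fasta(fasta_b)]
    return seqs_a == seqs_b
```

To use it on real files, read each file into a string first, for example `same_sequences(open("annot.fna").read(), open("GCA_900683725.1.fasta").read())`.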

Expected behavior I expected that annot.fna contains a nucleotide FASTA file of CDS predicted/annotated from the input genome similar to annot.faa that contains the amino acid FASTA file of CDS.

Log Files The debug logs from tmp-outdir are here

Additional context If there are x contigs in the input FASTA, the corresponding annot.fna output contains x contigs as well. Is this the expected behavior?

azat-badretdin commented 1 year ago

Thank you for your feature suggestion, Senthil!

We are already planning to add this feature in the next release.

aboffin commented 1 year ago

Ah, it's not a bug, it's a feature 😄! I built some of my pipeline based on the assumption that annot.fna contained the nucleotide CDS, so I cannot wait for the next release. @azat-badretdin do you know when the next release is expected? Thank you!

azat-badretdin commented 1 year ago

> I built some of my pipeline based on the assumption that annot.fna contained the nucleotide CDS

See http://defindit.com/readme_files/ncbi_file_extension_format.html for file extension definitions. This is not the original URL, but I am sure the definitions are the same in other copies.

According to this resource:

.fna: genome fasta sequence
.ffn: protein coding portions of the genome segments

azat-badretdin commented 1 year ago

https://github.com/ncbi/pgap/wiki/Output-Files

annot.fna: Genomic sequence(s) in FASTA format, as provided on input
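Since annot.fna holds the genomic sequence, the CDS nucleotide sequences can in principle be cut out of it using the coordinates in the GFF output. The following is only a rough sketch under stated assumptions, not PGAP's own method: the function names are hypothetical, the GFF is assumed to be standard 9-column GFF3 with an `ID` attribute on CDS rows, and the GFF sequence IDs are assumed to match the first token of the FASTA headers once a leading `lcl|` is stripped. Features that wrap around the origin of a circular replicon are not handled.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def normalize_id(seq_id):
    """Assumption: a leading 'lcl|' is the only ID difference between files."""
    return seq_id[4:] if seq_id.startswith("lcl|") else seq_id


def parse_fasta(text):
    """Return {sequence_id: sequence}, keyed by the first header token."""
    seqs, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            current = normalize_id(tokens[0]) if tokens else ""
            seqs[current] = []
        elif line and current is not None:
            seqs[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in seqs.items()}


def extract_cds(gff_text, genome):
    """Yield (cds_id, nucleotide_sequence) for each CDS in GFF3 text."""
    parts_by_id = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid = normalize_id(cols[0])
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        cds_id = attrs.get("ID", f"{seqid}:{start}-{end}")
        # Rows sharing an ID are treated as segments of one CDS.
        parts_by_id.setdefault(cds_id, []).append((seqid, start, end, strand))
    for cds_id, parts in parts_by_id.items():
        parts.sort(key=lambda p: p[1])
        # GFF3 coordinates are 1-based and inclusive.
        seq = "".join(genome[s][a - 1:b] for s, a, b, _ in parts)
        if parts[0][3] == "-":
            seq = reverse_complement(seq)
        yield cds_id, seq
```

With the real outputs this would be called roughly as `extract_cds(open("annot.gff").read(), parse_fasta(open("annot.fna").read()))`; translating a few results and comparing them with annot.faa is a reasonable sanity check before relying on them.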

aboffin commented 1 year ago

@azat-badretdin Thanks for the links. Mea culpa, I agree I misunderstood the contents of annot.fna.

It is still very surprising not to see the gene nucleotide sequence, since the bare minimum expectation is to see annotated gene (nucleotide and amino acid) sequences in the output of a genome annotation pipeline. It is also difficult to understand the usefulness of providing the same output as the input FASTA.

Any timeline on the next release will help me decide on the next steps. Thank you, I appreciate your help!

azat-badretdin commented 1 year ago

> It is still very surprising not to see the gene nucleotide sequence

I am glad we are going to cover this glaring hole as well :-)

aboffin commented 1 year ago

Will the next release also address the GFF file not containing the protein FASTA IDs used in annot.faa, as mentioned in issue #226? FWIW, I am also very much interested in the quick PGAP annotation mentioned in the same thread.

thibaudnis commented 1 year ago

Hi - thank you for the feedback and requests. The next release will include a FASTA of the CDSs and a FASTA of the translated CDSs. We also have a ticket to create an output file that contains the annotation in GFF followed by the multi-FASTA of the genomic sequence, which I think is what you are asking about.

azat-badretdin commented 1 year ago

@aboffin Senthil, are you interested in trying Roary? You enthusiastically reacted to Françoise's comment announcing our plans to add "gff followed by the multifasta of the genomic sequence". If you are, indeed, interested in this application, feel free to try our sample input to Roary attached to #226

aboffin commented 1 year ago

@azat-badretdin Hi Azat, thank you for the message. My enthusiasm was for the nucleotide CDS in the next release! 😄 I am sorry, I am unable to test Roary at the moment.

Edited to add: I asked about the GFF file because I am interested in reconciling the nucleotide CDS headers with the amino acid CDS headers for the same genome. In that thread it was mentioned that the protein FASTA headers and the ones in the GFF file seem to differ. Sorry for any confusion.