loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

We currently include stop codons in protein sequences #2841

Open theosanderson opened 1 month ago

theosanderson commented 1 month ago

At least in production, we currently include stop codons (*) at the end of our protein sequences. To me this feels quite unexpected. NCBI doesn't seem to do it: https://www.ncbi.nlm.nih.gov/protein/YP_011036812 .

We should think through what the ideal behaviour on this is.

chaoran-chen commented 1 month ago

Interesting to see that NCBI doesn't! Nextclade always includes the stop codon if I'm not mistaken, so I think that I have only ever worked with protein sequences that end with it. Do you see pros/cons for including/not including it?

theosanderson commented 1 month ago

It would be cool to ask the Nextclade/Nextstrain folks as I guess they will have views on the pros of including it.

Basically, I find it confusing that it is included. I think of these as amino acid sequences, and stop isn't an amino acid. UniProt is the main protein sequence database and also doesn't include them https://www.uniprot.org/uniprotkb/Q6SJ61/entry#sequences (and same for NCBI as discussed). It feels redundant given that it's always this ending. It just confused me when I was adding a new prototype organism and forgot to include these in the yaml, and the same happened with one of our collaborators who was testing Loculus for their organism. ChatGPT (fwiw) says that not including them is more common, and mentions the PDB too which is another point of comparison https://chatgpt.com/share/66eb36d9-782c-8005-9126-a24d3c105a6b . It also means that the length of the protein sequence we provide in characters is one more than the actual length of the protein.

chaoran-chen commented 1 month ago

ok, interesting! I have no preference regarding this but a general question: if we decide to (or not to) change this for Pathoplexus / the Nextclade-based pipeline, do you think that we should regulate this on the Loculus-level, i.e., enforce that an AA sequence must or must not end with a stop codon (the latter could mean that we remove the stop codon from the list of accepted characters for an AA sequence)?

theosanderson commented 1 month ago

I guess for me that depends a lot on whether we decide to change it :). The idea of enforcing that there must be a stop codon (which I guess we don't really do atm, just through the Nextclade pipeline) seems bad to me due to my general preference against them. ~The idea of enforcing that there can't be a stop codon seems potentially reasonable, but it's not something I feel strongly about.~ (I.e. supporting both depending on admin-choice seems OK)

[Edit: on reflection - I don't think enforcing an absence of * would make sense for Loculus - flexibility for admins seems good]

corneliusroemer commented 1 month ago

One reason to include them is that this way one can detect when the stop is no longer a stop. I don't know for sure why we do what we do in Nextclade, but this may be a reason. @ivan-aksamentov @rneher will know better.
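To illustrate the point about detecting when the stop is no longer a stop, here is a minimal sketch. It assumes the protein sequence is a plain string with `*` as the stop symbol; `stop_status` and its return labels are hypothetical names chosen for this example, not part of Nextclade or Loculus.

```python
def stop_status(protein: str, stop: str = "*") -> str:
    """Classify a translated sequence by where its stop symbols fall.

    Hypothetical helper: the labels below are made up for illustration.
    """
    count = protein.count(stop)
    if count == 0:
        # The expected terminal stop is gone (e.g. mutated away or truncated).
        return "no_stop"
    if count == 1 and protein.endswith(stop):
        # Exactly one stop, at the very end: the usual case.
        return "terminal_only"
    # A stop somewhere before the end, or more than one stop.
    return "internal_or_multiple"


print(stop_status("MKT*"))   # terminal_only
print(stop_status("MKT"))    # no_stop
print(stop_status("M*KT*"))  # internal_or_multiple
```

This check only works if the terminal `*` is kept in the stored sequence; once it is stripped, "no stop" and "normal stop" look the same.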

ivan-aksamentov commented 1 month ago

In the presence of sequencing defects and in some weird organisms there could be 0, 1 or multiple stop codons in a gene after computational translation. Nextclade doesn't know which one is the true end-of-sequence codon, if any, and it chooses not to speculate. That is, we don't truncate sequences and we emit all stop codons. This is also the simplest to implement.
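As a rough sketch of what "don't truncate, emit all stop codons" means, the snippet below translates a nucleotide string codon by codon using the standard genetic code and never stops early. This is an illustration only, not Nextclade's implementation; `translate` is a hypothetical function, and mapping unknown codons to `X` is an assumption made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate every complete codon, emitting '*' for each stop codon.

    Translation does not halt at a stop; trailing bases that do not
    form a full codon are ignored. Unknown codons become 'X' (assumption).
    """
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGAAATAA"))     # MK*   (one terminal stop)
print(translate("ATGTAAAAATGA"))  # M*K*  (internal and terminal stop)
print(translate("ATGAAA"))        # MK    (no stop at all)
```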

This kind of question seems to be common in computational biology: when data has defects or is incomplete, do you want biologically plausible results or the most practically useful results? Do you want Nextclade to behave as a ribosome and obey all the laws of biology, or do you want the full sequence, even past an accidental stop in the middle of the sequence? As a practical tool we chose the latter. Technically this could be changed with a toggle or multiple.

I cannot speak about NCBI - maybe they read true peptides instead of doing analytical translation, or maybe they have a particular stance on this, or maybe they simply did not think about it much.

I personally think it does not hurt to have all stop codons explicitly. In the meantime, if you disagree, then it could be fixed with a simple transform in post-processing.
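The post-processing transform mentioned above could look roughly like this. It is a sketch under the assumption that sequences are plain strings and that only a single trailing `*` should be dropped, leaving any internal stops visible; `strip_terminal_stop` is a hypothetical name, not an existing Loculus or Nextclade function.

```python
def strip_terminal_stop(protein: str, stop: str = "*") -> str:
    """Remove one trailing stop symbol, leaving internal stops untouched."""
    return protein[:-len(stop)] if stop and protein.endswith(stop) else protein


print(strip_terminal_stop("MKT*"))   # MKT
print(strip_terminal_stop("M*KT*"))  # M*KT
print(strip_terminal_stop("MKT"))    # MKT
```

Removing only the final symbol, rather than every `*`, keeps premature stops detectable downstream.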

theosanderson commented 1 month ago

Yes, to be clear I'm definitely not proposing any change to Nextclade.

Thanks, that's useful. Yes, I hadn't fully appreciated that the CDS is annotated up to the stop codon - so with the NCBI approach your nucleotide sequence isn't exactly 3x as long as your amino acid sequence, which is non-ideal too.
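The length arithmetic can be made concrete with a small sketch. It assumes a CDS that is annotated through a single terminal stop codon and has no internal stops; `protein_lengths` is a hypothetical helper for illustration.

```python
def protein_lengths(cds_length: int) -> tuple[int, int]:
    """Return (length with '*', length without '*') for a CDS of cds_length nt.

    Assumes the CDS includes exactly one terminal stop codon.
    """
    if cds_length < 3 or cds_length % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    with_stop = cds_length // 3
    return with_stop, with_stop - 1


with_stop, residues = protein_lengths(9)  # e.g. ATG AAA TAA
print(with_stop, residues)                # 3 2
```

So when `*` is kept, the character count is exactly one third of the CDS length but one more than the number of residues; when it is dropped, the count matches the residues but the CDS is 3 nt longer than 3x the protein length.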

ivan-aksamentov commented 1 month ago

> to be clear I'm definitely not proposing any change to Nextclade

I don't see why not. If you come up with a good way to manage stops, then we could "borrow" (steal :)) it as an option or as a breaking change for a new major version. I tried to explain a little that it might not be as straightforward as it seems though.

The length problem seems to be a real problem which I did not realize until now. I don't think Nextclade emits peptide length explicitly, but it might be used to calculate other metrics, making them slightly imprecise. It's something that is worth at least documenting.