qiime2 / q2-types

BSD 3-Clause "New" or "Revised" License

IMP: Expand alphabet for ProteinFASTAFormat #306

Closed Sann5 closed 11 months ago

Sann5 commented 11 months ago

Context

Proposed update

  1. Include U, the one-letter symbol for Selenocysteine, recognized by the IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
  2. Include O, the one-letter symbol for Pyrrolysine, recommended since 2009 by the IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN), but not officially included in the standard one-letter code symbols.
  3. Include J, the one-letter symbol representing Leucine (L) or Isoleucine (I).
    • B and Z are already used to encode ambiguity between amino acids, so it would be natural to include J as well.
    • Even though J is not yet part of the IUPAC standard, it is already common in public databases. For example, this article claims that, as of 2013, the RefSeq non-redundant protein set contained 15309 occurrences of the character J.
    • It has also been endorsed by institutions active in the bioinformatics community:
  4. Finally, I would like the format to allow for lowercase symbols.
    • Some use lowercase symbols to denote stereoisomerism.
    • Furthermore, most tools are case-insensitive, effectively mapping lowercase symbols to their uppercase counterparts.
    • Therefore, I do not think this will lead to undefined behavior with most tools, but the only way to be sure is to try it or check each tool individually (something I would very much like to avoid).
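
As a rough illustration of the proposal above, the following is a minimal sketch of a validator for the expanded alphabet. It is not the q2-types implementation: the names `ALPHABET`, `build_sequence_pattern`, and `invalid_sequence_lines` are hypothetical, and the sketch assumes sequence lines contain only letters (no gap or stop characters).

```python
import re

# Hypothetical alphabet assembled from the proposal; not copied from q2-types.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
RARE = "UO"                        # selenocysteine (U), pyrrolysine (O)
AMBIGUOUS = "BZJX"                 # Asx, Glx, Leu/Ile, unknown residue
ALPHABET = STANDARD + RARE + AMBIGUOUS


def build_sequence_pattern(allow_lowercase=False):
    """Compile a regex that matches one or more symbols from the alphabet."""
    letters = ALPHABET
    if allow_lowercase:
        # Point 4 of the proposal: also accept lowercase symbols.
        letters += ALPHABET.lower()
    return re.compile(f"[{letters}]+")


def invalid_sequence_lines(fasta_text, allow_lowercase=False):
    """Return 1-based numbers of sequence lines with disallowed characters.

    Header lines (starting with '>') and blank lines are skipped.
    """
    pattern = build_sequence_pattern(allow_lowercase)
    bad = []
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        if pattern.fullmatch(line) is None:
            bad.append(number)
    return bad


example = ">seq1\nMKUOJBZX\n>seq2\nmkuoj\n>seq3\nMK1\n"
print(invalid_sequence_lines(example))
print(invalid_sequence_lines(example, allow_lowercase=True))
```

Keeping lowercase support behind a flag mirrors the idea of a separate mixed-case format, so strict consumers can continue to require uppercase input.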

Motivation

I'm trying to use a database of protein sequences that contain all of the above. This is part of some enhancements I'm working on for the q2-moshpit plugin.

Sann5 commented 11 months ago

Thank you @gregcaporaso for the detailed review. I will make the proposed changes and re-request a review.

I ended up adding a MixedCaseProteinSequencesDirectoryFormat and a MixedCaseAlignedProteinSequencesDirectoryFormat (and some tests for them). Let me know if I should remove this.

gregcaporaso commented 11 months ago

@Sann5, just FYI we'll be getting back to this for a re-review shortly. Thanks for the updates!

lizgehret commented 11 months ago

Hi @Sann5,

Thanks for your patience! I'll review this today and follow up with any questions or requested changes.

lizgehret commented 11 months ago

At first glance, this all looks quite reasonable! I'm going to pull your changes down locally to test. Unless anything unusual comes up, this should be good to go.

Sann5 commented 11 months ago

Thank you for the thorough review @lizgehret. Cheers!

gregcaporaso commented 11 months ago

@lizgehret, @Sann5 - I notice that all 26 English letters are now in this alphabet. Is that the intention? I just want to confirm that this doesn't contain any unintended characters. Scratch that, it seems right: 20 standard amino acids, plus X, U, O, J, B, Z = 26.

Thanks for the work on this one!
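
The letter count in the comment above can be double-checked with a few lines of Python. This is only a sanity check of the arithmetic; the variable names are illustrative and nothing here is taken from the q2-types code.

```python
import string

# The 20 standard amino acid one-letter codes.
standard_residues = set("ACDEFGHIKLMNPQRSTVWY")
# The six additional symbols listed in the comment: X, U, O, J, B, Z.
extra_symbols = set("XUOJBZ")

full_alphabet = standard_residues | extra_symbols

# 20 standard symbols, no overlap with the extras, and together they
# cover exactly the 26 uppercase English letters.
assert len(standard_residues) == 20
assert standard_residues.isdisjoint(extra_symbols)
assert full_alphabet == set(string.ascii_uppercase)
```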