Closed Sann5 closed 11 months ago
Thank you @gregcaporaso for the detailed review. I will make the proposed changes and re-request a review.
MixedCaseProteinFASTAFormat
ProteinFASTAFormat (to include the new proposed characters)
MixedCaseProteinFASTAFormat
I ended up adding a MixedCaseProteinSequencesDirectoryFormat and a MixedCaseAlignedProteinSequencesDirectoryFormat (and some tests for them). Let me know if I should remove this.
@Sann5, just FYI we'll be getting back to this for a re-review shortly. Thanks for the updates!
Hi @Sann5,
Thanks for your patience! I'll review this today and follow up with any questions or requested changes.
On first glance, this all looks quite reasonable! I'm going to pull your changes down locally to test - unless anything unusual comes up, this should be good to go.
Thank you for the thorough review @lizgehret. Cheers!
@lizgehret, @Sann5 - I notice that all 26 English letters are now in this alphabet. Is that the intention? I just want to confirm that this doesn't contain any unintended characters. Scratch that, it seems right: 20 standard amino acids, plus X, U, O, J, B, Z = 26.
Thanks for the work on this one!
Context
The ProteinFASTAFormat alphabet is the upper-case alphabet (excluding J, O and U) plus the * character.
Proposed update
U, the one-letter symbol for Selenocysteine, recognized by the IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
O, the one-letter symbol for Pyrrolysine, recommended since 2009 by the IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN), but not officially included in the standard one-letter code symbols.
J, the one-letter symbol representing Leucine (L) or Isoleucine (I). B and Z are already used to encode ambiguity between amino acids, so it would be natural to include J as well. Although J is still not part of the IUPAC standard, it is already commonly found in public databases. For example, this article claims that, as of 2013, the RefSeq non-redundant proteins contained 15309 occurrences of the character J.
Motivation
I'm trying to use a database of protein sequences that contain all of the above. This is part of some enhancements I'm working on for the q2-moshpit plugin.
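For illustration, the alphabet change described under Context and Proposed update could be sketched roughly as below. This is a minimal sketch and not the actual q2-types code: the names CURRENT_ALPHABET, PROPOSED_ALPHABET and build_validator are hypothetical, and any other characters the real format may accept (for example gap characters) are deliberately left out.

```python
import re
import string

# Hypothetical constants for illustration only; not the q2-types implementation.
# Current alphabet as described above: upper-case letters except J, O and U, plus "*".
CURRENT_ALPHABET = "".join(
    c for c in string.ascii_uppercase if c not in "JOU"
) + "*"

# Proposed alphabet: all 26 upper-case letters, plus "*".
PROPOSED_ALPHABET = string.ascii_uppercase + "*"


def build_validator(alphabet):
    """Return a function that checks a sequence against the given alphabet."""
    # re.escape keeps "*" literal inside the character class.
    pattern = re.compile(f"[{re.escape(alphabet)}]+")

    def is_valid(sequence):
        # fullmatch requires every character to belong to the alphabet.
        return pattern.fullmatch(sequence) is not None

    return is_valid


is_valid_current = build_validator(CURRENT_ALPHABET)
is_valid_proposed = build_validator(PROPOSED_ALPHABET)

print(is_valid_current("MKUJO*"))   # False: U, J and O are rejected
print(is_valid_proposed("MKUJO*"))  # True
```

Under these assumptions, a sequence containing U, O or J fails the current check but passes the proposed one, while lower-case input is rejected by both.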