Issue opened by trishorts 6 years ago (status: Open)
Excellent question, I've wondered the same thing. I don't believe that the standard addresses this, right? I've always imagined a file with a single ProForma term on each line, but the idea of a FASTA-like setup is interesting. It would provide a place for metadata, but honestly, I'm not sure that there would be that much to say about specific proteoforms.
This is being discussed here, too: https://github.com/topdownproteomics/ProteoformNomenclatureStandard/issues/11
Personally, I'm in favor of a FASTA-like setup to allow specifying metadata (accessions, ontologies).
Should we allow the files to be split at 60 characters per line? That could be pretty strange if tags get split in the middle.
Examples:
PROTEOFORMPROTEOFORMPROTEOFORMPROTEOFORMPROTEOFORMPROT[Phosp
ho]EOFORM
ROTEOFORMPROTEOFORMPROTEOFORMPROTEOFORMPROTEOFORMPROT[mass:7
9.98]EOFORM
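To make the wrapping problem concrete, here is a minimal Python sketch that wraps a sequence at a fixed width and reports which line breaks land inside a bracketed tag. The function names `wrap_fixed` and `lines_ending_inside_tag` are hypothetical, and the check assumes tags are delimited by balanced `[` and `]`, as in the examples above.

```python
from typing import List


def wrap_fixed(sequence: str, width: int = 60) -> List[str]:
    """Cut a sequence into chunks of at most `width` characters, ignoring tags."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def lines_ending_inside_tag(lines: List[str]) -> List[int]:
    """Return indices of lines whose trailing line break falls inside a [...] tag."""
    depth = 0  # net count of unclosed '[' seen so far
    hits = []
    for index, line in enumerate(lines[:-1]):  # no break follows the final line
        depth += line.count("[") - line.count("]")
        if depth > 0:
            hits.append(index)
    return hits


sequence = "PROTEOFORM" * 5 + "PROT[Phospho]EOFORM"
wrapped = wrap_fixed(sequence)
print(wrapped)
print(lines_ending_inside_tag(wrapped))
```

With the first example above, the break after character 60 falls between `[Phosp` and `ho]`, so the first line is reported as ending inside a tag.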
FASTQ files are, in practice, almost never wrapped at 60 characters, so there is precedent for requiring no line breaks. But if we're going to call it a ProForma FASTA file, we should allow line breaks.
My 2 cents: we should allow line breaks if we expect people to edit these files manually. Given the specific set of modifications that users might want, I assume this will be the case. That said, should we allow tags to be broken across two lines? That seems messy, but parsers wouldn't really care, since we'd just mash everything together.
Despite the mess, I think we should allow tags to be broken across lines. Mashing together worked for all sorts of potatoes last week, so it should work for us here.
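The "mash everything together" approach could look roughly like the Python sketch below. It assumes a FASTA-like layout in which each record starts with a `>` header line and every following line up to the next header is a fragment of one ProForma string. Nothing here is defined by the standard: the function name `read_proforma_fasta` and the layout itself are assumptions for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_proforma_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, proforma) pairs, joining wrapped sequence lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        # Assumption: whitespace at line ends is not significant. A tag whose
        # text contains a space exactly at the break would lose that space.
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Caveat: a wrapped fragment that happens to begin with '>' would
            # be misread as a new header under this scheme.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence line found before any '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">proteoform_1 example metadata",
    "PROTEOFORMPROT[Phosp",
    "ho]EOFORM",
    ">proteoform_2",
    "PEPTIDE",
]
for name, proforma in read_proforma_fasta(example):
    print(name, proforma)
```

Because fragments are concatenated before any tag parsing happens, a tag split across lines is restored intact, which is why the parser does not care where the breaks fall.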
I'm wondering about reading and writing multiple proteoforms. Does a proteoform have to be completely on a single line? If not, is there a line length limit (as in FASTA)? Does each proteoform have to start with a specific character, like ">", so we know where it begins?