topdownproteomics / sdk

Software solution for common top-down proteomics tasks
http://www.topdownproteomics.org/
MIT License

Chemical formula expansion and performance #41

Open acesnik opened 6 years ago

acesnik commented 6 years ago

From @rfellers:

I am curious what requirements others might have for a chemical formula interface. I was only focused on ProForma, but that still means that we need to handle regular elements, pure isotopes of elements (e.g. C13), and Unimod "atoms" (which can additionally represent glycan residues and common molecules). Should we add to the benchmarking app to include chemical formulas? How important is performance?

acesnik commented 6 years ago

We are somewhat interested in performance, but our main concern is whether the chemical formula interface gives the same results as mzLib, since we would eventually depend on its mass calculations and such matching. You can find some tests for the mzLib implementation here. From skimming the code, it looks promising; your implementation looks similar to mzLib, e.g. in using the NIST database.

I'm not sure how we have handled Unimod shorthand for glycans. @rmillikin, do you know about that?

rfellers commented 6 years ago

Gotcha. What format for chemical formulas do you use, i.e. is there a name? Looks very similar to what we use at NU, but it has some custom stuff for isotopes. Unimod has a composition format and RESID/PSI-MOD uses something else.

Here's an example for Label:13C(9)15N(1): https://www.ebi.ac.uk/ols/ontologies/mod/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMOD_00589

Given all of these differences, I plan to have multiple parsers/writers that work with a generic IChemicalFormula interface. This means, however, that a simple ToString() on a chemicalFormula doesn't make sense unless we adopt one of the notations as a standard ...
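To illustrate the point about ToString(), here is a minimal sketch of a notation-agnostic formula with pluggable writers. It is written in Python rather than the SDK's own language, and every name in it (`ChemicalFormula`, `UnimodStyleWriter`, `BracketStyleWriter`) is hypothetical, as is the bracket notation itself. The Unimod-style output only approximates Unimod's composition format and is not checked against it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol, Tuple

# (symbol, isotope mass number); None means natural isotopic abundance.
Atom = Tuple[str, Optional[int]]


@dataclass(frozen=True)
class ChemicalFormula:
    """Notation-agnostic formula: a mapping of atom -> count (hypothetical type)."""
    counts: Dict[Atom, int]


class FormulaWriter(Protocol):
    """Anything that can render a formula in one particular notation."""
    def write(self, formula: ChemicalFormula) -> str: ...


class UnimodStyleWriter:
    """Renders terms like '13C(9)'; a count of 1 is left implicit."""
    def write(self, formula: ChemicalFormula) -> str:
        parts = []
        for (symbol, isotope), n in formula.counts.items():
            atom = f"{isotope or ''}{symbol}"
            parts.append(atom if n == 1 else f"{atom}({n})")
        return " ".join(parts)


class BracketStyleWriter:
    """Made-up notation for contrast: isotopes in brackets, count always shown."""
    def write(self, formula: ChemicalFormula) -> str:
        parts = []
        for (symbol, isotope), n in formula.counts.items():
            atom = f"[{isotope}{symbol}]" if isotope else symbol
            parts.append(f"{atom}{n}")
        return " ".join(parts)


# Heavy atoms only: nine 13C and one 15N.
label = ChemicalFormula({("C", 13): 9, ("N", 15): 1})
print(UnimodStyleWriter().write(label))   # 13C(9) 15N
print(BracketStyleWriter().write(label))  # [13C]9 [15N]1
```

The same formula object yields two different strings, so a bare ToString() would have to pick one writer as the default.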

acesnik commented 6 years ago

Wow, that's an unfortunate mess, isn't it? I think Unimod's is the most readable.

rfellers commented 6 years ago

Indeed, messy. As best I can tell, there is no standard way to write chemical formulas ... shall we start a ProFormula manuscript? :) Unimod is probably the best and it is what ProForma chose as the default, so we can lean towards that format as appropriate.

acesnik commented 6 years ago

Ha! ProFormula would be something.

Yes, I think we should lean towards Unimod's format, but writing multiple parsers would allow us to read all of those formats. That makes me wonder how the parser will distinguish the formula formats...

rfellers commented 6 years ago

Here's where my head is at presently:

ProForma standardized on Unimod format and will always assume the chemical formulas are written using that format (and throw errors accordingly).

Does that help at all or am I missing your point?
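The "assume Unimod format and throw errors" approach could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the SDK's parser: `parse_unimod_style` is a hypothetical name, the grammar is a simplified guess at Unimod-style terms (optional isotope number, symbol, optional signed count in parentheses, separated by whitespace), and symbols are not validated against an element table.

```python
import re
from typing import Dict, Optional, Tuple

# (symbol, isotope mass number); None means natural isotopic abundance.
Atom = Tuple[str, Optional[int]]

# One term: optional isotope number, symbol, optional signed count in parentheses.
# Assumed grammar; symbols may be Unimod "atoms" such as Hex, so any letters pass.
_TERM = re.compile(r"(\d+)?([A-Za-z]+)(?:\((-?\d+)\))?")


def parse_unimod_style(text: str) -> Dict[Atom, int]:
    """Parse a whitespace-separated Unimod-style formula or raise ValueError."""
    terms = text.split()
    if not terms:
        raise ValueError("empty formula")
    counts: Dict[Atom, int] = {}
    for term in terms:
        match = _TERM.fullmatch(term)
        if match is None:
            # No attempt to guess another notation: reject anything else.
            raise ValueError(f"not a Unimod-style term: {term!r}")
        isotope, symbol, count = match.groups()
        key = (symbol, int(isotope) if isotope else None)
        counts[key] = counts.get(key, 0) + (int(count) if count else 1)
    return counts


print(parse_unimod_style("H(2) C(2) O"))
print(parse_unimod_style("C(-9) 13C(9)"))
try:
    parse_unimod_style("C2H2O")  # a different notation is rejected, not guessed at
except ValueError as err:
    print("rejected:", err)
```

Because only one notation is ever accepted, no format detection is needed; input written in another notation fails immediately with an error.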

acesnik commented 6 years ago

That helps, thanks! I'm on board.