topdownproteomics / sdk

Software solution for common top-down proteomics tasks
http://www.topdownproteomics.org/
MIT License
9 stars 4 forks source link

Full v2 Spec + Proteoform Level Implementation by zrolfs #105

Closed rfellers closed 1 year ago

rfellers commented 1 year ago

After a 2-year hiatus, I finally got around to this :( ... Sorry for the delay!

Heavily influenced by @zrolfs' PR #84. I didn't take that PR directly because it had been so long that I had forgotten what we decided. Also, I wasn't sure if I'd be able to modify it. So, I worked through it myself again, pulling bits from the other PR as needed. Main difference between this PR and #84 is how to sequence ambiguities are handled. #84 adds a new List to ProFormaTerm specifically for this type, while I added a Sequence Ambiguity flag to the ProFormaTag and allowed an empty Descriptor list. There are pluses and minuses to both approaches, but I erred toward not expanding the already large ProFormaTerm API. I also added more tests to flesh out the full v2 specification (we were missing some late breaking additions).

Any feedback by @zrolfs or others at UW Madison (@Alexander-Sol @nbollis) would be great.

rfellers commented 1 year ago

There were also 2 bits in the proteoform levels tests that I wanted to discuss.

First, my understanding of ProForma doesn't allow for the following: EMEVEESPEK[Acetyl|Trimethyl]. I believe this is using the pipe operator to denote "this modification or that modification". However, that would mean that the mass of the proteoform is unknown, and I don't think we want to allow that. I feel like, in this case, a mass should be used to denoted unknown modification assignment. Thoughts?

Second, I want to discuss what proteoform level people think this should have. PROT(?EOSFORMS[+19.0523])ISKN w/ 2 genes. The test is written to assert that this is Level 5, but I'm not so sure. At question is whether the modification is localized. I'm on the fence about this ... there test is currently failing and Level 4 is being returned. A small tweak would be needed to change this.

acesnik commented 1 year ago

Exciting! I'm not sure I'll have time to review this until the week of Mar 6 or so, but I'm looking forward to checking it out.

acesnik commented 1 year ago

First, my understanding of ProForma doesn't allow for the following: EMEVEESPEK[Acetyl|Trimethyl]. I believe this is using the pipe operator to denote "this modification or that modification". However, that would mean that the mass of the proteoform is unknown, and I don't think we want to allow that. I feel like, in this case, a mass should be used to denoted unknown modification assignment. Thoughts?

The pipe character separates tags corresponding to the same modification, so I think this would be an invalid modification since it has two identifiers. I think in ProForma 2.0, we focused on notating ambiguous modification localization, not so much on ambiguous identity. I'd agree that a mass is probably most appropriate here; you could put the information about possible assignments in an info: tag.

That's kind of what we did in (iv) here:

image
acesnik commented 1 year ago

Second, I want to discuss what proteoform level people think this should have. PROT(?EOSFORMS[+19.0523])ISKN w/ 2 genes. The test is written to assert that this is Level 5, but I'm not so sure. At question is whether the modification is localized. I'm on the fence about this ... there test is currently failing and Level 4 is being returned. A small tweak would be needed to change this.

This is an interesting example. I agree this should be level 5. That serine modification isn't localized, and it's ambiguous with the other serine, so it should be forced into some sort of ambiguity notation upon reading/writing with a warning printed, or an error could be thrown upon parsing. (It should probably be outside the sequence ambiguity parentheses for simplicity and human readability.)

acesnik commented 1 year ago

This diagnostic ion point is still interesting to consider in this sequence ambiguity + modification business. I guess the lesson is that we shouldn't force modifications into the sequence out of ambiguous notation unless we have evidence to support disambiguation. https://github.com/topdownproteomics/sdk/pull/84#issuecomment-864115353

rfellers commented 1 year ago

Thanks @acesnik for jumping on this so quickly! I took a first pass through the comments, thanks for all that. I tried to beef up the parser tests in areas that changed, hopefully I didn't miss anything. I'll pick this up next week and add another commit to address a couple of the things you mentioned.

codecov[bot] commented 1 year ago

Codecov Report

Merging #105 (efd5a08) into master (c12296a) will increase coverage by 0.50%. The diff coverage is 96.69%.

Additional details and impacted files [![Impacted file tree graph](https://codecov.io/gh/topdownproteomics/sdk/pull/105/graphs/tree.svg?width=650&height=150&src=pr&token=T8Kf00TGxl&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics)](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics) ```diff @@ Coverage Diff @@ ## master #105 +/- ## ========================================== + Coverage 83.98% 84.49% +0.50% ========================================== Files 100 102 +2 Lines 5364 5539 +175 ========================================== + Hits 4505 4680 +175 Misses 859 859 ``` | [Impacted Files](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics) | Coverage Δ | | |---|---|---| | [...eomics/Proteomics/FiveLevelProteoformClassifier.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Byb3Rlb21pY3MvRml2ZUxldmVsUHJvdGVvZm9ybUNsYXNzaWZpZXIuY3M=) | `94.69% <94.69%> (ø)` | | | [src/TopDownProteomics/ProForma/ProFormaWriter.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Byb0Zvcm1hL1Byb0Zvcm1hV3JpdGVyLmNz) | `94.20% <97.56%> (+0.23%)` | :arrow_up: | | [src/TopDownProteomics/ProForma/ProFormaParser.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Byb0Zvcm1hL1Byb0Zvcm1hUGFyc2VyLmNz) | `95.98% <98.00%> (+3.39%)` | :arrow_up: | | [...DownProteomics/Biochemistry/BiochemistryUtility.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL0Jpb2NoZW1pc3RyeS9CaW9jaGVtaXN0cnlVdGlsaXR5LmNz) | `100.00% <100.00%> (ø)` | | | [src/TopDownProteomics/ProForma/ProFormaTag.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Byb0Zvcm1hL1Byb0Zvcm1hVGFnLmNz) | `100.00% <100.00%> (ø)` | | | [src/TopDownProteomics/ProForma/ProFormaTagGroup.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Byb0Zvcm1hL1Byb0Zvcm1hVGFnR3JvdXAuY3M=) | `100.00% <100.00%> (ø)` | | | [src/TopDownProteomics/Tools/MSNumpress.cs](https://codecov.io/gh/topdownproteomics/sdk/pull/105?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=topdownproteomics#diff-c3JjL1RvcERvd25Qcm90ZW9taWNzL1Rvb2xzL01TTnVtcHJlc3MuY3M=) | `80.25% <0.00%> (-0.98%)` | :arrow_down: |
rfellers commented 1 year ago

OK, I think we're getting closer. Main sticking point is the support of tags inside an ambiguity range or a tag directly after an ambiguity range. At present this supports both of those cases, but we are ahead of the spec and things could change. I lean towards leaning as is (given it's a little used feature) and if it's a breaking change in the future so be it. Thoughts?

acesnik commented 1 year ago

Sounds great! Nicely implemented!

acesnik commented 1 year ago

Feel free to merge.