lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Multiple hits of rank 1 #46

Closed lgatto closed 1 year ago

lgatto commented 1 year ago

I might be missing something here, but I was surprised to see two hits of rank 1 for the same scan:

| scannr | peptide | proteins | rank | expmass | calcmass | charge | hyperscore |
| --- | --- | --- | --- | --- | --- | --- | --- |
| controllerType=0 controllerNumber=1 scan=28494 | [+304.207]-ELDALDANDELTPLGR | sp\|Q08211\|DHX9_HUMAN | 1 | 2045.078 | 2045.060 | 3 | 42.98145 |
| controllerType=0 controllerNumber=1 scan=28494 | [+304.207]-AIVAIENPADVSVISSR | tr\|A0A8I5KQE6\|A0A8I5KQE6_HUMAN;sp\|P08865\|RSSA_HUMAN | 1 | 2045.063 | 2044.149 | 3 | 51.81654 |

Any idea? Thanks in advance.

lgatto commented 1 year ago

Just to add that chimera was set to true for that run.

lazear commented 1 year ago

Good catch; the rank column is a late addition and this wasn't considered. As the code is currently written, this is the expected behavior when chimera is set to true - but perhaps not the correct behavior.

Note to self: easy fix by manually mutating rank here: https://github.com/lazear/sage/blob/0ca2d5d45aeb4748c786d0decbf5a906ccbb3621/crates/sage/src/scoring.rs#L457

lgatto commented 1 year ago

I think documentation would go a long way here.

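If I understand correctly:

* Different identifications from the same set of matched fragments would be ranked based on their score.

* Chimeric identifications from different sets of fragments would get ranked independently and will thus have the same ranks.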

What would happen to the second matches if chimera was set to false? Would that spectrum not be identified at all; or if multiple PSMs were returned, they would simply be ranked together (as in case 1 above)?

lgatto commented 1 year ago

It would actually also be helpful to easily spot chimeric spectra. If my understanding above is correct, I should be able to identify those as PSMs that both have rank 1 for the same scan. To help, it would be useful to have a 'chimera' column that identifies these candidates.
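The check described here (flagging scans that carry more than one rank-1 PSM) can be sketched in a few lines of Python. This is only an illustration: it assumes a tab-separated results table with the `scannr` and `rank` columns shown in the report above, and that all rows come from a single raw file. The function name `candidate_chimeric_scans` and the third sample row are hypothetical.

```python
import csv
import io
from collections import Counter

# The first two rows are copied from the report above (reduced to four columns).
# The third row (scan=28495) is made up so that there is one non-chimeric scan.
# The tab-separated layout is an assumption about the results file.
RESULTS_TSV = (
    "scannr\tpeptide\trank\thyperscore\n"
    "controllerType=0 controllerNumber=1 scan=28494\t[+304.207]-ELDALDANDELTPLGR\t1\t42.98145\n"
    "controllerType=0 controllerNumber=1 scan=28494\t[+304.207]-AIVAIENPADVSVISSR\t1\t51.81654\n"
    "controllerType=0 controllerNumber=1 scan=28495\tPEPTIDEK\t1\t30.0\n"
)


def candidate_chimeric_scans(rows):
    """Return the scan identifiers that carry more than one rank-1 PSM."""
    rank1_counts = Counter(row["scannr"] for row in rows if int(row["rank"]) == 1)
    return sorted(scan for scan, count in rank1_counts.items() if count > 1)


rows = list(csv.DictReader(io.StringIO(RESULTS_TSV), delimiter="\t"))
chimeric = candidate_chimeric_scans(rows)
print(chimeric)
```

If several raw files are searched together, the scan identifier alone is probably not unique, so the count would need to be keyed on the file as well as the scan.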

lazear commented 1 year ago

> I think documentation would go a long way here.
>
> If I understand correctly:
>
> * Different identifications from the same set of matched fragments would be ranked based on their score.
>
> * Chimeric identifications from different sets of fragments would get ranked independently and will thus have the same ranks.
>
> What would happen to the second matches if chimera was set to false? Would that spectrum not be identified at all; or if multiple PSMs were returned, they would simply be ranked together (as in case 1 above)?

This is correct - there are only two scenarios where multiple IDs are returned for the same spectrum:

  1. report_psms = N where N > 1, in which case we report back 1..N IDs against the same set of peaks, with ranks 1..N
  2. chimera = true, in which case we currently report both IDs as rank 1 (given that they are derived from different sets of peaks: the second/chimeric peptide is identified against the parent spectrum with the matching peaks removed)

If chimera = true and report_psms > 1, then report_psms is ignored, and chimeric search proceeds as normal.
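The two scenarios can be illustrated with a toy model. This is a sketch under stated assumptions, not Sage's scoring code: peaks and fragments are plain integers, the score is simply the number of shared peaks, and the names `search`, `report`, and the `PEP_*` candidates are all hypothetical. It also assumes the first peptide is excluded from the second (chimeric) pass.

```python
def search(peaks, candidates, top_n=1):
    """Score each candidate by how many of its fragments occur in `peaks`
    and return the top_n best hits as (peptide, score, matched_peaks)."""
    scored = []
    for peptide, fragments in candidates.items():
        matched = peaks & fragments
        if matched:
            scored.append((peptide, len(matched), matched))
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:top_n]


def report(peaks, candidates, report_psms=1, chimera=False):
    """Return (peptide, rank) pairs for the two scenarios described above."""
    if chimera:
        # report_psms is ignored in this mode.
        first = search(peaks, candidates, top_n=1)
        if not first:
            return []
        peptide, _, matched = first[0]
        results = [(peptide, 1)]
        # Second pass: the same spectrum with the matched peaks removed.
        remaining = {p: f for p, f in candidates.items() if p != peptide}
        second = search(peaks - matched, remaining, top_n=1)
        # The chimeric hit comes from a different set of peaks, so it is rank 1 too.
        results += [(pep, 1) for pep, _, _ in second]
        return results
    # All hits are scored against the same set of peaks and ranked together.
    hits = search(peaks, candidates, top_n=report_psms)
    return [(pep, rank) for rank, (pep, _, _) in enumerate(hits, start=1)]


# Made-up spectrum and candidate fragment sets.
peaks = {1, 2, 3, 4, 5, 6, 7, 8}
candidates = {
    "PEP_A": {1, 2, 3, 4},
    "PEP_B": {5, 6, 7},
    "PEP_C": {1, 2, 9},
}

print(report(peaks, candidates, report_psms=2))
print(report(peaks, candidates, chimera=True))
print(report(peaks, candidates, report_psms=3, chimera=True))
```

In this toy example the same two peptides are returned in both modes, but they get ranks 1 and 2 when ranked against the same peaks and rank 1 twice in chimeric mode.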

I am open to adding a chimera column (set to true for all second IDs?), but I wonder if it would be more useful to just report rank = 2 for the second/chimeric peptide.

lgatto commented 1 year ago

I think that a chimeric scan and multiple PSMs for a scan are conceptually different. You enforce reporting a single PSM when setting chimera, but in theory (unlikely?) there could be multiple ranks for the different sets of fragments. I had missed (or forgotten) that report_psms was ignored when searching for chimeric scans, so maybe, if the documentation is clear on that, the current situation is clear enough:

I think it is important to be able to discriminate these two from the results themselves, without needing the configuration file (which might be missing when reporting data in a paper). If you were to change the rank in the first case, it would be more difficult to distinguish these.

lazear commented 1 year ago

Added DOCS.md that explains the current situation.