Duplicated spectra in spectrum-centric dataframe

veitveit commented 5 months ago

This happens at least for the example dataset.

michabirklbauer commented 3 months ago

So this seems to be an issue related to fragannot: the original result returned by fragannot contains several entries for every spectrum (sometimes identical, sometimes different). That is why we get these duplicates, which by the way also appear in the fragment-centric dataframe.

michabirklbauer commented 3 months ago

This is the original result from fragannot result.json

michabirklbauer commented 3 months ago

I can add a filter that keeps track of everything that was already added to the dataframes to avoid duplicates, but maybe we should check if everything is right with fragannot (e.g. why we get these duplicate results)?
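A minimal sketch of such a filter, assuming each fragannot entry can be treated as a JSON-serializable dict. The function name `deduplicate_entries` and the keys `spectrum_id`, `sequence` and `fragments` are hypothetical and only serve the illustration; they are not taken from fragannot or internal_ions.

```python
import json


def deduplicate_entries(entries, key_fields=None):
    """Keep the first occurrence of each distinct entry, preserving order.

    If key_fields is None, the whole entry is compared; otherwise only the
    listed (hypothetical) fields decide whether two entries are duplicates.
    """
    seen = set()
    unique = []
    for entry in entries:
        if key_fields is None:
            # Serialize with sorted keys so dict ordering does not matter.
            key = json.dumps(entry, sort_keys=True)
        else:
            key = tuple(
                json.dumps(entry.get(field), sort_keys=True)
                for field in key_fields
            )
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Made-up entries: the first two are identical, the third differs in one fragment.
entries = [
    {"spectrum_id": 1082, "sequence": "PEPTIDE", "fragments": ["b2", "y3"]},
    {"spectrum_id": 1082, "sequence": "PEPTIDE", "fragments": ["b2", "y3"]},
    {"spectrum_id": 1082, "sequence": "PEPTIDE", "fragments": ["b2", "y4"]},
]
unique = deduplicate_entries(entries)
```

Comparing whole entries only removes exact repeats. Entries for the same spectrum that differ in their annotations would still be kept, so this hides the symptom without explaining why fragannot returns them.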

levitsky commented 1 month ago

I think the source of the duplication can be the fact that the identification file contains multiple identifications per spectrum. For example, the SpectrumIdentificationResult for spectrum 1082 (in the screenshot) contains 28 SpectrumIdentificationItem elements corresponding to different peptidoforms of the same primary sequence. These peptidoforms produce partially overlapping annotations.

Note that, as far as I can see in the code, the idea was to only read the top-ranked identification. In this case, however, some, but not all, of them have the same score, and six identifications are marked as rank 1, so they are all processed.
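One possible way to end up with a single identification per spectrum even when several items share rank 1 is sketched below. It assumes the SpectrumIdentificationItem elements have already been parsed into dicts with hypothetical `rank` and `score` keys and that a higher score is better; neither assumption is confirmed by the fragannot code, and whether ties should be collapsed at all is a separate decision.

```python
def pick_top_identification(items):
    """Return one item: lowest rank first, then highest score.

    Remaining ties are broken by file order, because min() returns the
    first minimal element it encounters.
    """
    if not items:
        return None
    return min(items, key=lambda item: (item["rank"], -item["score"]))


# Made-up items for one spectrum: three share rank 1, two of those share a score.
items = [
    {"peptidoform": "A", "rank": 1, "score": 50.0},
    {"peptidoform": "B", "rank": 1, "score": 62.5},
    {"peptidoform": "C", "rank": 1, "score": 62.5},
    {"peptidoform": "D", "rank": 2, "score": 70.0},
]
best = pick_top_identification(items)
```

Here "B" is selected: it has rank 1 and the highest score among the rank 1 items, and it precedes "C" in the list. Breaking ties by file order is arbitrary, so keeping all tied peptidoforms and merging their annotations may be the better choice in some cases.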

michabirklbauer / internal_ions

Duplicated spectra in spectrum-centric dataframe #48