New countIdentifications() function

lgatto commented 2 years ago

Hi @sgibb and @jorainer

This draft PR add a new function, nSequences() that takes a proteomics Spectra object with identification results (as a sequence spectra variable). It then counts, for each scan (MS1 and MS2) the number of identification that results from that scan. This is obvious and uninteresting for MS2 scans, as these are either 0 or one (depending whether sequence is NA or not). But for MS1 scan, the function could how many identifications where found for in the descendant MS2 scans. This can in turn be used to generate a plot as shown below

In addition to your review, I have some questions/comments:

Ok to add this function in it's own file nSequences.R.
I'm not convinced by the name of the function.
The function is quite slow. How could it be improved? I suspect that the current implementation of .filterSpectraHierarchy() is slow for such a use-case, where it is called for all MS1 scans. I can think of a vectorised and/or a C-level implementation. What do you think?

lgatto commented 2 years ago

Yes, let's generalise the name - I'm happy to hear this would also be useful for metabolomics data. But I'm not convinced by annotation. In proteomics, you could have an annotation that doesn't equate to a peptide sequence (i.e. the identification) - for example the presence of an ammonium ion.

What about using the term identification rather than sequence, which is just as good or a match in proteomics and seems to also fit metabolomics.

If you are happy with this, I would rename the function countIdentifications().