rickhelmus / patRoon

Workflow solutions for mass-spectrometry based non-target analysis.
https://rickhelmus.github.io/patRoon/
GNU General Public License v3.0
65 stars 19 forks source link

MetFrag Score Calculation - How are they calculated in patRoon? #114

Open Theodoralee27 opened 2 months ago

Theodoralee27 commented 2 months ago

Dear Rick and Team,

I would like to ask for help regarding how the scores for each feature are calculated from the generateCompounds() function. My main objective is to classify the features identified using Schymanski's level of confidence (Levels 2a - 5). Currently the most relevant function that links both the scores calculated in patRoon and Schymanski's confidence levels is the annotateSuspects function - but I realised this is only relevant for suspect screening, while the data im interested in classifying is the non-targeted data.

Here is a preview of what my data looks like after I have use the generateCompounds() function using metfrag and the Comptox database. image

  1. My first question I would like to clarify is, is the score (in column D; highlighted) calculated in the same weighted format as this study Lai, A. et al. (2021) ‘Retrospective non-target analysis to support regulatory water monitoring: from masses of interest to recommendations via in silico workflows’, Environmental Sciences Europe. Springer Berlin Heidelberg, 33(1), pp. 1–21. doi: 10.1186/s12302-021-00475-1.

Here is a screen shot of their calculations: image

I tried very briefly to calculate using the data I have, the resulting scores are pretty similar. I would like to confirm is that the correct calculation method? If not, can you kindly refer me to any document that discusses the calculation of the resultant scores?

  1. My second question is a problem developed from the first - with the known max score of 9.0, I had two features in my dataset with a score of 13 and above. I was very confused which made me doubt the calculation methods for the scores, but when I went to investigate each parameter, I realised the value for 'DATA_SOURCES' for these two features were unusually high, around 40 - 42, while the rest of the other features only ranged about 1 - 8. Because of the way each parameter is normalised against the largest value, I was wondering if the system could recognise high outlier values like the 40 and 42, such that they are excluded from the analysis, otherwise all the other compounds scores will be very low for that parameter ("DATA_SOURCES") - again this is my speculation. I will upload one batch of data with that feature for your reference. Compounds_comptox_Npb6.csv

  2. Is there an established conversion between metFrag scores and Schymanski's confidence levels? Unfortunately I havent been able to find much supporting evidence for this. The closest would be the use of the IndividualMoNA score > 0.9 to match a Schymanski level 2a (same study Lai et al 2021). Another study suggested a range of MoNA scores for levels 3a and 3b in Talavera Andújar, B. et al. (2022) ‘Studying the Parkinson’s disease metabolome and exposome in biological samples through different analytical and cheminformatics approaches: a pilot study’, Analytical and Bioanalytical Chemistry. Springer Berlin Heidelberg, 414(25), pp. 7399–7419. doi: 10.1007/s00216-022-04207-z. image But all my values for the IndividualMoNA for all my features were all 0 - would this mean that all my features are of level 3c or lower confidence?

Thank you so much for your help in advance! I would really appreciate if you can share any details regarding the score calculation!

Sincerely, Theo

rickhelmus commented 2 months ago

Hi Theo,

Thanks for your interest and added details to your questions. I will try to answer everything in order, but please let me know if things need to be further clarified.

You are right the annotateSuspects() function is indeed used for ID level estimation, but currently limited to suspect screening workflows. However, it is relatively easy to use implement a similar approach. Also, the next major version of patRoon will also provide estimations for full NTA workflows. If you want to get a hint on how things currently work for suspects, please also see the default ID level configuration file.

Now to the numbered questions:

  1. The score column is indeed defined as the weighted sum of all other scores. This is exactly as what MetFrag outputs, but additionally may include the 'formula score' if you ran the addFormulaScoring() function. The scores are indeed normalized, unless MetFrag already did so (e.g. individualMoNAScore, see also the defaultExclNormScores() function). The weights in patRoon are by default 1.0, to change them see e.g. the last example here. Since the defaults are unlike those in the Table you cited, and you might have used addFormulaScoring(), the total score you see may indeed by higher.

  2. There is currently no outlier detection. But this would be quite an interesting idea. Perhaps an approach could be to implement a function that re-calculates the score after e.g. using delete() to manually remove some outliers. One complication, though, is that the score calculation from MetFrag is always from all candidates, while patRoon by default only loads the the top 100 per feature (configured by the topMost parameter). Playing with weights may of course also be an option for some specific scoring terms. If you suspect the value of dataSources is really off, it might als be good to double check the results from MetFrag directly, e.g. through the web interface, or have a look at CompTox online. If there is a difference, please let me know, as the values should be the same with patRoon.

  3. The way patRoon uses scores for ID level estimation you can find in the ID file that I linked before, which corresponds to the table you refer to. It doesn't use the MetFrag score directly (although you could change this if you want). The indivudualMoNA score of MetFrag represents the maximum spectral similarity of your experimental MS data and all the spectra for the candidate that are available in the MassBank of North America. So a score of 1.0 is a perfect fit. For zeros it simply means that the candidate is absent on MoNA (which unfortunately happens a lot!). For level 3c a similar approach is used but with silico data, i.e. the 'annotationSimilarity`: this is simply the spectrum similarity between your complete MS2 spectrum and your MS2 spectrum with only the annotated peaks.

    I hope this clarifies things a bit, but don't hesitate to ask if you want to know more!

Thanks, Rick

Theodoralee27 commented 2 months ago

Hi Rick,

Thank you so much for taking the time to explain about the scoring system, I truly appreciate your help!

  1. In my original workflow, I did not run the addFormulaScoring() function. So i'm still wondering why some of the scores exceeded 9.0. I will rerun this again to confirm, but maybe if i have to, I might resort to a manual calculation based on the weighted sum formula.

  2. When I was generating compound annotations using generateCompounds(), I used the Comptox database as I was more interested in this cluster of compounds, and was also interested to investigate more with the availability of Comptox APIs currently available.

compounds_Comptox_filtered_NPb5 <- generateCompounds(fGroups_filtered_NPb5, mslists_filtered_NPb5 , "metfrag", database = "comptox", adduct = "[M+H]+")

I'm wondering if this might have had played a role in the lack of detection against the MoNA database. I will attempt to try the workflow with the MoNA libraries and see how it goes!

  1. I was thinking of a possible workaround to the annotateSuspects() function being only available for the suspect screening workflow. I thought of extracting the resultant compound data from Comptox, set it up as a dataframe like a list of suspects, and run it through the annotateSuspects() function. I plan to follow the workflow as you had suggested in a previous issue here: https://github.com/rickhelmus/patRoon/issues/108#issuecomment-2089766775 Do you think this is a suitable work around, or are there any pitfalls you foresee using such a workflow?

List_comptox_NPb5 <- data.frame(name=c("DTXSID90279824","DTXSID40636327",...), formula = c("C31H45NO","C31H45NO",...)) Screen_comptox_NPb5 <- screenSuspects(fGroups_filtered_NPb5, List_comptox_NPb5, adduct = "[M+H]+", onlyHits = TRUE) checkFeatures(Screen_comptox_NPb5) Annotate_NPb5<- annotateSuspects(Screen_comptox_NPb5, MSPeakLists = mslists_filtered_NPb5, formulas = formulas_filtered_NPb5, compounds = compounds_Comptox_filtered_NPb5)

  1. Is it possible to generate annSimComp (annotation similarity scores) for a non-suspect workflow for the list of my derived compounds from Comptox? I've tried to search for the function, but could only find it as filter parameter for the function ScreenSuspects().

Thank you so much once again for your help! I'm looking forward to the newly updated version of patRoon too!

Sincerely, Theo

rickhelmus commented 2 months ago

Hi Theo,

Some quick answers back:

  1. As long as all weights are 1.0, the total score shouldn't be higher than the number of scoring terms. It would be great if you could verify this and if you see any issues please let me know!
  2. The MoNA database will always be used, irrespective of the used compound database (PubChem, CompTox etc), so this shouldn't matter.
  3. Yes, I guess using CompTox as a suspect list could indeed work :-) Expect a lot of false negatives though, but this is not unlike a regular full NTA workflow.
  4. Again, this will be part of patRoon 3.0. It is not very difficult to calculate it yourself, but you have to do some fiddling in R to get the right data. The function to calculate the similarities is here, and can be used by calling it like this (notice the 3 colons) patRoon:::annotatedMSMSSimilarity(annPL, specSimParams). The annPL parameter can be obtained by using annotatedPeakList() and specSimParams with getDefSpecSimParams(removePrecursor = TRUE).

Thanks, Rick

Theodoralee27 commented 2 months ago

Hi Rick,

Thank you so much for your responses! I really appreciate them!

I haven't gotten to try working out the scoring issue, but if I do, I will let you know.

I am starting to get concerned if my data files have been converted correctly in the first place. I retested the workflow using a methanol sample spiked with 500 ug/L of native standards (which I would consider is a very high concentration and should be detected by the QToF, with the exception of some standards that might not be compatible with the instrument method). I use an Agilent instrument, so my raw data files are in the .d extension, from which i used proteowizard (MSConvert) to convert following this https://github.com/rickhelmus/patRoon/issues/30#issuecomment-997967350 - Peak Picking > Vendor > MS Levels 1 - 2.

I ran through the entire workflow, using the native standards as my suspect screening list. From screenSuspects I was able to find 35/ 37 suspects. I realised that screenSuspects probably looks at MS1 data, based on what I had observed in the subsequent checkFeatures function. However, after running through the full workflow, the estimated ID Levels I had obtained were 4b - 5 which should not have been the case since they are actual standards, and had good peaks in the checkFeatures. Therefore, now I am wondering if i had made a mistake in the proteowizard conversion, such that the MS2 data in each file could not be read properly?

Do you have any advise on how i can ensure the file was converted to mzml correctly? I tried opening the mzml in notepad but I couldn't really find any clues there.. When I used plotSpectrum, i am able to see some evidence of the fragments being used. Metformin had a score 4b, while acetaminophen had a score of 5. Plotspectrum Metformin Plotspectrum Acetaminophen

I will attach a sample of the mzml file here. Here is the workflow I used:

anaInfo_MSMS_positive <- generateAnalysisInfo(paths = path_MSMS_positive,
                                   groups = c(rep("meoh-blanks",9), 
                                              rep("meoh-spiked ISTD",3), 
                                              rep("meoh-spiked NS",3)),
                                   blanks = "meoh-blanks")

fListOMS_MSMS_positive  <- findFeatures(anaInfo_MSMS_positive , "openms")
fGroupsOMS_MSMS_positive <- groupFeatures(fListOMS_MSMS_positive , "openms", rtalign=TRUE)

Native_std_rt <- data.frame(name=c("Metformin","Iomeprol",...),
                            formula = c("C4H11N5","C17H22I3N3O8 ",...),
                            rt = c("58.2","204",...))
Native_std_rt$rt <- as.numeric(Native_std_rt$rt)

fGroupsSusp_OMS_MSMS_positive <- screenSuspects(fGroupsOMS_MSMS_positive , Native_std_rt, adduct = "[M+H]+", onlyHits = TRUE) 
checkFeatures(fGroupsSusp_OMS_MSMS_positive) # 35 / 37 suspects found

avgPListParams <- getDefAvgPListParams(clusterMzWindow = 0.002)
mslists_MSMS_positive <- generateMSPeakLists(fGroupsOMS_MSMS_positive, "mzr", maxMSRtWindow = 5, precursorMzWindow = 4,
                               avgFeatParams = avgPListParams, avgFGroupParams = avgPListParams)

formulas_MSMS_positive <- generateFormulas(fGroupsOMS_MSMS_positive, mslists_MSMS_positive, "genform", relMzDev = 5, adduct = "[M+H]+", elements = "CHNOP", oc = FALSE, calculateFeatures = TRUE, featThresholdAnn = 0.75)

compounds_Comptox_MSMS_positive <- generateCompounds(fGroupsOMS_MSMS_positive, mslists_MSMS_positive , "metfrag", database = "comptox", adduct = "[M+H]+")

MSMS_positive_annotateSusp <- annotateSuspects(fGroupsSusp_OMS_MSMS_positive, MSPeakLists = mslists_MSMS_positive,
                                formulas =  formulas_MSMS_positive, compounds = compounds_Comptox_MSMS_positive,
                                IDFile = system.file("misc", "IDLevelRules.yml", package = "patRoon"))
as.data.frame(MSMS_positive_annotateSusp)

siMF <- screenInfo(MSMS_positive_annotateSusp)
as.data.frame(siMF) ## results estimated ID confidence: 4b - 5

I have tried again with the pubchemlite database to see if it makes a difference, but the results were the same. Also, i used "formula" instead of SMILES and InChI key for the list of suspects because that was the only way I could create a list for the internal standards (ISTD). Will changing to SMILES help in anyway?

Thank you so much once again!

Sincerely, Theo

rickhelmus commented 2 months ago

Hi Theo,

Thanks for the details. For now I will keep my response short, because:

Also, i used "formula" instead of SMILES and InChI key for the list of suspects because that was the only way I could create a list for the internal standards (ISTD). Will changing to SMILES help in anyway?

Yes, it does! :-) The SMILES (or InChI) are required, because otherwise there is no structural information available to link the suspects to the compound annotation results. For the internal standards, you can simply omit the SMILES and only enter the formula.

Please have a look at this, if you still face issues I can have a look at the mzML and code.

Thanks, Rick

Theodoralee27 commented 2 months ago

Hi Rick,

I changed the list to SMILES and it worked! I got a range of 2a - 3a now!

I will compile the SMILES for the comptox list derived compounds and try to do the same! Thank you so much for your help!

I will update again if I had figured out anything regarding the scores!

Thank you so much!

Sincerely, Theo