ipb-halle / MetFragRelaunched

Relaunch of the initial MetFrag project.
http://ipb-halle.github.io/MetFrag/

Differences between results from the web app and the command line tool #12

Open KelseyChetnik opened 6 years ago

KelseyChetnik commented 6 years ago

I’ve noticed that the results I get using the command line tool aren’t the same as the results I get using the web app. These are the first 10 results from the web app and the command line, respectively: MetFrag_CL_Web_Compare.xlsx

I’ve highlighted a few corresponding compounds so you can see how each of the tools is ranking them. Could someone explain to me:

- Why are these compounds being ranked differently?
- Why isn’t the command line tool providing a value for the IUPAC names?
- For the compound with identifier 4142084 and the compound with identifier 4996475, why are the molecular formulas different between the web app and the command line?

Additional information: I’m also including the parameter file I used on the command line. This file was downloaded directly from the web app – I did not alter it in any way. MetFragWeb_Parameters.txt

The command line tool version I’m using is: MetFrag2.4.2-CL.jar

This is the output from the command line tool:

metfragcl_output

These are screenshots of how the web app was set / the values I entered:

metfragwebapp1 metfragwebapp2
c-ruttkies commented 6 years ago

Hi, thanks for the comment. Could you send me a feedback message with the data directly from the web tool via the feedback button on the right side? The MS2 peak list is necessary to debug your issue.

Let me try to answer your questions:

-Why these compounds are being ranked differently? A: To answer that I would need the peak list.

-Why isn’t the command line tool providing a value for the IUPAC Names? A: The web version uses a local mirror of PubChem where the IUPAC names are available. I will also add the query for IUPAC names to the online PubChem version, which is used in the CL version. The property should then be included.

-For the compound with identifier 4142084 and the compound with identifier 4996475, why are the molecular formulas different between the web app and the command line? A: Yes, this indeed shouldn't be the case. I will fix that asap.

I will provide an updated version once I have fixed the issues and will let you know. As I mentioned, it would be nice if you could provide the complete data with the feedback function from the web interface.

Thank you.

Best regards, Christoph

schymane commented 6 years ago

Regarding the molecular formulas, both those compounds you pointed out are charged in PubChem:
https://pubchem.ncbi.nlm.nih.gov/compound/4996475#section=Top
https://pubchem.ncbi.nlm.nih.gov/compound/4142084#section=Top

The formulas are C7H8N3O3- and C7H8N3O3+. Christoph will have to confirm, but one formula may come from the internal processing and one from the PubChem mirror. However, take care with these entries, as they are likely to be found as different adducts to what you have searched. From the files you sent it looks like the Web App is presenting you the "neutral" formula, i.e. adding an H to one and subtracting an H from the other. The Command Line has the formula as in the InChI, which is not charge adjusted as the charge comes towards the end of the string.
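The charge adjustment described here can be sketched in a few lines of Python. This is only an illustration, not MetFrag's actual code: `neutral_formula` is a hypothetical helper, and it assumes a singly charged species written with a trailing `+` or `-` and a formula that contains hydrogen.

```python
import re


def neutral_formula(formula: str) -> str:
    """Return a charge-adjusted ("neutral") formula for a singly charged species.

    Assumption: the charge is written as a trailing '+' or '-'.
    An anion gains one H, a cation loses one H.
    """
    match = re.fullmatch(r"(.+?)([+-])", formula)
    if match is None:
        return formula  # no charge suffix, nothing to adjust
    body, sign = match.groups()
    delta = 1 if sign == "-" else -1

    def adjust(m: re.Match) -> str:
        # A missing count after the element symbol means one atom.
        count = int(m.group(1) or 1) + delta
        if count == 0:
            return ""
        return "H" if count == 1 else f"H{count}"

    # Match hydrogen only; the lookahead skips symbols such as He or Hg.
    new_body, replaced = re.subn(r"H(?![a-z])(\d*)", adjust, body, count=1)
    if replaced == 0:
        raise ValueError("formula without hydrogen is not handled in this sketch")
    return new_body


print(neutral_formula("C7H8N3O3-"))  # C7H9N3O3
print(neutral_formula("C7H8N3O3+"))  # C7H7N3O3
```

The two charged formulas above end up as different neutral formulas, which is consistent with the web app and the command line showing different values for the same identifier.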

We are working on ways to handle charged species better in the future and we hope to have a first release of this available in the next few months.

KelseyChetnik commented 6 years ago

Thank you for your responses.

Regarding the ranking issue, here is the peak list that goes along with the parameters file: MetFragWeb_Peaklist.txt

I also went back to the web interface and submitted the complete data with the feedback function. Did you receive this?

schymane commented 6 years ago

Do you have the same number of candidates in each of the results files? Since this scoring is relative, this could be one reason for differences in the results. However, 429460 holds a clue … the number of peaks explained is 2 in the command line and 3 in the web app, so the actual fragmentation results are different – at least for this one example. Christoph will have to help out there, as I’ve not worked with v2.4 yet (he should have received the feedback data; I did not).
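One way to spot such per-candidate differences is to compare the two result tables by identifier. The following is a rough Python sketch: `diff_candidates` is a hypothetical helper, the column name `NoExplPeaks` is an assumption about the export format, and the second candidate is invented purely for illustration.

```python
def diff_candidates(web, cli, fields):
    """List (identifier, field, web_value, cli_value) for shared candidates that differ."""
    differences = []
    for identifier in sorted(web.keys() & cli.keys()):
        for field in fields:
            web_value = web[identifier].get(field)
            cli_value = cli[identifier].get(field)
            if web_value != cli_value:
                differences.append((identifier, field, web_value, cli_value))
    return differences


# Peak counts for 429460 are taken from this thread; "hypothetical-1" is made up.
web_results = {"429460": {"NoExplPeaks": 3}, "hypothetical-1": {"NoExplPeaks": 2}}
cli_results = {"429460": {"NoExplPeaks": 2}, "hypothetical-1": {"NoExplPeaks": 2}}

for row in diff_candidates(web_results, cli_results, ["NoExplPeaks"]):
    print(row)
```

Candidates that appear in only one of the two tables are ignored here; they can be listed separately with a set difference on the identifiers.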

KelseyChetnik commented 6 years ago

No, they are different. You can see them in the screenshots. For the web app, the number of candidates was 2516 and the number of results processed was 354. For the command line tool, the number of candidates was 2910 and the number of results processed was 366.

schymane commented 6 years ago

The score for each category (you have 4 marked) that is summarized in the FinalScore is relative to the max and min of the values for each category. For the SuspectListScore this is always 0 (not present) or 1 (present). There don’t appear to be any of those substances in the MS library by InChIKey first block, so it seems the ExactSpectralSimilarity is always 0. So the ordering of your candidates is determined by adding SuspectListScore, FragmenterScore and SpectralSimilarity, each scaled between 0 and 1 over all the candidates. SpectralSimilarity is the MetFusion-type scoring – this changes with candidate numbers … in the Web App results in your excel this is the column “SpectralSimilarity” and in the Command Line output the “OfflineMetFusionScore” – these values vary slightly for your highlighted candidates and, because they are quite small, scaling between 0 and 1 will likely enhance the differences. The biggest difference in the Top 10 appears to be the difference in your fragmentation results in this case (which Christoph will have to trace for you).

You will see on the web app that the three non-zero scores for the top candidate are all scaled to 1 (almost) to give the final score of 2.9891. Thus, the SpectralSimilarity score of 0.3479 has been scaled to (close to) 1, the SuspectList score is 1 and the MetFrag Fragmentation score is also 1; the raw value (FragmenterScore) of 73.282 is printed out into the results. On the web app you can also click on the individual score per candidate to see the raw values (maybe this will help you reconcile the results file and what you see a bit more easily, as it is in the right order). We export the raw values so people can re-scale the results if they wish (this may come in handy e.g. when some substances have an extremely high number of references – but you have not clicked this as an option here).
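To make the relative scoring concrete, here is a small Python sketch. It assumes plain min-max scaling per category followed by an unweighted sum; MetFrag's actual normalization and weighting may differ. The function names are hypothetical, and candidates B and C are invented. Only the raw values for candidate A come from this thread.

```python
def min_max_scale(values):
    """Scale values to the range 0..1 relative to their min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a category with no spread contributes 0 for everyone.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def final_scores(candidates, categories):
    """Sum the scaled category scores for each candidate."""
    names = list(candidates)
    totals = {name: 0.0 for name in names}
    for category in categories:
        scaled = min_max_scale([candidates[name][category] for name in names])
        for name, value in zip(names, scaled):
            totals[name] += value
    return totals


CATEGORIES = ["FragmenterScore", "SpectralSimilarity", "SuspectListScore"]

candidates = {
    "A": {"FragmenterScore": 73.282, "SpectralSimilarity": 0.3479, "SuspectListScore": 1},
    "B": {"FragmenterScore": 50.0, "SpectralSimilarity": 0.30, "SuspectListScore": 0},
    "C": {"FragmenterScore": 20.0, "SpectralSimilarity": 0.25, "SuspectListScore": 0},
}

print(final_scores(candidates, CATEGORIES))

# Dropping candidate C changes the scaled scores of the remaining candidates.
fewer = {name: candidates[name] for name in ("A", "B")}
print(final_scores(fewer, CATEGORIES))
```

Because every category is scaled over the whole candidate list, the same raw values give different final scores when the list changes, which is why differing candidate numbers between the web app and the command line can shift the ranking.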

Would it be useful to have both the scaled and the raw scores in the output? This adds more columns (there are already a lot of columns), but may help you reproduce the final score (i.e. what is in column “score”). I personally also find it useful to look at the number of fragments explained, in addition to the fragmenter score – some substances (not in this case) can have a lower raw score but more fragments explained, since the fragmenter score is a combination of the number of fragments and the energy to break the bonds. I hope that helps explain the scoring a little – let me know if you have any more questions…

c-ruttkies commented 6 years ago

Sorry for the delay. I have added a new version of the command line tool at https://msbi.ipb-halle.de/~cruttkie/metfrag/MetFrag2.4.3-CL.jar

Now, IUPAC names are included when running a PubChem query. I also updated the MetFragWeb tool at https://msbi.ipb-halle.de/MetFragBeta. This corrects the molecular formulas.

Concerning the ranking, first of all thanks @schymane for the comments. You are absolutely right: the differences in the Top 10 rankings are caused by the FragmenterScore, which differs for CID 429460 between the web and the command line tool (55.971 versus 29.176). This has now been fixed as well with the updated command line version. The web version and the command line tool are now running the same version.

The only thing that differs now is the PubChem database. The command line version uses the current online PubChem mirror whereas the web tool is connected to a local PubChem mirror from mid 2017, which explains the different number of candidates.

Hope this helps.

KelseyChetnik commented 6 years ago

I'm not able to open the link to the new command line tool download. It says the server isn't responding.

c-ruttkies commented 6 years ago

Sorry, we got a new firewall in our institute causing several problems. Could you have another try?

schymane commented 6 years ago

Just worked for me now! Thanks

KelseyChetnik commented 6 years ago

The link is working now, but I'm getting this error when trying to run it on the command line: no main manifest attribute, in MetFrag2.4.3-CL.jar

c-ruttkies commented 6 years ago

Ah sry, please try again. Now it should be working.