shuzhao-li-lab / JMS

Json's Metabolite Services
MIT License

Cannot Search Authentic Standards #19

Closed jmmitc06 closed 1 year ago

jmmitc06 commented 1 year ago

I'm trying to annotate an Asari preferred feature table using the HMDB and the RP pos authentic standards. Both the HMDB and RP pos authentic standards are included in the package so I assumed that they should "just work".

The HMDB successfully generates annotations:

    KCD = knownCompoundDatabase()
    list_compounds = json.load(open(path_to_hmdb_json))
    KCD.mass_index_list_compounds(list_compounds)
    KCD.build_emp_cpds_index()
    EED.extend_empCpd_annotation(KCD)
    EED.annotate_singletons(KCD)

Replacing path_to_hmdb_json with the path to the authentic standards yields an exception because the format of the standard's JSON does not match the HMDB:

```
Traceback (most recent call last):
  File "/Users/mitchjo/Projects/PythonCentricPipelineForMetabolomics/./src/main.py", line 229, in <module>
    main(args)
  File "/Users/mitchjo/Projects/PythonCentricPipelineForMetabolomics/./src/main.py", line 224, in main
    experiment.feature_table.annotate(annotation_databases, auth_std_path)
  File "/Users/mitchjo/Projects/PythonCentricPipelineForMetabolomics/src/FeatureTable.py", line 398, in annotate
    KCD.mass_index_list_compounds(list_compounds)
  File "/opt/homebrew/lib/python3.11/site-packages/jms/dbStructures.py", line 90, in mass_index_list_compounds
    k = cpd['neutral_formula']+ '_' + str(round(float(cpd['neutral_formula_mass']),6)) # ensuring unique formula and mass
TypeError: string indices must be integers, not 'str'
```

This is corrected by tweaking the mass_index_list_compounds call:

```
KCD.mass_index_list_compounds(list_compounds["list_of_Empirical_Compounds"])
```

This now runs without raising an exception but generates no annotations. I confirmed this by dumping dict_empCpds after annotating with the authentic standards alone and looking for the "list_matches" key: it never occurs, implying that no annotation occurred.
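The check described above can be sketched as a small helper. This is only an illustration: `count_annotated` is a hypothetical name, it assumes `dict_empCpds` is a plain dict mapping IDs to empirical-compound dicts, and the example records are made up rather than taken from a real run.

```python
def count_annotated(dict_empCpds):
    """Return (annotated, total) for a dict of empirical compounds.

    An empirical compound counts as annotated when it carries a
    non-empty "list_matches" entry.
    """
    annotated = sum(
        1 for emp_cpd in dict_empCpds.values()
        if emp_cpd.get("list_matches")  # missing or empty means no annotation
    )
    return annotated, len(dict_empCpds)


# Made-up example data; the contents of "list_matches" are placeholders.
example_empCpds = {
    "E1": {"neutral_formula_mass": 180.063388, "list_matches": ["placeholder match"]},
    "E2": {"neutral_formula_mass": 132.053492},
}

annotated, total = count_annotated(example_empCpds)
print(f"{annotated} of {total} empirical compounds have list_matches")
```

A result of zero annotated compounds after searching the standards alone is the symptom reported here.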

I did notice that the formatting of JSON for the authentic standards does not match the format of the HMDB compounds. I hypothesize that this could be the source of the issue but I do not know. 
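For illustration, the top-level difference between the two files can be handled before indexing. The sketch below assumes only what the traceback and workaround show: the HMDB JSON is a list of compound dicts, while the standards JSON is a dict holding its records under "list_of_Empirical_Compounds". Iterating over that dict yields its string keys, which is why `cpd['...']` raised a TypeError. The function names are hypothetical and not part of JMS, and unwrapping the list does not make the records themselves match the HMDB schema.

```python
import json


def extract_compound_list(data, key="list_of_Empirical_Compounds"):
    """Return the list of records from parsed JSON in either layout."""
    if isinstance(data, list):
        return data  # HMDB-style: already a list of compound dicts
    if isinstance(data, dict) and key in data:
        return data[key]  # standards-style: list nested under a key
    raise ValueError(f"expected a list or a dict containing {key!r}")


def load_compound_list(path):
    """Load a JSON file and unwrap it with extract_compound_list."""
    with open(path) as handle:
        return extract_compound_list(json.load(handle))


def records_missing_field(records, field="neutral_formula_mass"):
    """Return indices of records that lack a field the indexer reads."""
    return [i for i, rec in enumerate(records) if field not in rec]
```

Running `records_missing_field` on the unwrapped standards would show quickly whether the per-record fields differ as well, which is the hypothesis raised above.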

Please advise: how do I use the authentic standards JSON to generate annotations for a feature table?
jmmitc06 commented 1 year ago

Could this be related to #4? That issue was opened about a year ago and is not closed, so I assume that the standards' JSON is improperly formatted.

shuzhao-li commented 1 year ago

Sounds like you need to reformat the authentic standard library. When MG opened an issue, it's usually a reminder to himself that he never got to :D

gmhhope commented 1 year ago

> Sounds like you need to reformat the authentic standard library. When MG opened an issue, it's usually a reminder to himself that he never got to :D

Yes, I haven't gotten there yet. The authentic standards library was formatted as a list of empirical compounds, so it does not work the way HMDB does, where the query peak list is searched against a list of compounds. Beyond compiling the list of authentic standards, I haven't tested using it for annotation.

I have shared my repo documenting my compiling of the authentic standards. We have exchanged ideas and I think Joshua has some good ideas on how to handle it now.

jmmitc06 commented 1 year ago

As we discussed on the phone, the method for searching a database such as the HMDB could/should be made generic enough that it can search the authentic standards as well. This requires modifying the JSON representation of the known compounds from databases to store retention times and ion relations from authentic standards (as the auth std's will be recorded as a specific ion relation(s) of the authentic compound). The distinction between known compounds from databases and authentic standards can be handled in the logic or a wrapper function to the underlying search function.
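A minimal sketch of that idea, assuming each record already carries the expected m/z of one specific ion plus an optional retention time. Every name here (`match_features`, `search_database`, `search_authentic_standards`, and the `id`/`mz`/`rtime` fields) is hypothetical and not the JMS API. The point is only that one search function with an optional retention-time filter can serve both cases, with thin wrappers carrying the distinction.

```python
def match_features(features, compounds, mz_ppm=5.0, rt_tolerance=None):
    """Pair features with compounds whose m/z (and optionally rtime) agree.

    rt_tolerance=None disables the retention-time filter entirely.
    """
    matches = []
    for feature in features:
        for cpd in compounds:
            ppm_error = abs(feature["mz"] - cpd["mz"]) / cpd["mz"] * 1e6
            if ppm_error > mz_ppm:
                continue
            if rt_tolerance is not None:
                cpd_rt = cpd.get("rtime")
                # A record without a retention time cannot pass an RT filter.
                if cpd_rt is None or abs(feature["rtime"] - cpd_rt) > rt_tolerance:
                    continue
            matches.append((feature["id"], cpd["id"]))
    return matches


def search_database(features, compounds, mz_ppm=5.0):
    """Database search: m/z only."""
    return match_features(features, compounds, mz_ppm=mz_ppm)


def search_authentic_standards(features, standards, mz_ppm=5.0, rt_tolerance=10.0):
    """Standards search: m/z plus retention time (seconds assumed)."""
    return match_features(features, standards, mz_ppm=mz_ppm, rt_tolerance=rt_tolerance)


# Made-up values for illustration only.
features = [
    {"id": "F1", "mz": 181.0709, "rtime": 35.0},
    {"id": "F2", "mz": 181.0709, "rtime": 120.0},
]
standards = [{"id": "S1", "mz": 181.0707, "rtime": 37.0}]

print(search_database(features, standards))
print(search_authentic_standards(features, standards))
```

With these values the m/z-only search pairs both features with S1, while the retention-time filter keeps only F1.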

A simple version of this would be easy to implement and we could convert the existing std library to the necessary format easily for a quick analysis.

However, it does occur to me that a slightly more complicated data structure will provide future compatibility. Currently, the auth_std library for each method is a separate file despite the fact that the majority of the information for the standards is invariant with respect to the method. If we stored the ion relations expected for the standard in that chromatography method and the retention times using dictionaries with the methods as keys, we can trivially handle new methods and combine all the libraries into a single file. For example, if we ever change the flow rate or switch to a different kind of RP column, we would have to make a whole new file and the potential of using the wrong file for an analysis increases. It may be more work but I think it would pay off in the end.
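One possible shape for such a record, as a rough sketch: the method-invariant fields are stored once, and retention times and ion relations live in dictionaries keyed by method. The field names, the method keys ("RP_pos", "HILIC_pos"), and the retention-time values are all assumptions made up for illustration, not an agreed format.

```python
# One standard, stored once for all chromatography methods.
standard = {
    "name": "example standard",  # hypothetical entry
    "neutral_formula": "C6H12O6",
    "neutral_formula_mass": 180.063388,
    # Method-dependent information, keyed by method name (values made up).
    "retention_times": {"RP_pos": 35.2, "HILIC_pos": 210.7},
    "ion_relations": {"RP_pos": ["M+H[1+]"], "HILIC_pos": ["M+H[1+]", "M+Na[1+]"]},
}


def standards_for_method(library, method):
    """Flatten a combined library into per-method records.

    Standards with no retention time recorded for the method are skipped.
    """
    selected = []
    for std in library:
        if method not in std["retention_times"]:
            continue
        selected.append({
            "name": std["name"],
            "neutral_formula": std["neutral_formula"],
            "neutral_formula_mass": std["neutral_formula_mass"],
            "rtime": std["retention_times"][method],
            "ion_relations": std["ion_relations"].get(method, []),
        })
    return selected


rp_pos_view = standards_for_method([standard], "RP_pos")
print(rp_pos_view)
```

Adding a new column or flow rate then means adding one more key to the two dictionaries rather than creating a new file.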

@gmhhope I can spearhead making these changes with the understanding that I'll first focus on getting something that works and then we can meet to decide on a future-compatible design.

shuzhao-li commented 1 year ago

Authentic standard libraries are always platform dependent, different from HMDB etc. It's fine to keep them as a separate format. You can use asari.tools.match_features.list_match_lcms_features for this.

jmmitc06 commented 1 year ago

Now that I better understand the code base and the other related projects, I'm closing this issue. Although the format of the authentic standards was not usable by JMS, it's a moot point, since JMS cannot search using retention time. That is now implemented in the pipeline and needs to be back-ported into JMS. Authentic standards should probably be stored in a format identical to, or similar to, reference metabolite database sources.