shuzhao-li-lab / PythonCentricPipelineForMetabolomics

Python pipeline for metabolomics data preprocessing, QC, standardization and annotation

Pcpfm output - annotation_table.tsv doesn't include the feature's original m/z and adduct type assignment in its columns #76

Open gmhhope opened 3 days ago

gmhhope commented 3 days ago

annotation_table_shortened_version.xlsx

I found the following in the pcpfm output, using these commands:

```
pcpfm build_empCpds -i /Users/gongm/Documents/projects/Adriana_HEU/main/repo_102723/script4metabo/2-pcpfm-run/data/pfmRun0905/HILICneg -tm full -em pref_for_analysis --add_singletons true
pcpfm l4_annotate -i /Users/gongm/Documents/projects/Adriana_HEU/main/repo_102723/script4metabo/2-pcpfm-run/data/pfmRun0905/HILICneg -em pref_for_analysis -nm pref_HMDB_LMSD_annotated_for_analysis
pcpfm generate_output -i /Users/gongm/Documents/projects/Adriana_HEU/main/repo_102723/script4metabo/2-pcpfm-run/data/pfmRun0905/HILICneg -em pref_HMDB_LMSD_annotated_for_analysis -tm pref_for_analysis
pcpfm report -i /Users/gongm/Documents/projects/Adriana_HEU/main/repo_102723/script4metabo/2-pcpfm-run/data/pfmRun0905/HILICneg --color_by='["batch"]' --marker_by='["Sample Type"]'
```

The annotation_table.tsv doesn't have columns that indicate how each annotation was assigned, e.g., the original feature ID's m/z and, most importantly, the adduct form, or any linked information that would expose the inference from m/z or the khipu-assigned ID. This could be very confusing for people who don't know how to handle JSON and who are not familiar with how khipu works, especially since pref_HMDB_LMSD_annotated_for_analysis_empCpds.json is a very large JSON file. So, if there isn't one already, it would probably be better to provide an intermediate JSON file (or, better, a table) that gives people a hint at how the assignment was achieved. Even though this is L4 annotation, this information can be extremely important and useful for checking any metabolites of interest that jump out in the initial annotation.

```
.
├── annotation_table.tsv # This file
├── experiment.json
├── feature_table.tsv
├── pref_HMDB_LMSD_annotated_for_analysis_empCpds.json # This file is very large; an intermediate file (e.g., a tsv or a smaller json) would be better for a quick check.
├── pref_for_analysis_Feature_table.tsv
├── renamed
├── report.pdf
└── sample_annot_table.tsv
```
Screenshot 2024-10-16 at 12 19 22 AM
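The intermediate table being requested could, in principle, be derived from the empCpds JSON itself. A minimal sketch of such a flattener follows; the key names (`MS1_pseudo_Spectra`, `id_number`, `ion_relation`, `neutral_formula_mass`) are assumed from the khipu empCpd convention and may need adjusting for the actual pcpfm output:

```python
import csv
import json

def flatten_empcpds(json_path, tsv_path):
    """Flatten an empCpds JSON into a one-row-per-feature TSV.

    Assumes khipu-style entries: each empCpd has an MS1_pseudo_Spectra
    list of features carrying mz, rtime, and ion_relation (isotope/adduct).
    Adjust the key names below if your JSON uses different fields.
    """
    with open(json_path) as fh:
        empcpds = json.load(fh)
    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["empCpd_id", "feature_id", "mz", "rtime",
                         "ion_relation", "neutral_formula_mass"])
        for empcpd_id, empcpd in empcpds.items():
            for feature in empcpd.get("MS1_pseudo_Spectra", []):
                writer.writerow([
                    empcpd_id,
                    feature.get("id_number", ""),
                    feature.get("mz", ""),
                    feature.get("rtime", ""),
                    feature.get("ion_relation", ""),
                    empcpd.get("neutral_formula_mass", ""),
                ])
```

The resulting TSV keeps one row per feature, so the m/z-to-adduct link is visible at a glance without opening the large JSON.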
jmmitc06 commented 2 days ago

There are a few things to break down here.

First, we can add the isotopologue and adduct to the annotation table; that is not a problem and would be helpful. However, in the interest of preventing duplicate information, this should be added to the output feature table, NOT the annotation table; otherwise we would repeat the same entry many times over.

Second, the way in which the annotations are generated is recorded in the level column. Level 4 annotations are based on the inferred neutral mass of the empirical compound. I can see adding that value to the annotation table, but we can't add everything from the JSON to every table, and it may be confusing for annotations that don't use the inferred mass.

The final point about the intermediate format is unclear to me. What is this "quick check" you want to do, and what is the proposed format of this intermediate table? JSON is a nested structure, so it cannot be easily converted to a table; hence the three-table format of the output. Additionally, since the JSON is large because it contains a lot of information, the resulting table would be very large too. If you want to perform a "quick check" on the output, you should, IMO, write a tool that does that using the JSON file. You also have the report, which can summarize the annotations.
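As an illustration, a JSON-based quick-check tool of the kind suggested above could be a few lines of Python. This is a sketch, not pcpfm API: the key names (`MS1_pseudo_Spectra`, `id_number`, `mz`, `ion_relation`) are assumed from the khipu empCpd format and may differ in your file:

```python
import json

def features_near_mz(json_path, target_mz, ppm=10):
    """Scan an empCpds JSON for features within a ppm window of target_mz.

    Returns (empCpd_id, feature_id, mz, ion_relation) tuples, so the
    adduct/isotopologue assignment for a metabolite of interest can be
    checked without reading the whole JSON by eye.
    """
    with open(json_path) as fh:
        empcpds = json.load(fh)
    tol = target_mz * ppm / 1e6  # absolute m/z tolerance from ppm
    hits = []
    for empcpd_id, empcpd in empcpds.items():
        for feature in empcpd.get("MS1_pseudo_Spectra", []):
            mz = feature.get("mz")
            if mz is not None and abs(mz - target_mz) <= tol:
                hits.append((empcpd_id, feature.get("id_number"),
                             mz, feature.get("ion_relation")))
    return hits
```

For example, `features_near_mz("pref_HMDB_LMSD_annotated_for_analysis_empCpds.json", 179.0561)` would list every feature within 10 ppm of that m/z together with its empCpd and ion relation.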

gmhhope commented 1 day ago

The annotation table should be self-sufficient, providing all the necessary bridging information, including the original m/z, adduct type, and khipu ID. Without this, the table risks becoming confusing for biologists and chemists unfamiliar with the annotation process or the structure of the JSON file. These users should not need to dive into the JSON or complex procedures to understand how the assignments were made. Including this information directly in the table ensures that it remains informative and minimizes the risk of misunderstandings and of running into caveats.