ML ready data - Wiki documentation

ReneRanzinger commented 7 months ago

Document the datasets (#1210) in our wiki page. Information on this page should be entry-level. For the how-to section, just have a short summary, with step-by-step instructions on the linked main page.

For each dataset, list some key information that might be important for people: source, number of data points, feature selection, what one can do with the dataset, and references to data.glygen, papers, etc. This should serve as a template that we can use for further datasets. I also suggest having 2-4 sentences for each dataset under this headline and creating a separate page. There we could use an info box to summarize key information. You can talk with @ReneRanzinger in case of questions.
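As a rough sketch of what such a template could look like, the key fields listed above can be captured in a small record that renders a plain-text info box. The class name `DatasetInfo`, its field names, and the box layout are all hypothetical and only meant as a starting point for discussion; the values in the usage example are placeholders, not real dataset facts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetInfo:
    """Hypothetical info-box record; field names are illustrative only."""

    name: str
    source: str
    n_data_points: int
    feature_selection: str
    use_cases: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)

    def to_info_box(self) -> str:
        """Render the record as a plain-text info box, one field per line."""
        rows = [
            ("Dataset", self.name),
            ("Source", self.source),
            ("Data points", str(self.n_data_points)),
            ("Feature selection", self.feature_selection),
            ("Use cases", "; ".join(self.use_cases) or "n/a"),
            ("References", "; ".join(self.references) or "n/a"),
        ]
        # Pad labels to a common width so the values line up.
        width = max(len(label) for label, _ in rows)
        return "\n".join(f"{label:<{width}} : {value}" for label, value in rows)


if __name__ == "__main__":
    # Placeholder values only; replace with the documented facts per dataset.
    example = DatasetInfo(
        name="<dataset name>",
        source="<where the data came from>",
        n_data_points=0,
        feature_selection="<how features were chosen>",
    )
    print(example.to_info_box())
```

Keeping the fields in one structure would make it easier to check that every dataset page answers the same questions.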

ubhuiyan commented 3 months ago

This is what I currently have for the ML-ready datasets: https://wiki.glygen.org/ML-Ready_Datasets. Any additional feedback would be appreciated.

ReneRanzinger commented 3 months ago

Open for discussion, but I think a wiki page per dataset would be good. There are several issues with just listing the datasets and linking to the data page (example: https://data.glygen.org/GLY_001045):

GitHub link = dead link (404)
What does the glycopeak number mean?
Why do some of the GP* have two GlyTouCan IDs?
What do the cell values of the GP columns mean?
There are supposed to be 51 columns, but they are not in the dataset.
What is "age at baseline"?
If any transformations were done on the data, they are not visible or documented in the dataset.

ubhuiyan commented 3 months ago

I've updated the wiki to now have the links redirect to a separate wiki page that details the datasets selected. Please let me know what you think. We can also discuss during the general meeting.

I don't have answers to all of your questions, but here are my thoughts on some of them:

The GitHub link does appear to be dead. Karina did provide her datasets and code in the PredictMod GitHub repo. If possible, we could direct users to that page instead. Otherwise, we can make this information available in the GlyGen repo.
I'm not sure if this answers your exact question, but I believe Karina is referencing this figure when mentioning peaks https://link.springer.com/article/10.1007/s00125-017-4426-9/figures/1
I'm not sure why two GlyTouCan IDs are listed. I can do some more digging.
The cell values are abundances (1-100) of the plasma N-glycome.
I counted the columns and I found there to be 51.
I believe this is the age during the first clinical visit before any individual develops Type II diabetes.
Karina provided all of her scripts, which detail the transformations she conducted, in the PredictMod GitHub repo. I believe we plan to make this repo public if it isn't already. That information should be available here: https://github.com/GW-HIVE/PredictMod/tree/main/flask_backend/models/ccRCC_glycoproteomic_v1
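Points like the column count and the 1-100 abundance range could be checked by anyone with a quick script. Below is a minimal sketch, assuming the dataset is a CSV file with a header row and that the abundance columns share a header prefix such as "GP". The function name `summarize_table`, the prefix default, and the tiny sample table are all made up for illustration and are not taken from the actual dataset.

```python
import csv
import io


def summarize_table(csv_text: str, prefix: str = "GP") -> dict:
    """Count columns and report the numeric range of columns whose header
    starts with `prefix`. Raises ValueError on non-numeric cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    selected = [c for c in columns if c.startswith(prefix)]

    values = []
    for row in reader:
        for col in selected:
            cell = (row[col] or "").strip()
            if cell:  # skip empty cells rather than treating them as zero
                values.append(float(cell))

    return {
        "n_columns": len(columns),
        "n_prefixed_columns": len(selected),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }


if __name__ == "__main__":
    # Synthetic three-column table, for illustration only.
    sample = "id,GP1,GP2\na,1.5,20\nb,,99.5\n"
    print(summarize_table(sample))
```

Running the same check on the real file would show whether the header count matches the documented 51 columns and whether all GP values fall in the stated range.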

ReneRanzinger commented 3 months ago

@ubhuiyan Maybe we can talk about this after the developer meeting tomorrow. The main point I tried to make with the questions is that I have trouble understanding the dataset and what some columns mean. If I have trouble, others will, too. And the documentation on the dataset page is not helping with this. I am certainly not asking you to "fix" Karina's dataset, but moving forward (and for your own dataset) I would like to establish a "sufficient" level of documentation (wiki, README, data page ...) so that people can actually use the datasets.

Another question: do we have models? I talked with Karina about this as well. If we have models, we could share them as well (data or Hugging Face). It would also be good to have step-by-step instructions on how to build and train a model using the data (wiki or Python notebook).

ubhuiyan commented 3 months ago

Ah I see your point. Yes, I am happy to establish a protocol for documentation. I'll also keep this in mind as I continue to update the individual wiki pages for the datasets. I can stay after the developer meeting to discuss this.

I do have access to Karina's models. In the past, I have made tutorials in Python notebooks for models I've created. That could be the most efficient way to go about it.

ubhuiyan commented 2 months ago

Update: Will discuss this further in a developer meeting. I will create a SP folder that houses all ML-ready datasets that are pending Raja's approval. We can discuss how to integrate once a protocol has been set in place.

glygener / glygen-issues

ML ready data - Wiki documentation #1211