glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

ML ready data - Wiki documentation #1211

Open ReneRanzinger opened 7 months ago

ReneRanzinger commented 7 months ago

Document the datasets (#1210) on our wiki page. Information on this page should be entry-level. For the how-to section, just have a short summary and step-by-step instructions on the linked main page.

For each dataset, list some key information that might be important for people: source, number of data points, feature selection, what one can do with the dataset, and references to data.glygen, papers, etc. This should serve as a template that we can use for further datasets. I also suggest having 2-4 sentences for each dataset under this headline and creating a separate page for it. There we could use an info box to summarize key information. You can talk with @ReneRanzinger in case of questions.
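One way to picture the template described above is as a small record with one field per item of key information. This is a minimal sketch only: the class name `DatasetInfo`, its field names, the `render_info_box` helper, and all placeholder values are assumptions made for illustration. They do not reflect an existing GlyGen wiki template or a real dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetInfo:
    """Hypothetical record holding the key facts listed for one dataset."""
    name: str
    source: str
    n_data_points: int
    features: List[str] = field(default_factory=list)
    use_cases: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


def render_info_box(info: DatasetInfo) -> str:
    """Render the record as plain-text 'key: value' lines for an info box."""
    rows = [
        ("Dataset", info.name),
        ("Source", info.source),
        ("Data points", str(info.n_data_points)),
        ("Features", ", ".join(info.features) or "n/a"),
        ("Use cases", ", ".join(info.use_cases) or "n/a"),
        ("References", ", ".join(info.references) or "n/a"),
    ]
    return "\n".join(f"{key}: {value}" for key, value in rows)


# Placeholder values only; this is not a real dataset.
example = DatasetInfo(
    name="Example dataset",
    source="placeholder source",
    n_data_points=100,
    features=["feature_a", "feature_b"],
    use_cases=["classification"],
)
print(render_info_box(example))
```

Keeping the fields in one record means every dataset page is filled in from the same checklist, and a missing item shows up as "n/a" rather than being silently omitted.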

ubhuiyan commented 3 months ago

This is what I currently have for the ML-ready datasets: https://wiki.glygen.org/ML-Ready_Datasets. Any additional feedback would be appreciated.

ReneRanzinger commented 3 months ago

Open for discussion, but I think a wiki page per dataset would be good. There are several issues with just listing the datasets and linking to the data page (example: https://data.glygen.org/GLY_001045)

ubhuiyan commented 3 months ago

I've updated the wiki so that the links now lead to separate wiki pages that detail the selected datasets. Please let me know what you think. We can also discuss during the general meeting.

I don't have all the answers to the following questions, but here are my thoughts on some of them:

ReneRanzinger commented 3 months ago

@ubhuiyan Maybe we can talk about this after the developer meeting tomorrow. The main point I tried to make with the questions is that I have trouble understanding the dataset and what some columns mean. If I have trouble, others will too, and the documentation on the dataset page is not helping with this. I am certainly not asking you to "fix" Karina's dataset, but moving forward (and for your own dataset) I would like to establish a "sufficient" level of documentation (wiki, README, data page ...) so that people can actually use the datasets.
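One possible way to make a "sufficient" level of column documentation checkable is to compare a dataset's header against a dictionary of column descriptions. The sketch below is an assumption-laden illustration: the helper `undocumented_columns`, the sample CSV, and the column names are all invented here and are not taken from any GlyGen dataset.

```python
import csv
import io
from typing import Dict, List


def undocumented_columns(csv_text: str, descriptions: Dict[str, str]) -> List[str]:
    """Return header columns that have no non-empty description."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [col for col in header if not descriptions.get(col, "").strip()]


# Placeholder data: column names and descriptions are invented for illustration.
sample_csv = "sample_id,feature_a,label\n1,0.5,yes\n"
column_docs = {
    "sample_id": "Unique identifier of the row",
    "label": "Target class to predict",
}
print(undocumented_columns(sample_csv, column_docs))
```

Here `feature_a` is reported because it has no entry in `column_docs`. A check like this could be run before a dataset page is published.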

Another question: do we have models? I talked with Karina about this as well. If we have models, we could share them as well (data or HuggingFace). It would also be good to have step-by-step instructions on how to build and train a model using the data (wiki or Python notebook).

ubhuiyan commented 3 months ago

Ah I see your point. Yes, I am happy to establish a protocol for documentation. I'll also keep this in mind as I continue to update the individual wiki pages for the datasets. I can stay after the developer meeting to discuss this.

I do have access to Karina's models. In the past, I have made tutorials in Python notebooks for models I created. That could be the most efficient way to go about it.

ubhuiyan commented 2 months ago

Update: Will discuss this further in a developer meeting. I will create a SP folder that houses all ML-ready datasets that are pending Raja's approval. We can discuss how to integrate once a protocol has been set in place.