openml / meta

Repository for issues which are not for any one specific repository (e.g., governance, data models)

Look at data upload/curation software from students #14

Open PGijsbers opened 1 month ago

PGijsbers commented 1 month ago

Have a look at the dashboards the students made, and see what aspects we would like to keep - and if the code is salvageable.

SubhadityaMukherjee commented 1 week ago
  1. https://github.com/IwkooO/Dataset-Uploader-OpenML TL;DR: I do think it's worth looking at, especially the UI and perhaps the type extraction.

Summary: a better interface split into multiple steps, a dataset viewer, automatic feature type extraction using the OpenAI API, a feature editor, and this prompt: "You are the creator of a dataset. You want to upload the dataset to an online repository. You are requested to provide a dataset description. Knowing the column names and their sample values you will write a concise and informative description within 250 words limit without use only ASCII standard characters."

I could test almost nothing except the UI, because the entire codebase depends on the OpenAI API to run, and that now seems to require paid credit. (I could modify it if that is of interest.)
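One way such a modification might look is to pass the model call in as a function, so everything else runs without an API key. This is a minimal sketch, not the students' code: `build_prompt`, `describe_dataset`, the `generate` callable, and the instruction text are all hypothetical names chosen for illustration.

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical instruction text; the students' tool ships its own prompt.
INSTRUCTION = "Write a concise dataset description from these columns and sample values."


def build_prompt(samples: Mapping[str, Sequence[object]], max_values: int = 3) -> str:
    """Build a prompt listing each column with a few sample values."""
    lines = [INSTRUCTION, ""]
    for column, values in samples.items():
        shown = ", ".join(str(v) for v in values[:max_values])
        lines.append(f"- {column}: {shown}")
    return "\n".join(lines)


def describe_dataset(
    samples: Mapping[str, Sequence[object]],
    generate: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a description; fall back to a plain summary if no model is configured."""
    if generate is None:
        # No LLM backend available: the rest of the tool stays usable.
        return "Dataset with columns: " + ", ".join(samples)
    return generate(build_prompt(samples))
```

With this shape, the UI and feature editor could be tested with `generate=None` or a stub, and a real client would only be wired in where a key is available.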

My opinion: the UI looks a lot more user-friendly than what we have now. Automatic feature type extraction is based on a different approach (from a previous OpenML paper?) and that seems fine (sorting). The code needs a fair amount of work, and I am not certain about the OpenAI part.

SubhadityaMukherjee commented 1 week ago
  1. https://github.com/Sanderror/OpenML_Data_Cleaner TL;DR: nice as a separate tool. [image] Summary: it performs the actions listed in the image. I believe the cryptic attribute name action uses the OpenAI API, and I face the same problem as with the previous tool: it needs paid credit.

My opinion: I think the tool by itself is a very nice idea. It does take a very long time to run, though, and I am not entirely sure it is a good idea to integrate it with OpenML. It might be nice as a separate data processing library. As for the code, a lot needs to be done to make it maintainable, and I am unsure how to speed it up without digging very deep. Something useful would be the feature type check, but perhaps the previous tool is more user-friendly in doing that.
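For reference, a feature type check does not have to depend on an external API. The following is a rough heuristic sketch, not the approach used in either student project: `infer_feature_type`, the `MISSING` markers, the type labels, and the `max_categories` threshold are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Assumed markers for missing values; a real tool would make this configurable.
MISSING = {"", "?", "na", "nan", "null"}


def _is_number(text: str) -> bool:
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_feature_type(values: Sequence[Optional[str]], max_categories: int = 10) -> str:
    """Guess a column type from raw string values, ignoring missing entries."""
    present = [
        v.strip() for v in values
        if v is not None and v.strip().lower() not in MISSING
    ]
    if not present:
        return "unknown"
    if all(_is_number(v) for v in present):
        return "numeric"
    # Few distinct non-numeric values: treat the column as categorical.
    if len(set(present)) <= max_categories:
        return "nominal"
    return "string"
```

A check along these lines could compare the inferred type with the type the uploader declared and flag mismatches for review instead of changing them silently.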