uci-ml-repo / ucimlrepo

Python package for dataset imports from UCI ML Repository
MIT License
216 stars, 90 forks

When will all datasets be available? #7

Closed: jpgard closed this issue 7 months ago

jpgard commented 10 months ago

Hi, thanks for your awesome work on the package. It's an invaluable resource for the community!

It looks like only 53 of the ~600 datasets in the repository are currently available through the package. Why is that, and is there a timeline for adding the remaining datasets? (Also, if you have an issue tracker for this or need help, I would be happy to pitch in if the team is seeking community contributions.)

ptruong0 commented 9 months ago

Hi, making datasets available in the Python package requires some manual work. For the package to work, each dataset needs to be accessible in a standardized format: a single, tabular CSV file. I don't think all datasets will eventually be available, since some cannot fit this format (e.g., datasets with multiple files). However, we are adding more datasets over time, in order of popularity (most to least viewed).

ptruong0 commented 9 months ago

Currently, dataset donors can also submit a request to make their dataset available in Python. I was thinking of opening this request process to all users. Would you be interested in that? That way, if researchers or students need to use a dataset via the package on short notice, we can respond more quickly.

jpgard commented 9 months ago

Got it, that makes sense (and became apparent once I thought about this issue for a little while).

It does seem like opening up a process for all users to contribute would be useful.

Also, is the process limited to .csv files? (Some datasets store their data in a single file that is not necessarily a CSV, with extensions like .data or .arff.)

Thanks again for your work on this great tool.

ptruong0 commented 9 months ago

The process works for any data format that can be converted to CSV, which typically includes extensions like .data and .arff.
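
For readers wondering what such a conversion might look like, here is a minimal sketch using only the Python standard library. It is not the maintainers' actual pipeline, which is not described in this thread. It assumes a headerless, comma-separated .data file in which `?` marks a missing value; the function name `data_to_csv`, the column names, and the sample rows are all made up for illustration.

```python
import csv
import io


def data_to_csv(data_text, column_names, missing_marker="?"):
    """Convert headerless, comma-separated .data text to CSV text with a header row.

    Hypothetical helper: assumes one record per line and a single marker
    string (default "?") for missing values.
    """
    reader = csv.reader(io.StringIO(data_text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(column_names)
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(row)}: {row}"
            )
        # Write missing values as empty cells so CSV readers treat them as NA.
        cleaned = ["" if v.strip() == missing_marker else v.strip() for v in row]
        writer.writerow(cleaned)
    return out.getvalue()


# Made-up sample rows, not taken from any real dataset.
sample = "1.5,2.0,yes\n3.1,?,no\n\n"
columns = ["feature_a", "feature_b", "label"]
csv_text = data_to_csv(sample, columns)
print(csv_text)
```

The column names have to come from somewhere else (for example, a dataset's accompanying description file), which is probably part of the manual work mentioned above. A format like .arff carries its own attribute declarations, so it would need a different reader.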