Closed megansu13 closed 5 months ago
Hi! 👋 First off, thanks so much for taking the effort to contribute to OpenML! It's always a big step to make a first contribution, so we really appreciate it.
Unfortunately, it seems that our efforts got crossed a little bit. A progress bar for MinIO downloads has been added in https://github.com/openml/openml-python/pull/1335. However, this PR does also create a progress bar when downloading ARFF files, which is nice (at least for the larger arff files).
There are also a few things about the PR that would need changing regardless (I don't advise you take this step until we see how we handle the multiple PRs, but I wanted to include it for any future work):
_download_data
function has been changed. self.data_file
and self.parquet_file
should be given a value in this function that correspond to the paths on disk. It is likely other parts of the API break if this is not done. _get_dataset_arff
and _get_dataset_parquet
don't just download the files (like your new function to replace them does), they also take care of caching and calculating checksums to verify the downloads were successful. These functionalities are crucial for our users, so we cannot simply replace them by this new function.To avoid this from happening again, if you want to contribute, the best way to go about this is to either comment on an open issue to let us know you are interested in it, or open a new issue indicating you want to set up a PR for this. That way, we can first check whether we are (still) interested in the feature, ensure that no one else has been working on it, and provide some pointers on how we think it should best be implemented.
I will go ahead and close this PR, as it would need major rewrites. This does not mean we don't appreciate the work and are not open to work in this direction! It is just for our own organisation, so we can see the state of PRs.
We would be open to having a progress bar for (large) arff file downloads to accompany #1335, but the changes should be much less invasive. I think a good place to start would be to look at this function and the functions it calls to actually download the file. If you are still interested in this, I would propose that you have a look at that, and maybe share your findings and make a suggestion in the related issue #1333 before working on a new PR. Looking forward to hearing from you :)
Reference Issue
What does this PR implement/fix? Explain your changes.
Implemented progress bar using tqdm. Within datasets/dataset.py, I imported requests and tqdm, modified _download_data() and _parse_data_from_arff() methods, as well as added run_openml.py.
Here is what run_openml.py does:
How should this PR be tested?
This PR should be tested by running: cd openml python3 run_openml.py
Any other comments?
Let me know if there is anything else I should provide, I am happy to help!