openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/
Other
279 stars 143 forks source link

Implemented progress bar in dataset download #1336

Closed megansu13 closed 5 months ago

megansu13 commented 5 months ago

Reference Issue

What does this PR implement/fix? Explain your changes.

Implemented progress bar using tqdm. Within datasets/dataset.py, I imported requests and tqdm, modified _download_data() and _parse_data_from_arff() methods, as well as added run_openml.py.

Here is what run_openml.py does:

  1. Fetching Dataset: It loads a dataset from OpenML with a specific ID (in your example, it's ID 1471, which is often the Iris dataset). The get_dataset function has the ability to download the dataset, its associated qualities, and feature metadata if not already present.
  2. Extracting Data: It retrieves the features X and the target variable y from the dataset. The target is specified by the dataset's default target attribute.
  3. Data Exploration: It prints out the first few entries of the feature set X using X.head(), gives a statistical summary of X with X.describe(), and shows the distribution of different classes/targets in y with y.value_counts().
  4. Data Splitting: It splits the data into training and testing sets with an 80-20 ratio using train_test_split. The random_state parameter ensures that the split is reproducible.
  5. Model Training: It initializes a Random Forest classifier, trains it on the training set, and then predicts on the testing set.
  6. Evaluation: It evaluates the predictions by computing the accuracy score, comparing the predicted target values y_pred against the true values y_test.
  7. Progress Bar: It demonstrates a simple progress bar using tqdm, which iterates through a range of 10, pausing for 0.5 seconds at each step.

How should this PR be tested?

This PR should be tested by running: cd openml python3 run_openml.py

Any other comments?

Let me know if there is anything else I should provide, I am happy to help!

PGijsbers commented 5 months ago

Hi! 👋 First off, thanks so much for taking the effort to contribute to OpenML! It's always a big step to make a first contribution, so we really appreciate it.

Unfortunately, it seems that our efforts got crossed a little bit. A progress bar for MinIO downloads has been added in https://github.com/openml/openml-python/pull/1335. However, this PR does also create a progress bar when downloading ARFF files, which is nice (at least for the larger arff files).

There are also a few things about the PR that would need changing regardless (I don't advise you take this step until we see how we handle the multiple PRs, but I wanted to include it for any future work):

To avoid this from happening again, if you want to contribute, the best way to go about this is to either comment on an open issue to let us know you are interested in it, or open a new issue indicating you want to set up a PR for this. That way, we can first check whether we are (still) interested in the feature, ensure that no one else has been working on it, and provide some pointers on how we think it should best be implemented.

I will go ahead and close this PR, as it would need major rewrites. This does not mean we don't appreciate the work and are not open to work in this direction! It is just for our own organisation, so we can see the state of PRs.

We would be open to having a progress bar for (large) arff file downloads to accompany #1335, but the changes should be much less invasive. I think a good place to start would be to look at this function and the functions it calls to actually download the file. If you are still interested in this, I would propose that you have a look at that, and maybe share your findings and make a suggestion in the related issue #1333 before working on a new PR. Looking forward to hearing from you :)