huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

Update python to 3.8 #571

Closed qubvel closed 3 months ago

qubvel commented 3 months ago

The latest versions of datasets and transformers require Python 3.8. Currently, both CI and the package itself still declare Python 3.7 as their requirement, which causes tests to fail (https://github.com/huggingface/evaluate/pull/569). I was able to reproduce the failing tests with the same dependencies that CI installs. It is probably worth updating the package's Python requirement to 3.8 too.

I have updated Python to 3.8 in CI to see whether it turns green.

qubvel commented 3 months ago

Now other tests are failing:
1) The confidence interval test, computed with scipy.stats.bootstrap: it passes for scipy <= 1.9.3 but fails on scipy >= 1.10.0. Probably the test is not reliable enough.
2) The distributed metric test fails. It looks similar to https://github.com/huggingface/evaluate/issues/540 and https://github.com/huggingface/evaluate/issues/542. The test passes with filelock <= 3.11.0.

I would appreciate any thoughts @lhoestq

lhoestq commented 3 months ago

Cool, thanks for updating :) Maybe we should also update setup.py to require python >= 3.8. The scipy test can surely be fixed using numpy.testing.assert_almost_equal or something like that, if you want to fix it in this PR.

I'm a bit less sure about the filelock issue, we can leave that for later.
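
The setup.py change suggested above could look roughly like the sketch below. `python_requires` is a real setuptools keyword, but the exact specifier string, the `MIN_PYTHON` constant, and the `is_supported` helper are assumptions made for illustration and are not taken from the repository.

```python
import sys

# Hypothetical value for the `python_requires` keyword of setuptools.setup(),
# e.g. setup(..., python_requires=PYTHON_REQUIRES). The exact string used in
# the repository is an assumption.
PYTHON_REQUIRES = ">=3.8.0"

# Hypothetical constant mirroring the same floor for a runtime check.
MIN_PYTHON = (3, 8)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    print("supported" if is_supported() else "unsupported")
```

With `python_requires` set, pip refuses to install the package on older interpreters instead of letting it fail later at import or test time.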

qubvel commented 3 months ago

Regarding scipy, adding a tolerance would not be enough: the confidence interval currently expected in the tests is (0.3333, 0.6666), but after the scipy update we get (0.3355, 1.0). We can either update the expected range in the test and pin scipy to >= 1.10.0, or just pin scipy to <= 1.9.3.
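
To make the point concrete, here is a small sketch comparing the two intervals quoted above. It uses `math.isclose` from the standard library as a stand-in for a numpy.testing-style tolerance check, and the tolerance value `ABS_TOL` is a hypothetical choice, not something from the test suite.

```python
import math

# Interval bounds quoted in this thread.
expected_ci = (0.3333, 0.6666)  # value currently expected by the test
observed_ci = (0.3355, 1.0)     # value reported after the scipy update

# Hypothetical absolute tolerance, chosen only for illustration.
ABS_TOL = 0.01


def within_tolerance(expected, observed, abs_tol):
    """Return True if every bound matches within an absolute tolerance."""
    return all(
        math.isclose(e, o, rel_tol=0.0, abs_tol=abs_tol)
        for e, o in zip(expected, observed)
    )


# Compare each bound separately to see which one breaks the check.
lower_ok = math.isclose(expected_ci[0], observed_ci[0], rel_tol=0.0, abs_tol=ABS_TOL)
upper_ok = math.isclose(expected_ci[1], observed_ci[1], rel_tol=0.0, abs_tol=ABS_TOL)

print(lower_ok, upper_ok, within_tolerance(expected_ci, observed_ci, ABS_TOL))
```

The lower bounds differ by about 0.002, which a loose tolerance absorbs, but the upper bounds differ by about 0.33, so any tolerance wide enough to pass would make the assertion meaningless.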

lhoestq commented 3 months ago

Ah yes, indeed. We should align with the latest scipy.

qubvel commented 3 months ago

OK, I fixed the scipy version; the only failing test now is the one for distributed metrics. Do you have any ideas on how to fix it without downgrading filelock, or should we add a version constraint? @lhoestq, maybe you can tag someone else who can help?