huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0
696 stars 78 forks source link

Remove dependency on torch and torchaudio #1281

Closed albertvillanova closed 1 year ago

albertvillanova commented 1 year ago

As suggested by @severo:

should we remove the dependency to torch and torchaudio? cc @polinaeterna

albertvillanova commented 1 year ago

I have already removed the dependency on torchaudio, once all the Audio feature functionalities are backed by soundfile.

In relation with torch, are we sure we want to stop supporting dataset loading scripts that use torch. If so, what about the ones using tensorflow?

Maybe @lhoestq can give better insight.

severo commented 1 year ago

I was asking because, if I remember correctly, we didn't depend on torch (because we had no dataset depending on torch) before working with audio cells.

But I wouldn't want to rely on my poor memory. At one point it would be good to store statistics about the dependencies found in the scripts to be able to take decisions. Not prioritary.

albertvillanova commented 1 year ago

I agree with you that we should track that information.

Honestly, I don't know how many loading scripts on Hub datasets depend on torch (or tensorflow)...

lhoestq commented 1 year ago

we might want to keep torchaudio after all, e.g. for https://huggingface.co/datasets/cdminix/libritts-aligned/blob/main/libritts-aligned.py

severo commented 1 year ago

should we revert https://github.com/huggingface/datasets-server/pull/1282?

albertvillanova commented 1 year ago

I think the relevant question is how many loading scripts use torchaudio. I don't think we should support all packages for all scripts, but we should do sensible choices based on impact and widespread use.

Also please note that even if reverting #1282, the viewer will not work for https://huggingface.co/datasets/cdminix/libritts-aligned

lhoestq commented 1 year ago

Ah I see, maybe not worth it for now then

albertvillanova commented 1 year ago

There are still 2 questions to be answered: https://github.com/huggingface/datasets-server/issues/1281#issuecomment-1571875857

In relation with torch, are we sure we want to stop supporting dataset loading scripts that use torch?

If so, what about the ones using tensorflow?

lhoestq commented 1 year ago

let's keep them for now no ?

severo commented 1 year ago

OK. Let's keep the current situation and do statistics about DatasetModuleNotInstalledError regularly.

For example, for now:

> db.cachedResponsesBlue.aggregate([
  { $match: {error_code: "DatasetModuleNotInstalledError", "details.copied_from_artifact": {"$exists": false} } },
  { $set: { match: { $regexFind: { input: "$details.cause_message", regex: /[\'`]pip install (.+)[\'`]/ } } } },
  { $set: { library: { $arrayElemAt: ['$match.captures', 0] } } },
  { $group: { _id: "$library", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
{ _id: 'ir-datasets', count: 434 }
{ _id: 'bioc', count: 11 }
{ _id: 'pytorch_ie', count: 11 }
{ _id: null, count: 10 }
{ _id: 'Bio', count: 6 }
{ _id: 'mwparserfromhell', count: 6 }
{ _id: 'more_itertools', count: 6 }
{ _id: 'pickle5', count: 4 }
{ _id: 'ir_datasets', count: 3 }
{ _id: 'torchaudio', count: 3 }
{ _id: 'xarray', count: 3 }
{ _id: 'spacy', count: 2 }
{ _id: 'apache_beam', count: 2 }
{ _id: 'alignments phones torchaudio', count: 2 }
{ _id: 'ffmpeg', count: 2 }
{ _id: 'boto3 botocore', count: 2 }
{ _id: 'cv2', count: 2 }
{ _id: 'evaluate', count: 1 }
{ _id: 'snowflake', count: 1 }
{ _id: 'stempeg', count: 1 }
{ _id: 'textgrids', count: 1 }
{ _id: 'pyconll', count: 1 }
{ _id: 'faiss sentence_transformers', count: 1 }
{ _id: 'ndjson', count: 1 }
{ _id: 'webdataset', count: 1 }
{ _id: 'detectron2', count: 1 }
{ _id: 'syntok', count: 1 }
{ _id: 'ijson', count: 1 }
{ _id: 'shapely', count: 1 }
{ _id: 'wikidata', count: 1 }
{ _id: 'torch, onnxruntime', count: 1 }
{ _id: 'nibabel boto3', count: 1 }
{ _id: 'sympy', count: 1 }
{ _id: '"bigbench @ https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz" # noqa: this is also required by bigbench.api.util',
  count: 1 }
{ _id: 'pycocotools', count: 1 }
{ _id: 'lance', count: 1 }
{ _id: 'datasets,', count: 1 }
{ _id: 'function_parser git tree_sitter', count: 1 }
{ _id: 'ijson more_itertools', count: 1 }
{ _id: 'dataset', count: 1 }
{ _id: 'imgaug imageio', count: 1 }
{ _id: 'warcio selectolax', count: 1 }
{ _id: 's3fs xarray', count: 1 }
{ _id: 'xarray zarr gcsfs', count: 1 }
{ _id: 'sacrebleu', count: 1 }
{ _id: 'synapseclient', count: 1 }
{ _id: 'xarray s3fs', count: 1 }
{ _id: 'opustools', count: 1 }
{ _id: 'xarray netCDF4', count: 1 }
{ _id: 'wfdb', count: 1 }
{ _id: 'cassis', count: 1 }
{ _id: '#BToken is ERC20', count: 1 }

The most important ones to add could be ir-datasets, bioc and pytorch_ie.

Note that null means that the regexp did not work:

> db.cachedResponsesBlue.aggregate([
  { $match: {error_code: "DatasetModuleNotInstalledError", "details.copied_from_artifact": {"$exists": false} } },
  { $set: { match: { $regexFind: { input: "$details.cause_message", regex: /[\'`]pip install (.+)[\'`]/ } } } },
  { $set: { library: { $arrayElemAt: ['$match.captures', 0] } } },
  { $match: { library: null } },
  { $project: { "details.cause_message": 1, _id: 0 } },
])
{ details: { cause_message: 'No module named \'datasets_modules.datasets.piEsposito--br-quad-2\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v2\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_pt_dev_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_pt_train_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_zh_dev_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--argsme_1\'' } }
{ details: { cause_message: 'cannot import name \'saveopen\' from \'safetensors\' (/src/services/worker/.venv/lib/python3.9/site-packages/safetensors/__init__.py)' } }
severo commented 1 year ago

Who wants to try to add the most required dependencies (ir-datasets, bioc and pytorch_ie)?

albertvillanova commented 1 year ago

I can take care of this.

albertvillanova commented 1 year ago

On the other hand, I am curious how people use ir-datasets and integrate it with datasets... :thinking:

severo commented 1 year ago

I'm opening a dedicate issue for the new dependencies, and closing this issue.

severo commented 1 year ago

I created https://github.com/huggingface/datasets-server/issues/1442, and assigned it to you @albertvillanova. Thanks!

On the other hand, I am curious how people use ir-datasets and integrate it with datasets... 🤔

Please share with us what you find