I have already removed the dependency on torchaudio, now that all the Audio feature functionalities are backed by soundfile.
In relation to torch, are we sure we want to stop supporting dataset loading scripts that use torch? If so, what about the ones using tensorflow?
Maybe @lhoestq can give better insight.
I was asking because, if I remember correctly, we didn't depend on torch (because no dataset depended on torch) before working with audio cells.
But I wouldn't want to rely on my poor memory. At some point it would be good to store statistics about the dependencies found in the scripts, to be able to make informed decisions. Not a priority.
I agree with you that we should track that information.
Honestly, I don't know how many loading scripts on Hub datasets depend on torch (or tensorflow)...
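As a starting point for such statistics, here is a minimal sketch (not part of datasets-server; it assumes huggingface_hub is installed, that script-backed datasets ship a loading script named after the repo, and Python >= 3.10) that scans Hub loading scripts and counts their top-level imports:

import ast
import collections
import sys

from huggingface_hub import HfApi, hf_hub_download

def top_level_imports(source: str) -> set[str]:
    # Return the top-level module names imported by a Python source file.
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules

counter = collections.Counter()
for info in HfApi().list_datasets(limit=1000):
    script_name = info.id.split("/")[-1] + ".py"  # assumed filename convention
    try:
        path = hf_hub_download(repo_id=info.id, filename=script_name, repo_type="dataset")
        with open(path, encoding="utf-8") as f:
            imports = top_level_imports(f.read())
    except Exception:
        continue  # no loading script in this repo, or it does not parse
    counter.update(imports - set(sys.stdlib_module_names))  # drop stdlib modules

print(counter.most_common(20))

The counts would only be approximate (imports inside functions or guarded by try/except are still counted), but they would be enough to see which packages matter most.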
we might want to keep torchaudio after all, e.g. for https://huggingface.co/datasets/cdminix/libritts-aligned/blob/main/libritts-aligned.py
should we revert https://github.com/huggingface/datasets-server/pull/1282?
I think the relevant question is how many loading scripts use torchaudio. I don't think we should support all packages for all scripts, but we should make sensible choices based on impact and widespread use.
Also, please note that even if we revert #1282, the viewer will not work for https://huggingface.co/datasets/cdminix/libritts-aligned, because it also requires alignments and phones:
Error code: DatasetModuleNotInstalledError
Exception: ImportError
Message: To be able to use cdminix/libritts-aligned, you need to install the following dependencies: alignments, phones.
Ah I see, maybe not worth it for now then
There are still 2 questions to be answered: https://github.com/huggingface/datasets-server/issues/1281#issuecomment-1571875857
In relation to torch, are we sure we want to stop supporting dataset loading scripts that use torch? If so, what about the ones using tensorflow?
Let's keep them for now, no?
OK. Let's keep the current situation and compute statistics about DatasetModuleNotInstalledError regularly.
For example, for now:
> db.cachedResponsesBlue.aggregate([
  { $match: {error_code: "DatasetModuleNotInstalledError", "details.copied_from_artifact": {"$exists": false} } },
  { $set: { match: { $regexFind: { input: "$details.cause_message", regex: /[\'`]pip install (.+)[\'`]/ } } } },
  { $set: { library: { $arrayElemAt: ['$match.captures', 0] } } },
  { $group: { _id: "$library", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
{ _id: 'ir-datasets', count: 434 }
{ _id: 'bioc', count: 11 }
{ _id: 'pytorch_ie', count: 11 }
{ _id: null, count: 10 }
{ _id: 'Bio', count: 6 }
{ _id: 'mwparserfromhell', count: 6 }
{ _id: 'more_itertools', count: 6 }
{ _id: 'pickle5', count: 4 }
{ _id: 'ir_datasets', count: 3 }
{ _id: 'torchaudio', count: 3 }
{ _id: 'xarray', count: 3 }
{ _id: 'spacy', count: 2 }
{ _id: 'apache_beam', count: 2 }
{ _id: 'alignments phones torchaudio', count: 2 }
{ _id: 'ffmpeg', count: 2 }
{ _id: 'boto3 botocore', count: 2 }
{ _id: 'cv2', count: 2 }
{ _id: 'evaluate', count: 1 }
{ _id: 'snowflake', count: 1 }
{ _id: 'stempeg', count: 1 }
{ _id: 'textgrids', count: 1 }
{ _id: 'pyconll', count: 1 }
{ _id: 'faiss sentence_transformers', count: 1 }
{ _id: 'ndjson', count: 1 }
{ _id: 'webdataset', count: 1 }
{ _id: 'detectron2', count: 1 }
{ _id: 'syntok', count: 1 }
{ _id: 'ijson', count: 1 }
{ _id: 'shapely', count: 1 }
{ _id: 'wikidata', count: 1 }
{ _id: 'torch, onnxruntime', count: 1 }
{ _id: 'nibabel boto3', count: 1 }
{ _id: 'sympy', count: 1 }
{ _id: '"bigbench @ https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz" # noqa: this is also required by bigbench.api.util',
count: 1 }
{ _id: 'pycocotools', count: 1 }
{ _id: 'lance', count: 1 }
{ _id: 'datasets,', count: 1 }
{ _id: 'function_parser git tree_sitter', count: 1 }
{ _id: 'ijson more_itertools', count: 1 }
{ _id: 'dataset', count: 1 }
{ _id: 'imgaug imageio', count: 1 }
{ _id: 'warcio selectolax', count: 1 }
{ _id: 's3fs xarray', count: 1 }
{ _id: 'xarray zarr gcsfs', count: 1 }
{ _id: 'sacrebleu', count: 1 }
{ _id: 'synapseclient', count: 1 }
{ _id: 'xarray s3fs', count: 1 }
{ _id: 'opustools', count: 1 }
{ _id: 'xarray netCDF4', count: 1 }
{ _id: 'wfdb', count: 1 }
{ _id: 'cassis', count: 1 }
{ _id: '#BToken is ERC20', count: 1 }
The most important ones to add could be ir-datasets, bioc and pytorch_ie.
Note that null means the regexp did not match:
> db.cachedResponsesBlue.aggregate([
  { $match: {error_code: "DatasetModuleNotInstalledError", "details.copied_from_artifact": {"$exists": false} } },
  { $set: { match: { $regexFind: { input: "$details.cause_message", regex: /[\'`]pip install (.+)[\'`]/ } } } },
  { $set: { library: { $arrayElemAt: ['$match.captures', 0] } } },
  { $match: { library: null } },
  { $project: { "details.cause_message": 1, _id: 0 } },
])
{ details: { cause_message: 'No module named \'datasets_modules.datasets.piEsposito--br-quad-2\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--car_v2\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_pt_dev_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_pt_train_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--mmarco_zh_dev_v1\'' } }
{ details: { cause_message: 'No module named \'datasets_modules.datasets.irds--argsme_1\'' } }
{ details: { cause_message: 'cannot import name \'saveopen\' from \'safetensors\' (/src/services/worker/.venv/lib/python3.9/site-packages/safetensors/__init__.py)' } }
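For illustration, here is the same extraction pattern as a minimal Python sketch; the example messages below are modeled on the cause messages above (an assumption, not verbatim text from the datasets library). It shows why "No module named ..." messages end up as null:

import re

# Same pattern as in the aggregation: capture the argument of a quoted or
# backticked "pip install ..." hint in the cause message.
PATTERN = re.compile(r"['`]pip install (.+)['`]")

# Example messages modeled on the cause messages above (assumption, not verbatim).
messages = [
    "To be able to use cdminix/libritts-aligned, you need to install the following "
    "dependencies: alignments, phones.\nPlease install them using "
    "'pip install alignments phones torchaudio' for instance.",
    "No module named 'datasets_modules.datasets.irds--car_v1'",
]

for message in messages:
    match = PATTERN.search(message)
    print(match.group(1) if match else None)
# prints "alignments phones torchaudio", then "None": a message without a
# "pip install" hint produces no capture, hence the null group in Mongo.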
Who wants to try to add the most required dependencies (ir-datasets, bioc and pytorch_ie)?
I can take care of this.
On the other hand, I am curious how people use ir-datasets and integrate it with datasets... :thinking:
I'm opening a dedicated issue for the new dependencies, and closing this issue.
I created https://github.com/huggingface/datasets-server/issues/1442, and assigned it to you @albertvillanova. Thanks!
On the other hand, I am curious how people use ir-datasets and integrate it with datasets... 🤔
Please share with us what you find
As suggested by @severo: