huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Local fasttext model #127

Closed (jordane95 closed 6 months ago)

jordane95 commented 7 months ago

It seems that the current fastText filter can only load a model from a remote URL. Is it possible to support loading a model from a local path?

guipenedo commented 7 months ago

It should also work with a local path. In that case, I believe it copies the model to the HF cache folder.

jordane95 commented 7 months ago

  File "/output/datatrove/src/datatrove/pipeline/filters/fasttext_filter.py", line 67, in filter
    labels, scores = self.model.predict(doc.text.replace("\n", ""))
  File "/output/datatrove/src/datatrove/pipeline/filters/fasttext_filter.py", line 63, in model
    self._model = _FastText(model_file)
  File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/fasttext/FastText.py", line 98, in __init__
    self.f.loadModel(model_path)
ValueError: /root/.cache/huggingface/assets/datatrove/filters/fasttext/_data_math_filter_train_cls_models_fasttext_math.bin has wrong file format!

I guess this error is related to the distributed setting, i.e., multiple workers writing to the same cache path at the same time?

guipenedo commented 6 months ago

We fixed asset loading/downloading by adding file locks in #155.
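
For readers hitting the same error, here is a minimal sketch of the general idea behind a file-lock fix. It is not the code from #155. It assumes a POSIX system (it uses `fcntl.flock` from the standard library), and the function name `copy_to_cache_locked` and the cache file naming scheme are hypothetical. Each worker takes an exclusive lock before touching the cached file, so no worker can load a half-written model.

```python
import fcntl
import os
import shutil
from pathlib import Path


def copy_to_cache_locked(src: str, cache_dir: str) -> str:
    """Copy `src` into `cache_dir` at most once, safely across processes.

    Hypothetical helper for illustration only; not part of datatrove.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    # Assumed naming scheme: flatten the source path into a single file name.
    target = cache / src.replace(os.sep, "_")
    lock_path = target.with_name(target.name + ".lock")

    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                # Copy to a temporary name first, then rename atomically,
                # so `target` is never visible in a partial state.
                tmp = target.with_name(target.name + ".tmp")
                shutil.copyfile(src, tmp)
                os.replace(tmp, target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

    return str(target)
```

The first worker to acquire the lock performs the copy. Later workers wait, see that the target already exists, and reuse it. Writing to a temporary file and renaming is an extra safeguard added in this sketch, so that an interrupted copy does not leave a truncated file under the final name.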