modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.63k stars 166 forks

Update spacy to resolve conflict with ms-swift #397

Closed BeachWang closed 1 month ago

BeachWang commented 1 month ago

I have tested the related OPs; spacy can be upgraded to the latest version.

drcege commented 1 month ago

Is the model version being handled correctly? (Ref: https://spacy.io/models)

https://github.com/modelscope/data-juicer/blob/213f7f8aef078395eaefd89817b098eeb94a45b4/data_juicer/utils/model_utils.py#L409

BeachWang commented 1 month ago

> Is the model version being handled correctly? (Ref: https://spacy.io/models)
>
> https://github.com/modelscope/data-juicer/blob/213f7f8aef078395eaefd89817b098eeb94a45b4/data_juicer/utils/model_utils.py#L409

Newer versions of spacy are compatible with models built for older versions, and there is no zh_core_web_md model, so the model version will stay at 3.5.0 for now.

drcege commented 1 month ago

Please check the following link: compatibility.json. sm, md, and lg are simply indicators of model sizes.

@HYLcool, could you also take a look at how we can adaptively handle versioning?
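One way to handle versioning adaptively is to look up the model version in a compatibility table keyed by the installed spacy version. The following is a minimal sketch, not the project's implementation. It assumes the table has a top-level `"spacy"` key that maps a spacy version (full or minor) to a dict of model names, each with a list of compatible model versions, newest first. The function names and `SAMPLE_TABLE` are hypothetical and only illustrate the lookup; the real table would be fetched from the spacy-models repository.

```python
from typing import Optional


def minor_version(version: str) -> str:
    """Reduce a version such as '3.7.2' to its minor form '3.7'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def resolve_model_version(table: dict, spacy_version: str, model_name: str) -> Optional[str]:
    """Return the first listed model version for this spacy version, or None.

    Assumption: table["spacy"][<spacy version>][<model name>] is a list of
    compatible model versions, newest first.
    """
    by_spacy = table.get("spacy", {})
    # Try the exact version first, then fall back to the minor version.
    models = by_spacy.get(spacy_version) or by_spacy.get(minor_version(spacy_version))
    if not models:
        return None
    versions = models.get(model_name) or []
    return versions[0] if versions else None


# Hypothetical sample data, shaped like the assumed table layout.
SAMPLE_TABLE = {
    "spacy": {
        "3.7": {"zh_core_web_md": ["3.7.0"], "en_core_web_md": ["3.7.1", "3.7.0"]},
        "3.5": {"zh_core_web_md": ["3.5.0"]},
    }
}
```

With a lookup like this, the model version no longer has to be hard-coded next to the spacy pin; a `None` result signals that no compatible model is listed and the caller can fall back or fail with a clear message.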

drcege commented 1 month ago

Pin the version to 3.7.0 and support extracting official tar.gz format.
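As a rough illustration of the tar.gz handling, the sketch below extracts an archive and locates the directory that contains `config.cfg`, which is the directory a spacy v3 pipeline is loaded from. It is not the code from this PR. The function name `extract_model_archive` is hypothetical, and the nested layout of the official archives (package directory, then model directory) is an assumption, which is why the sketch searches for `config.cfg` instead of hard-coding a path.

```python
import tarfile
from pathlib import Path


def extract_model_archive(archive_path, target_dir) -> Path:
    """Extract a .tar.gz model archive and return the directory holding config.cfg."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, "r:gz") as tar:
        # Reject member paths that would land outside the target directory.
        for member in tar.getmembers():
            dest = (target / member.name).resolve()
            if dest != target and target not in dest.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(target)

    # Assumption: the loadable model directory is the one containing config.cfg.
    candidates = sorted(p.parent for p in target.rglob("config.cfg"))
    if not candidates:
        raise FileNotFoundError(f"no config.cfg found under {target}")
    return candidates[0]
```

The returned path can then be passed to `spacy.load`, which accepts a directory path as well as a package name.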

Ideally, we should be able to automatically download the required packages via spacy.cli.download. This could potentially be considered in conjunction with issue #398.
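A minimal sketch of that idea: try to load the pipeline, and on failure call `spacy.cli.download` and retry. This is an illustration under stated assumptions, not the project's code. The helper name `load_or_download` and its injectable `load` and `download` parameters are hypothetical (they exist so the flow can be exercised without spacy or network access), and it assumes a missing model surfaces as `OSError` from `spacy.load`.

```python
import importlib


def load_or_download(name, load=None, download=None):
    """Load a spacy pipeline by name, downloading it first if it is missing."""
    if load is None or download is None:
        # Imported lazily so the helper can be defined without spacy installed.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download

    try:
        return load(name)
    except OSError:
        # Assumption: spacy.load raises OSError when the package is not installed.
        download(name)
        # A package installed mid-process may not be importable until the
        # import caches are refreshed; in some setups a restart is still needed.
        importlib.invalidate_caches()
        return load(name)
```

Usage would look like `nlp = load_or_download("zh_core_web_sm")`. Because `spacy.cli.download` resolves a model version compatible with the installed spacy, this approach would avoid pinning the model version separately.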