segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Can nnsplit use an http proxy? #31

Closed BingLingGroup closed 1 year ago

BingLingGroup commented 3 years ago

For some reason I can't directly fetch the resources required by nnsplit. For example:

splitter = NNSplit.load("fr")
nnsplit.ResourceError: network error fetching "model.onnx" for "fr"

I'm pretty sure this is a local network issue, because it works when I switch to another network.

So I'm wondering whether there is a way to use an HTTP proxy instead of sending the network request directly. I've tried setting environment variables such as http_proxy and https_proxy on Windows, but they had no effect.

bminixhofer commented 3 years ago

Hi, thanks for the issue! nnsplit cannot use a proxy at the moment. It is not entirely trivial to implement: nnsplit uses the reqwest crate for requests, so there would have to be some API to pass the relevant arguments from Python. Configuring a proxy in reqwest itself is quite easy, though (reference). I will not work on this myself in the near future, but I would accept a PR!

In the meantime, you can manually download the models from models/ and load your model with:

splitter = NNSplit("models/fr/model.onnx")
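
The manual download step above can itself go through a proxy. The following is a minimal sketch using only the Python standard library; `download_via_proxy` is a hypothetical helper, not part of nnsplit, and the model URL and proxy address are placeholders you would need to supply yourself.

```python
import shutil
import urllib.request
from pathlib import Path


def download_via_proxy(url: str, dest: str, proxy: str) -> Path:
    """Fetch `url` through an HTTP(S) proxy and save it to `dest`.

    Hypothetical helper for illustration; not part of nnsplit.
    """
    # Route both http and https requests through the same proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)

    target = Path(dest)
    # Create e.g. models/fr/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)

    # Stream the response to disk instead of holding it all in memory.
    with opener.open(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Assuming the file ends up at models/fr/model.onnx, it can then be loaded with NNSplit("models/fr/model.onnx") as shown above.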

bminixhofer commented 3 years ago

Let's keep this open. It's a valid issue.

bminixhofer commented 1 year ago

Hi! It's been a long while, but this is now possible in the new, revamped version of the library via:

from wtpsplit import WtP
wtp = WtP("wtp-bert-mini", from_pretrained_kwargs={"proxies": ...})

See all supported arguments here: https://huggingface.co/transformers/v2.11.0/model_doc/auto.html?highlight=from_pretrained#transformers.AutoModel.from_pretrained
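
As a rough illustration of what the elided proxies value might look like: the linked from_pretrained documentation describes it as a dictionary of proxy servers keyed by protocol or endpoint. The sketch below builds such a dictionary from the conventional proxy environment variables; `proxies_from_env` and `load_wtp` are hypothetical helper names, and only the WtP call itself comes from the comment above.

```python
import os


def proxies_from_env(environ=None) -> dict:
    """Collect proxy URLs from the conventional environment variables.

    Hypothetical helper for illustration; not part of wtpsplit.
    """
    env = os.environ if environ is None else environ
    proxies = {}
    for scheme in ("http", "https"):
        # Accept both spellings, e.g. http_proxy and HTTP_PROXY.
        value = env.get(f"{scheme}_proxy") or env.get(f"{scheme.upper()}_PROXY")
        if value:
            proxies[scheme] = value
    return proxies


def load_wtp(model_name: str = "wtp-bert-mini"):
    """Load a WtP model, passing any configured proxy to from_pretrained."""
    # Imported lazily so proxies_from_env works without wtpsplit installed.
    from wtpsplit import WtP

    # Fall back to None (no proxy) when no variables are set.
    proxies = proxies_from_env() or None
    return WtP(model_name, from_pretrained_kwargs={"proxies": proxies})
```

With http_proxy or https_proxy set in the environment, calling load_wtp() should route the model download through that proxy; a dictionary written out by hand works just as well.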