segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Control where the model is downloaded to? #59

Closed: awhillas closed this issue 1 year ago

awhillas commented 2 years ago

Hi, this is more of a minor feature request. I'm trying to use NNSplit in a container which has a read-only file system except for the /tmp dir. It would be groovy if one could provide a local path to load the model from/download to. Perhaps this is in the Python interface already, but I couldn't see it.

I know you can specify a path when calling NNSplit(), but this gets more complicated as I'm including it in a module that then gets included in another project.
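
For reference, a minimal sketch of that path-based workaround, assuming the NNSplit constructor accepts a local ONNX model path as described above (the exact file path is illustrative, not confirmed by this thread):

```python
from nnsplit import NNSplit

# Load from an explicit local path instead of the default download cache;
# the model file location below is an assumption for illustration.
splitter = NNSplit("/tmp/nnsplit/en/model.onnx")

# split() takes a list of texts and returns one result per input text.
splits = splitter.split(["This is a test This is another test."])
for sentence in splits[0]:
    print(str(sentence))
```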

Anyway, nice work and thanks!

bminixhofer commented 2 years ago

Hi, sorry for being late here.

I am not sure I understand your request correctly. Are you asking for a way to customize the cache directory (currently always ~/.cache/nnsplit)? If not, please elaborate (maybe with an example).

synweap15 commented 2 years ago

@bminixhofer seems like I've hit this case. I was trying to call NNSplit.load("en") on a server under a user with very limited access; the home directory of that user was not writable, and I was hitting permission error 13.

Being able to optionally specify a cache directory when loading would be great; then I could point NNSplit to e.g. /tmp/ as @awhillas noted. Something like NNSplit.load("en", cache_directory="/tmp").

awhillas commented 1 year ago

@bminixhofer yes, that is exactly what I'm suggesting, preferably by specifying an environment variable to make it Dockerfile-friendly.
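
A hypothetical sketch of how the two requests above could combine — cache_directory is the parameter proposed in this thread, and NNSPLIT_CACHE is an invented environment variable name, not part of the library:

```python
import os

# Hypothetical resolution order for the proposed cache option:
# explicit argument > environment variable (Dockerfile-friendly) > default.
def resolve_cache_dir(cache_directory=None):
    return (
        cache_directory
        or os.environ.get("NNSPLIT_CACHE")          # invented name, for illustration
        or os.path.expanduser("~/.cache/nnsplit")   # current default per this thread
    )

print(resolve_cache_dir())                # -> ~/.cache/nnsplit (expanded)
print(resolve_cache_dir("/tmp/nnsplit"))  # -> /tmp/nnsplit
```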

bminixhofer commented 1 year ago

Hi! Sorry for being so quiet on this library. I have been working on a major revamp, expanding support to 85 languages, switching to a new training objective without labelled data, and switching the backbone to a BERT-style model.

The models are now loaded via the Hugging Face Hub; see the cache setup docs for how to control the cache directory: https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup
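
In practice that means the standard Hugging Face cache controls apply. A minimal sketch, assuming the current wtpsplit API (the SaT class and the sat-3l-sm model name come from the project, not from this thread):

```python
import os

# Point the Hugging Face cache at a writable location (e.g. under /tmp)
# *before* importing anything that reads it; HF_HOME is the standard
# env var governing the hub cache location.
os.environ["HF_HOME"] = "/tmp/hf_cache"

from wtpsplit import SaT  # assuming the current wtpsplit API

sat = SaT("sat-3l-sm")  # downloaded via the Hugging Face Hub into HF_HOME
print(sat.split("This is a test This is another test."))
```

In a Dockerfile, the same effect can be achieved with an ENV instruction, which addresses the environment-variable request above.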