stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Am I properly using stanza offline (coref English model - Electra Large)? #1399

Open Zappandy opened 2 days ago

Zappandy commented 2 days ago

I'm currently attempting to run a stanza pipeline, which I had built on my local machine, on an HPC cluster with no access to the Hugging Face hub or the stanza server. To get around this, I downloaded all of the models I needed and set download_method to None. While this seemed to work for most of the English processors, the coreference processor bypassed the local files and kept trying to download the google/electra-large model.
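
For reference, this is roughly what the offline setup looks like (a sketch; the model directory and processor list are placeholders for my actual configuration):

    import stanza

    # Sketch of the offline setup: all models were downloaded beforehand,
    # and download_method=None is meant to skip every download attempt.
    nlp = stanza.Pipeline(
        lang="en",
        processors="tokenize,coref",
        dir="/hpc/path/to/stanza_resources",  # placeholder path
        download_method=None,
    )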

After setting environment variables such as HF_HUB_CACHE to the path where the HF cache is stored on the HPC, and HF_HUB_OFFLINE='1', the Hugging Face from_pretrained calls in the bert.py script of the models/coref directory still kept attempting to download files. I found that to avoid any downloads, the local_files_only parameter of from_pretrained must be set to True (I tested this locally with no internet connection).
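
In other words, something like this (a minimal sketch; the cache path is a placeholder, and I'm assuming google/electra-large-discriminator is the full hub id behind the model mentioned above):

    import os
    from transformers import AutoModel

    # These alone did not stop the download attempts in my testing:
    os.environ["HF_HUB_CACHE"] = "/hpc/path/to/hf_cache"  # placeholder path
    os.environ["HF_HUB_OFFLINE"] = "1"

    # Only this reliably avoided any network access:
    model = AutoModel.from_pretrained("google/electra-large-discriminator",
                                      local_files_only=True)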

Unless I'm missing something, with the current setup I don't see how I can pass this parameter to the from_pretrained calls in bert.py without editing the script directly, since the config object used there is not the stanza config dictionary I defined. It seems the config that bert.py reads is fetched from the model's .pt file via torch.load, which of course means it won't contain a local_files_only parameter.
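
Roughly this loading path, as far as I can tell (a sketch; the checkpoint filename and key name are illustrative):

    import torch

    # The coref config is restored from the checkpoint itself, not from the
    # stanza pipeline config the user defines, so it has no way of carrying
    # a local_files_only flag.
    checkpoint = torch.load("en_coref_model.pt", map_location="cpu")
    config = checkpoint["config"]  # key name illustrative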

Am I missing something, or is this the expected behavior?

AngledLuffa commented 2 days ago

Thanks, this is a good observation. So what I'm hearing is that we need some way to pass local_files_only to the code path(s) that load the transformers, right? But probably also to this line, which doesn't have any config at all:

    model = AutoModel.from_pretrained(config.bert_model).to(config.device)
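
One possible shape for the change (just a sketch, not a tested patch; it assumes the coref config object could be taught to carry such a flag):

    # Sketch: default to False so existing checkpoints keep working
    local_files_only = getattr(config, "local_files_only", False)
    model = AutoModel.from_pretrained(config.bert_model,
                                      local_files_only=local_files_only).to(config.device)
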
Zappandy commented 1 day ago

Yes. I don't know how feasible it would be to pass arbitrary transformers options through the stanza pipeline config dictionary the user defines. That may be too much, but at least for an offline mode, local_files_only should be passed to every from_pretrained call whenever the user has set a cache directory where the models and tokenizers are stored.

An alternative is just to pass the local path to the from_pretrained methods, but this is less portable.
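
For example (a sketch; the directory is a placeholder for a local snapshot containing config.json and the weights):

    from transformers import AutoModel, AutoTokenizer

    # Point from_pretrained at a local directory instead of a hub id.
    # Works fully offline, but hard-codes a machine-specific path.
    local_path = "/hpc/models/electra-large"  # placeholder
    model = AutoModel.from_pretrained(local_path)
    tokenizer = AutoTokenizer.from_pretrained(local_path)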