stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

FileNotFound: Pretrained file \stanza_resources\en\pretrain\fasttextcrawl.pt does not exist #1142

Closed FloweryScythe13 closed 2 years ago

FloweryScythe13 commented 2 years ago

Hi,

I am experiencing a blocking issue with some multilingual pipeline code. My code is as follows:

import stanza
from stanza.pipeline.multilingual import MultilingualPipeline

stanza.download("ar")
stanza.download("vi")
stanza.download("multilingual")
stanza.download("bg")
stanza.download("be")
stanza.download("en")
stanza.download("es")
stanza.download("he")
stanza.download("id")
stanza.download("ko")
stanza.download("pt")
stanza.download("tr")

nlp_multi = MultilingualPipeline(lang_id_config={
    "langid_clean_text": False, 
    "langid_lang_subset": ["en","ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "th", "tr", "vi" ],
    }, 
    lang_configs={
        "en": {"processors": 'tokenize, pos, ner', "download_method": None},
        "ar": {"processors": 'tokenize, ner', "download_method": None},
        "es": {"processors": 'tokenize, pos, ner', "download_method": None},
        "pt": {"processors": 'tokenize, pos, ner', "download_method": None},
        "be": {"processors": 'tokenize, ner', "download_method": None},
        "bg": {"processors": 'tokenize, ner', "download_method": None},
        "he": {"processors": 'tokenize, ner', "download_method": None},
        "id": {"processors": 'tokenize, ner', "download_method": None},
        "ko": {"processors": 'tokenize, ner', "download_method": None},
        "th": {"processors": 'tokenize, ner', "download_method": None},
        "tr": {"processors": 'tokenize, ner', "download_method": None},
        "vi": {"processors": 'tokenize, ner', "download_method": None}
    }, max_cache_size=15 )

# this is a Pandas Series FWIW
docs_series = fb_df['description'][fb_df['description'].notna()] 
docs_list = docs_series.to_list()

langed_docs = nlp_multi(docs_list)

This is the error I am getting:

2022-10-12 22:05:56 INFO: Loading these models for language: en (English):
=========================
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| ner       | ontonotes |
=========================

2022-10-12 22:05:56 INFO: Use device: cpu
2022-10-12 22:05:56 INFO: Loading: tokenize
2022-10-12 22:05:56 INFO: Loading: pos
2022-10-12 22:05:57 INFO: Loading: ner
FileNotFoundError                         Traceback (most recent call last)
e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\pipeline\core.py in __init__(self, lang, dir, package, processors, logging_level, verbose, use_gpu, model_dir, download_method, resources_url, resources_branch, resources_version, proxies, **kwargs)
    279                                                                                           pipeline=self,
--> 280                                                                                           use_gpu=self.use_gpu)
    281             except ProcessorRequirementsException as e:

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\pipeline\processor.py in __init__(self, config, pipeline, use_gpu)
    172         if not hasattr(self, '_variant'):
--> 173             self._set_up_model(config, pipeline, use_gpu)
    174 

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\pipeline\ner_processor.py in _set_up_model(self, config, pipeline, use_gpu)
     48                     'charlm_backward_file': charlm_backward}
---> 49             trainer = Trainer(args=args, model_file=model_path, pretrain=pretrain, use_cuda=use_gpu, foundation_cache=pipeline.foundation_cache)
     50             self.trainers.append(trainer)

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\models\ner\trainer.py in __init__(self, args, vocab, pretrain, model_file, use_cuda, train_classifier_only, foundation_cache)
     70             # load everything from file
---> 71             self.load(model_file, pretrain, args, foundation_cache)
     72         else:

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\models\ner\trainer.py in load(self, filename, pretrain, args, foundation_cache)
    167         if pretrain is not None:
--> 168             emb_matrix = pretrain.emb
    169 

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\models\common\pretrain.py in emb(self)
     49         if not hasattr(self, '_emb'):
---> 50             self.load()
     51         return self._emb

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\models\common\pretrain.py in load(self)
     70             if not self._vec_filename and not self._csv_filename:
---> 71                 raise FileNotFoundError("Pretrained file {} does not exist, and no text/xz file was provided".format(self.filename))
     72             if self.filename is not None:

FileNotFoundError: Pretrained file E:\repos\stanza_resources\en\pretrain\fasttextcrawl.pt does not exist, and no text/xz file was provided

During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
<ipython-input-15-c92758b16980> in <module>
      4 
      5 
----> 6 langed_docs = nlp_multi(docs_list)

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\pipeline\multilingual.py in __call__(self, doc)
    127 
    128     def __call__(self, doc):
--> 129         doc = self.process(doc)
    130         return doc
    131 

e:\Users\Eric\miniconda3\envs\arcgis_env\lib\site-packages\stanza\pipeline\multilingual.py in process(self, doc)
...
--> 183     p = os.fspath(p)
    184     seps = _get_bothseps(p)
    185     d, p = splitdrive(p)

TypeError: expected str, bytes or os.PathLike object, not list

After looking at several other closed issues referencing the FileNotFoundError exception, I did double-check and rerun stanza.download("en"). No effect.

The only file present in the above-referenced \stanza_resources\en\pretrain\ directory is combined.pt.

Also, as a potentially important note: I first wrote this code back in late June/early July, and the above pipeline code ran successfully at that time (though not quite in the way I wanted from a multilingual standpoint, but that's another matter). It is only now, after returning to it and creating a new replacement conda environment, that this FileNotFoundError exception is being thrown. Perhaps a change in the last two minor releases is the reason?

Environment (please complete the following information):

OS: Windows
Python version: 3.7.11
Stanza version: 1.4.1 and 1.4.2 (tried both, installed via both Miniconda and pip)

AngledLuffa commented 2 years ago

I just ran this on my Linux desktop without the pandas, and it worked fine. Obviously I'll need to try it on Windows without the pandas, but it would be helpful if you could confirm that it still fails when you replace docs_list with something like "Stop making new bugs Luffa".
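
For reference, that check would amount to something like the following (a minimal sketch reusing the nlp_multi pipeline from the snippet above; the test sentence is just a placeholder):

# Suggested check: pass a plain string instead of the pandas-derived list
# to see whether the failure still occurs with a trivial input.
langed_docs = nlp_multi("Stop making new bugs Luffa")
print(langed_docs)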

AngledLuffa commented 2 years ago

Ah, I can recreate it by not having the word vectors file for the NER in the expected location. It happens on both Windows and Linux, and the input format doesn't matter, as expected.

Basically, originally the NER models had their own separate copy of the embeddings in the model itself, whereas the other models all downloaded the embeddings separately. I separated the NER models into pretrained embeddings and everything else, kind of like when Capt. Janeway murdered Tuvix. The benefit was that most languages would have smaller downloads, since now the NER models would be much smaller and reuse the same embeddings as the POS models. This isn't true for English, though, which uses a different embedding for NER and POS/depparse.

Unfortunately, while I set the single-language Pipeline to download needed models when they are missing, I apparently didn't do that for the MultilingualPipeline. Also, that embedding isn't included in the default.zip for whatever reason.

Simple fix for now: add this to your script

stanza.download("en", processors="ner")

Long term, I'll fix both of those issues above.

AngledLuffa commented 2 years ago

The .zip building script is updated to include the extra embeddings when needed, and I pushed those zips to the repo (I hoped to do it all sneaky style, but someone tried to download the Russian models in the middle of the download: https://github.com/stanfordnlp/stanza-resources/issues/10#issuecomment-1277311428)

https://github.com/stanfordnlp/stanza/commit/435685f875766e0b9b2b9b1d4792db1c452f9722

AngledLuffa commented 2 years ago

As for the downloading, I just noticed that you have download_method set to None. If you switch that to "download_method": stanza.DownloadMethod.REUSE_RESOURCES in your configs, it will download missing pieces without re-downloading the resources .json every time.
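
Applied to the original config, that change would look roughly like this (a sketch trimmed to two languages; the remaining entries follow the same pattern):

import stanza
from stanza.pipeline.multilingual import MultilingualPipeline

# Same setup as the original snippet, but with download_method set so that
# missing model files (like the NER pretrain) are fetched on demand while
# the already-downloaded resources file is reused.
nlp_multi = MultilingualPipeline(
    lang_id_config={"langid_clean_text": False},
    lang_configs={
        "en": {"processors": 'tokenize,pos,ner',
               "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "ar": {"processors": 'tokenize,ner',
               "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        # ...same change for the other languages in the original config...
    },
    max_cache_size=15)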

FloweryScythe13 commented 2 years ago

This got me going again. Thank you very much!