Open topl0305 opened 2 weeks ago
Ultimately, the problem here is that we modified the models for the upcoming version 1.10, and you're downloading the new models with the old code. You could use the dev branch, or download the version 1.9 models directly from HF if you're sure you need to do it manually.
With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually.
"Crashing" how? Like with a bad connection? It doesn't "crash" when I run it.
You also don't need to do any of that. Just run:
nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)
It should automatically download just the models you need for the right version.
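As a side note on that suggestion: with tokenize_pretokenized=True, the pipeline skips the tokenizer and expects input that is already split into sentences and tokens. A minimal sketch of that input shape (the sentence is the one from this issue; this is only the data structure, not a full pipeline run):

```python
# With tokenize_pretokenized=True, Stanza expects a list of sentences,
# each sentence being a list of token strings. (Passing a plain string
# instead makes Stanza split it on whitespace and newlines.)
pretokenized = [
    ["Kur", "einam", "mes", "su", "Knysliuku", ",", "didžiulė", "paslaptis"],
]

# The outer list holds sentences; every element of a sentence is one token.
assert all(isinstance(tok, str) for sent in pretokenized for tok in sent)

# This structure would then be passed to the pipeline in place of raw text:
# doc = nlp(pretokenized)
```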
If I use the direct download - stanza.download('lt') - I get the following error:
Traceback (most recent call last):
File "C:/Users/***/Desktop/test_nlp.py", line 2, in <module>
stanza.download('lt') # download English model
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 599, in download
request_file(
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
assert_file_exists(path, md5, alternate_md5)
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\default.zip is 36e9cd4989fac42001d585dc514c2020, expected 3b1725c28eeed0cdf734bd92ec82f927
This is log file: log.txt
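For what it's worth, the checksum the downloader complains about can be verified by hand. A minimal stdlib sketch (the helper name is mine; point it at the path from the traceback):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 of a file, reading in chunks so large
    model archives don't have to fit in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the downloaded archive against the hash stanza expects, e.g.:
# file_md5(r"C:\Users\...\stanza_resources\lt\default.zip")
```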
I was testing your suggestion -- nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)
Traceback (most recent call last):
File "C:/Users/***/Desktop/test_nlp.py", line 7, in <module>
nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\pipeline\core.py", line 252, in __init__
download_models(download_list,
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 540, in download_models
request_file(
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
assert_file_exists(path, md5, alternate_md5)
File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\pretrain\fasttextwiki.pt is 6996c18339716076308d957354340a61, expected 89420a04d9c0b31feb5598e17eb52f8f
That's pretty weird. If I use the GitHub repo main branch (which is 1.9.2), the download succeeds and produces a file with the following md5sum, which is the expected value:
[john@localhost stanza]$ md5sum /home/john/stanza_resources/lt/default.zip
3b1725c28eeed0cdf734bd92ec82f927 /home/john/stanza_resources/lt/default.zip
I can switch branches back & forth between main & dev, and it overwrites the old models when trying to download again. At no point does it download a model with md5sum 36e9cd4989fac42001d585dc514c2020
This works on both Linux and Windows
Is it possible the download was interrupted and it got a corrupted file?
At any rate, I suggest deleting those incorrect files and trying again.
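A sketch of that cleanup, assuming the default resources location (~/stanza_resources); the helper name is mine, not a stanza API:

```python
import shutil
from pathlib import Path

def purge_language_models(lang, resources_dir=Path.home() / "stanza_resources"):
    """Delete the cached model directory for one language so the next
    stanza.download() or Pipeline construction fetches fresh copies."""
    lang_dir = Path(resources_dir) / lang
    if lang_dir.exists():
        shutil.rmtree(lang_dir)
    return lang_dir

# purge_language_models("lt")   # then re-run stanza.download("lt")
```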
Describe the bug
Was trying to use the pretrained model https://huggingface.co/stanfordnlp/stanza-lt. With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually. So I installed and downloaded everything and used the following piece of code to reproduce the bug:
import stanza

config = {
    'processors': 'tokenize,pos',
    'lang': 'lt',
    'tokenize_model_path': './stanza_resources/lt/tokenize/alksnis.pt',
    'pos_model_path': './stanza_resources/lt/pos/alksnis_nocharlm.pt',
    'pos_pretrain_path': './stanza_resources/lt/pretrain/fasttextwiki.pt',
    'tokenize_pretokenized': True,
    'download_method': None
}

nlp = stanza.Pipeline(**config)  # initialize neural pipeline
doc = nlp("Kur einam mes su Knysliuku, didžiulė paslaptis")  # run annotation over a sentence
print(doc)
Expected behavior
The result should be obvious:
Environment (please complete the following information):
Additional context
At least it works after patching the code in stanza/models/pos/model.py around line 90, changing

self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))

to

if type(emb_matrix) == torch.Tensor:
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(emb_matrix, freeze=True))
else:
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))
Not sure which is the culprit - the library or the model.