stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

TypeError: expected np.ndarray (got Tensor) #1431

Open topl0305 opened 2 weeks ago

topl0305 commented 2 weeks ago

Describe the bug
I was trying to use the pretrained model https://huggingface.co/stanfordnlp/stanza-lt. With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually. So I installed and downloaded everything and used the following piece of code to reproduce the bug:

import stanza

config = {
    'processors': 'tokenize,pos',
    'lang': 'lt',
    'tokenize_model_path': './stanza_resources/lt/tokenize/alksnis.pt',
    'pos_model_path': './stanza_resources/lt/pos/alksnis_nocharlm.pt',
    'pos_pretrain_path': './stanza_resources/lt/pretrain/fasttextwiki.pt',
    'tokenize_pretokenized': True,
    'download_method': None
}

nlp = stanza.Pipeline(**config)  # initialize neural pipeline
doc = nlp("Kur einam mes su Knysliuku, didžiulė paslaptis")  # run annotation over a sentence
print(doc)

Expected behavior
The result should be obvious:

[ [ { "id": 1, "text": "Kur", "upos": "ADV", "xpos": "prm.l.lrgin.", "feats": "Degree=Pos|PronType=Int,Rel", "misc": "", "start_char": 0, "end_char": 3 }, ... ]

Environment (please complete the following information):

Additional context
At least it works after patching the code in stanza/models/pos/model.py around line 90, changing

self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))

to

if type(emb_matrix) == torch.Tensor:
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(emb_matrix, freeze=True))
else:
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))

Not sure which is the culprit: the library or the model.

AngledLuffa commented 2 weeks ago

Ultimately the problem here is that we modified the models for the upcoming version 1.10, and you're downloading the new models with the old code. You could use the dev branch, or download the version 1.9 models directly from HF if you're sure you need to do it manually.
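For context, a rough sketch of the two options (the install commands are assumptions about your setup, not something verified in this thread):

# Option 1: use the dev branch so the code matches the new 1.10-style models.
#   pip install git+https://github.com/stanfordnlp/stanza.git@dev
# Option 2: stay on the released package so download() fetches the matching 1.9 models.
#   pip install stanza==1.9.2
import stanza

stanza.download("lt")   # fetches the Lithuanian models matching the installed code
nlp = stanza.Pipeline("lt", processors="tokenize,pos")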

With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually.

"crashing" how? like with a bad connection? it doesn't "crash" when i run it

you also don't need to do any of that

just run

nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)

it should automatically download just the models you need for the right version
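In case it helps, the full flow with pretokenized input would look roughly like this (a sketch; the sample sentence is just the one from your snippet, split into tokens by hand):

import stanza

# Downloads the matching tokenize/pos models on first use.
nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)

# With tokenize_pretokenized=True the tokenizer is bypassed: pass either a
# whitespace-tokenized string or a list of sentences, each a list of tokens.
doc = nlp([["Kur", "einam", "mes", "su", "Knysliuku", ",", "didžiulė", "paslaptis"]])
for word in doc.sentences[0].words:
    print(word.text, word.upos)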

topl0305 commented 2 weeks ago

If I use the direct download, stanza.download('lt'), I get the following error:

Traceback (most recent call last):
  File "C:/Users/***/Desktop/test_nlp.py", line 2, in <module>
    stanza.download('lt') # download English model
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 599, in download
    request_file(
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
    assert_file_exists(path, md5, alternate_md5)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
    raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\default.zip is 36e9cd4989fac42001d585dc514c2020, expected 3b1725c28eeed0cdf734bd92ec82f927

This is the log file: log.txt

I was also testing your suggestion, nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True), and got this:

Traceback (most recent call last):
  File "C:/Users/***/Desktop/test_nlp.py", line 7, in <module>
    nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\pipeline\core.py", line 252, in __init__
    download_models(download_list,
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 540, in download_models
    request_file(
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
    assert_file_exists(path, md5, alternate_md5)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
    raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\pretrain\fasttextwiki.pt is 6996c18339716076308d957354340a61, expected 89420a04d9c0b31feb5598e17eb52f8f

AngledLuffa commented 2 weeks ago

That's pretty weird. If I use the GitHub repo's main branch (which is 1.9.2), download successfully downloads a file with the following md5sum, which is the expected value:

[john@localhost stanza]$ md5sum /home/john/stanza_resources/lt/default.zip
3b1725c28eeed0cdf734bd92ec82f927  /home/john/stanza_resources/lt/default.zip

I can switch branches back & forth between main & dev, and it overwrites the old models when trying to download again. At no point does it download a model with md5sum 36e9cd4989fac42001d585dc514c2020. This works on both Linux and Windows.

Is it possible the download was interrupted and it got a corrupted file?

At any rate, I suggest deleting those incorrect files and trying again.
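Something along these lines should clear out the bad copies and retry (a sketch assuming the default stanza_resources location under your home directory, as shown in your tracebacks):

import os
import shutil
import stanza

# Default download location; adjust if you pass a custom model_dir.
lt_dir = os.path.join(os.path.expanduser("~"), "stanza_resources", "lt")
if os.path.isdir(lt_dir):
    shutil.rmtree(lt_dir)   # remove the mismatched / partially downloaded files
stanza.download("lt")       # re-download; the md5 check runs again on the fresh copy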