stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

ValueError: mismatching md5 value when downloading 'grc' model #1374

Closed: silvia-stopponi closed this issue 3 months ago

silvia-stopponi commented 3 months ago

**Describe the bug**
When I download the 'grc' model, the download reaches 100%, but I get the following error:

```
ValueError: md5 for /home/my_name/stanza_resources/grc/default.zip is 7c3562a76f82045c92e8216c68ee00a0, expected 9855292e615b94b30581504c2941a96a
```

**To Reproduce**
Steps to reproduce the behavior:

  1. Open the Python interpreter
  2. Run:

```python
import stanza
stanza.download('grc')
```

```
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 16.8MB/s]
2024-03-25 14:00:37 INFO: Downloaded file to /home/silvia/stanza_resources/resources.json
2024-03-25 14:00:37 INFO: Downloading default packages for language: grc (Ancient_Greek) ...
Downloading https://huggingface.co/stanfordnlp/stanza-grc/resolve/v1.8.0/models/default.zip: 100%|█| 121M/121M [00:14<0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/silvia/stanza-testenv/lib/python3.8/site-packages/stanza/resources/common.py", line 594, in download
    request_file(
  File "/home/silvia/stanza-testenv/lib/python3.8/site-packages/stanza/resources/common.py", line 154, in request_file
    assert_file_exists(path, md5, alternate_md5)
  File "/home/silvia/stanza-testenv/lib/python3.8/site-packages/stanza/resources/common.py", line 107, in assert_file_exists
    raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for /home/silvia/stanza_resources/grc/default.zip is 7c3562a76f82045c92e8216c68ee00a0, expected 9855292e615b94b30581504c2941a96a
```
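For context, the check raising this ValueError is essentially an md5 comparison of the downloaded archive against the value listed in the manifest (resources.json). A minimal sketch of the same verification; the `file_md5` helper is illustrative, not Stanza's API, and the expected hash is the one from the traceback above:

```python
import hashlib

def file_md5(path, chunk_size=8192):
    """Compute the md5 of a file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the downloaded archive against the manifest value.
expected = "9855292e615b94b30581504c2941a96a"  # from resources.json
actual = file_md5("/home/silvia/stanza_resources/grc/default.zip")
if actual != expected:
    raise ValueError(f"md5 is {actual}, expected {expected}")
```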



**Expected behavior**
No error at the end of download.

**Environment (please complete the following information):**
 - OS: Ubuntu on Windows (WSL)
 - Python version: 3.8.10
 - Stanza version: 1.8.1

**Additional context**
I installed everything in a virtual environment. But I get the same problem in a Google Colab notebook.
It does not happen if I install stanza 1.2 instead.
AngledLuffa commented 3 months ago

Ah, sorry for the inconvenience. I had updated the POS models to use a smaller batch size to better reflect some code changes we made in the previous version, and I had forgotten to push the updated manifest with the md5 sums for the new models. It should be fixed now.
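For anyone who hit this before the manifest was re-pushed, removing the stale archive and re-running the download should pick up the corrected md5 sums. A minimal sketch, assuming the default resources location under the home directory:

```python
import shutil
from pathlib import Path

import stanza

# Remove the stale archive and the cached manifest, then re-download.
resources_dir = Path.home() / "stanza_resources"  # default download location
shutil.rmtree(resources_dir / "grc", ignore_errors=True)
(resources_dir / "resources.json").unlink(missing_ok=True)

stanza.download('grc')
```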

silvia-stopponi commented 3 months ago

Great, thanks, it works now! A colleague and I are presenting a Stanza-based lemmatizer for Ancient Greek next week, so we are relieved to see it working again :)

AngledLuffa commented 3 months ago

Ah, neat.

I found that the Perseus and Proiel datasets have very similar annotation schemes, to the point that if I combine the two training sets, the combined model does quite well on the two test sets:

| Train set | PROIEL dev | PROIEL test | Perseus dev | Perseus test |
|-----------|-----------:|------------:|------------:|-------------:|
| PROIEL    | 97.34 | 97.51 | 75.85 | 71.31 |
| Perseus   | 87.34 | 87.73 | 92.31 | 88.45 |
| Combined  | 97.01 | 97.33 | 92.23 | 88.58 |

So that seems like a simple way to get a noticeable win for your demonstration. Want that model?

I can't yet speak to whether the tag & dependency schemes are also the same, but they might be.
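For reference, the two existing grc packages can already be selected by name when building a pipeline, so a combined model would presumably be exposed the same way. A quick sketch, assuming the models are downloaded on first use and that I have the package names right:

```python
import stanza

# The two existing Ancient Greek packages, selected by treebank name.
nlp_proiel = stanza.Pipeline('grc', package='proiel', processors='tokenize,pos,lemma')
nlp_perseus = stanza.Pipeline('grc', package='perseus', processors='tokenize,pos,lemma')

doc = nlp_proiel('μῆνιν ἄειδε θεὰ')
print([(word.text, word.lemma) for sent in doc.sentences for word in sent.words])
```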

silvia-stopponi commented 3 months ago

Interesting, but were duplicated texts removed? The PROIEL treebank contains three Greek texts: the Greek New Testament, Herodotus' Histories, and Sphrantzes' Chronicles (see https://dev.syntacticus.org/proiel.html#contents). Note that the first two works are also contained in Perseus (based on the list of works on the Perseus Digital Library website: https://www.perseus.tufts.edu/hopper/collection%3Fcollection%3DPerseus:collection:Greco-Roman). So combining the two training sets means potentially duplicating a portion of the data (and even testing on data seen in training? It depends on how the split between train, dev, and test is made). I am not sure about the implications of this (overfitting?)...

silvia-stopponi commented 3 months ago

In any case, our tool is a lemmatizer for inscriptions, a different kind of data (https://github.com/agile-gronlp/agile). I am not sure about the improvement we could get from a model with more Herodotus and more New Testament...

AngledLuffa commented 3 months ago

Well, don't tell my PI about the few hours I just spent on this :P Hope the demo works as you want... do let us know if there's anything we can do

AngledLuffa commented 3 months ago

As for how I would combine them, I'd just add the train sections together, then dev & test on one of the datasets, possibly Perseus. For POS, the XPOS tags are clearly incompatible, so I'd train with just the UPOS and features from the other dataset. I'm not sure how compatible the dependencies would be.
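Concretely, combining at the CoNLL-U level would just mean concatenating the training files, blanking the XPOS column for the dataset whose tagset doesn't match. A rough sketch, with the standard UD file names assumed as local paths; the dev and test splits would stay as the Perseus files:

```python
from pathlib import Path

# Assumed local paths to the two UD treebanks.
proiel = Path("UD_Ancient_Greek-PROIEL/grc_proiel-ud-train.conllu")
perseus = Path("UD_Ancient_Greek-Perseus/grc_perseus-ud-train.conllu")

def blank_xpos(conllu_text):
    """Replace the XPOS column (field 5) with '_' so incompatible tagsets don't mix."""
    out = []
    for line in conllu_text.splitlines():
        if line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) == 10:
                fields[4] = "_"
                line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"

# Concatenate the train sections; PROIEL contributes only UPOS and features.
combined = blank_xpos(proiel.read_text(encoding="utf-8")) + perseus.read_text(encoding="utf-8")
Path("grc_combined-ud-train.conllu").write_text(combined, encoding="utf-8")
```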

silvia-stopponi commented 3 months ago

> Well, don't tell my PI about the few hours I just spent on this :P Hope the demo works as you want... do let us know if there's anything we can do

Yes, after your modification everything works as expected!

silvia-stopponi commented 3 months ago

> As for how I would combine them, I'd just add the train sections together, then dev & test on one of the datasets, possibly Perseus. For POS, the XPOS tags are clearly incompatible, so I'd train with just the UPOS and features from the other dataset. I'm not sure how compatible the dependencies would be.

Technically, from PROIEL you should only use Sphrantzes' Chronicles if you already have Perseus. Or, if you are using this version (https://perseusdl.github.io/treebank_data/), which is a subset of the full Perseus Greek collection, you can also add the New Testament. By adding texts that are already in Perseus (Herodotus) you not only increase the weight of some authors/works in the training data (I am not sure about the concrete effect this has and whether it is positive or negative; I guess it depends on how the trained model is used), but you could also end up using as test data some material already used for training. Not necessarily, of course; it depends on how you divide the works between train and test for the two corpora. If that happens, the performance you obtain could be artificially inflated, at least on some tasks. By the way, on which task are you testing? POS tagging, I guess?

If you are looking for more training data, I found some references to existing treebanks in this paper: https://aclanthology.org/W19-7812.pdf. Moreover, there are the GLAUx trees (https://perseids-publications.github.io/glaux-trees/, files here: https://github.com/perseids-publications/glaux-trees/tree/master/public/xml) and the Aphtonius trees: https://github.com/polinayordanova/Treebank-of-Aphtonius-Progymnasmata. But I don't know about the compatibility of annotation between the different collections (the last two should be AGDT-like), and you can expect overlap in texts between at least some collections.

AngledLuffa commented 2 months ago

Thanks for the clarifications on the data sources for an Ancient Greek set of models. If there's interest in having models built from multiple data sources, it's definitely something we can do. Otherwise, we'll probably just leave it alone from here.

I would have to check that the dev & test sets aren't overlapping, as you point out. Normally when we've made "combined" models, the datasets are completely disjoint and it hasn't been a problem in practice. It sounds like the GRC datasets do in fact use the same data sources, though, so there would need to be a little more care taken to merge them.
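A quick way to screen for that kind of overlap is to compare the sentence texts across splits. A rough sketch; the file paths are assumptions, and exact-match comparison will miss near-duplicates caused by differing tokenization or editions:

```python
def sentence_texts(conllu_path):
    """Collect the '# text = ...' lines from a CoNLL-U file."""
    texts = set()
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("# text ="):
                texts.add(line.split("=", 1)[1].strip())
    return texts

# Assumed paths: PROIEL train vs. Perseus dev/test.
train = sentence_texts("grc_proiel-ud-train.conllu")
for split in ("grc_perseus-ud-dev.conllu", "grc_perseus-ud-test.conllu"):
    overlap = train & sentence_texts(split)
    print(f"{split}: {len(overlap)} sentences also appear in the PROIEL train set")
```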

silvia-stopponi commented 2 months ago

We presented our Stanza-based model for inscriptions last week, and we found much interest among scholars in having a well-performing lemmatizer for Ancient Greek inscriptions. However, the manually lemmatized data available for training are scarce: we retrained a Stanza grc model on the CGRN corpus, which is not even fully lemmatized (only 26,797 tokens are lemmatized). Other manually lemmatized corpora do not follow the same lemmatization standard (the Liddell-Scott dictionary), so we could not merge the corpora to get more training data. Moreover, as far as I know, there is no corpus of inscriptions with POS-tag annotation. Does it make sense/is it possible to train a new Stanza model on inscriptions (together with some literary data, I suppose)? If so, would the lemmatization performance differ in practice from an existing grc model retrained on inscriptions, as we did with the AGILe lemmatizer? Thanks!

AngledLuffa commented 2 months ago

My understanding here is that the lemmatization of the inscriptions is different from that in the UD datasets? We could train a new lemmatizer if you can make the data available. If it uses the same annotation scheme, we can even train a combined model, if that would help.

silvia-stopponi commented 2 months ago

I can surely make our training data available. But the inscriptions are only (partially) lemmatized, with no other kind of annotation (no POS, no syntactic annotation). Could they be used anyway?

AngledLuffa commented 2 months ago

If all you want is a lemmatizer, we can create one from whatever data you provide. There isn't currently a code path for training it without POS, though, so that would need to be an added feature. Also, the predictive seq2seq model for unknown words will obviously do better with more data, and with POS tags if available.
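For what it's worth, lemma-only data can still be expressed in CoNLL-U by leaving every column other than ID, FORM, and LEMMA underspecified. A minimal sketch of converting (form, lemma) pairs into that shape; the example tokens are illustrative:

```python
def to_conllu(sentences):
    """Render sentences of (form, lemma) pairs as minimal CoNLL-U,
    with all columns other than ID, FORM, and LEMMA left as '_'."""
    lines = []
    for sent in sentences:
        for i, (form, lemma) in enumerate(sent, start=1):
            lines.append("\t".join([str(i), form, lemma] + ["_"] * 7))
        lines.append("")  # blank line ends the sentence
    return "\n".join(lines)

# Illustrative fragment: (surface form, lemma) pairs.
print(to_conllu([[("θεοί", "θεός"), ("ἔδοξεν", "δοκέω")]]))
```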