stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

ADDING OLD ENGLISH AS A NEW LANGUAGE FOR THE PIPELINE #1365

Closed dmetola closed 3 months ago

dmetola commented 6 months ago

Hi,

I'd like to express my interest in getting Old English added as a new language for Stanza. Here is a link to the dataset, already split into train, test, and dev sets, along with the word vectors for it:

https://we.tl/t-DwhNCPQxEI

I have tested training the tokenizer, POS tagger, lemmatizer, and dependency parser.

Several of us are working on this project, so how does it work to have our names attached to it? Do we need to publish the dataset somewhere?

If you need anything else from me, please let me know.

Thanks for your help throughout this project, and for your work in general!

AngledLuffa commented 6 months ago

I think there might be something missing from the WeTransfer - the embeddings file is only 615 bytes.

My first question would be, do you have any interest or capacity for turning this into a Universal Dependencies treebank? Generally speaking the treebanking effort gets some kind of mid-tier publication, which addresses your question of getting your names added to it. At that point, it's permanently (*) out there for everyone to use, including us updating models on a regular basis when there are updates to the data. If that's of interest, I can put you in touch with the relevant people or you can just look at the universaldependencies.org page to find the contact information.

*: occasionally the treebank validation requirements change, usually nothing too onerous, at which point there is a time limit of a few years to update the dataset to meet the new requirements

dmetola commented 6 months ago

Hi,

I generated the embeddings from a 3-million-word corpus (that's all that is available), so maybe that's why the file is small. If that's not the case, I'll take a look at it, but I was able to train the models that require it without further issues.

I know about publishing it as a treebank. The thing is that the dataset is not mine but one of my collaborators', so it's up to them whether to publish it. I have already suggested publishing the dataset as a treebank, so I'll mention it again for them to consider. In the meantime, I'll take you up on your offer to put me in touch with the UD people, if that's OK.

Thanks!

Darío

AngledLuffa commented 6 months ago

Just to reiterate, the embeddings posted were <1K in total size, not three million words :) I believe the file was corrupted or was too large for WeTransfer to accept.

The contribute page has some explanation on how to get started:

https://universaldependencies.org/contribute.html

In general you can reach out to Dan Zeman or Joachim Nivre - I'll be happy to write an introduction email if you put me in touch with your collaborator. I can always just @ them here to see what they think (although I'm not sure it works across repos like that): @dan-zeman @jnivre

My understanding is that such a publication would probably not make it to ACL or EMNLP in this day and age unless there are novel techniques in building the dataset, but the publication would be perfectly suitable for a workshop at those venues or at a conference such as SyntaxFest. Perfectly respectable venues - at least I hope so, considering I recently published a treebanking tool at SyntaxFest! At any rate, that and posting the dataset on UD are the simplest ways I can think of to make a permanent record of your contributions, especially as part of a process independent from Stanza models.

In the short term, I will be happy to include these models in a new release of Stanza, but we'd need to coordinate on how to get the word embeddings across to us.

AngledLuffa commented 6 months ago

Here is a possible venue which includes low resource languages as a category, whose deadline gives you roughly three months to prepare:

https://www.aclweb.org/portal/content/7th-international-conference-natural-language-and-speech-processing

dmetola commented 6 months ago

Thanks for your response. I think there was an error, and I don't know why I had two files. I have found another embeddings file, which is 22MB; that should be the good one. Please confirm, and if not I'll sort it out. Here's the link:

https://we.tl/t-2CHqdwlTqt

Thanks for the offer and for the info. For now it would be nice if you could send them the intro email and copy me (dametola@gmail.com) and javier.martin@unirioja.es.

That way we can start working on the treebank as soon as they decide to go ahead.

I'll keep that info on journals/conferences for future reference, in case they would like to present the project there.

Please let me know if there's anything I can do to help with this.

Thanks!

AngledLuffa commented 6 months ago

Hey, so I was able to download the embeddings and see the matrix. Is this the expected shape?

torch.Size([57186, 100])

The only problem is that the words aren't included. The pretrain format saved by Stanza aligns the words and the values. Are you able to send that, or the missing word list?
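
If it helps to verify on your end, here is a minimal sketch of inspecting a Stanza pretrain file (the attribute names are my reading of the pretrain module; adjust the path to wherever your file lives):

from stanza.models.common.pretrain import Pretrain

pt = Pretrain("ang_embeddings.pt")
print(pt.emb.shape)   # the embedding matrix, e.g. (57186, 100)
print(len(pt.vocab))  # should equal the number of rows, words aligned to vectors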

Also, I see that the rest of the folder is the data files for ANG, not the models. Is the expectation that I will rebuild the models from that? Totally fine if that's the case.

In terms of other embeddings we could use, I found a couple of historic English transformers, although I'm not sure they are suitable for the particular dialect in use here:

https://huggingface.co/dbmdz/bert-base-historic-english-cased (they say it's not doing great at word prediction)
https://huggingface.co/bigscience-historical-texts/bert-base-blbooks-cased
https://huggingface.co/emanjavacas/MacBERTh

Perhaps those are not built with English from long enough ago. Would you take a look and tell me what you think?

Also, depending on how many tokens are in your collection (3M?), there may be some mileage in building a character model out of that. 3M is a bit on the small side, though... if you have more tokens than that, it would be likely to help.

dmetola commented 6 months ago

Hi,

I think that is the correct shape, but I'm not entirely sure, since I'm not very experienced in pretraining embeddings. I'll double-check and come back to you on that.

I have the models, but I realized I made a mistake when training them: I used the first embeddings file I sent you (the small one), so I guess that if I use the latest one my scores should improve. Again, I'll try that today and come back with more info.

The embeddings you suggested won't work for this. They cover English from the year 1450 onwards, while this corpus runs from the 5th century to the 12th; those periods are effectively different languages (think German versus English). The corpus of Old English is only 3M words - that is all the surviving text available. That's why I'm very concerned about getting the embeddings file right, since I haven't been able to find anything similar, and it's something very sought after by my collaborator.

By word list, do you mean the list of tokens, or the raw corpus? I have the latter, not the former. I'll run the pretrain scripts from Stanza again to see whether I made a mistake and can get the embeddings right. Is there a way I can check that the format is correct before sending it back to you? Is there anything else I should know about the pretrain scripts in Stanza for embeddings?

dmetola commented 6 months ago

Hi again,

I think I have sorted out the issue. I have built ang_embeddings.pt from the txt file with the vectors. I'm attaching it:

https://we.tl/t-1WOmZf1xdc

In the meantime I'm retraining the POS tagger and depparser, so when I finish those I'll send the models. Could you confirm that the .pt file is correct? According to the pretrain script, the emb and vocab have been saved to the .pt file.

Again, thanks for your help in this

AngledLuffa commented 6 months ago

Thank you, that's exactly what I meant. It has both the words and the vectors themselves.

I have my doubts about whether 64MB is enough text to train a character model that will make a meaningful improvement, but I can try it and send it back to you if you like. You could also try going through the charlm instructions yourself - you'd need to change the part where you split the data, making it 54 MB training and 10 MB dev, for example.

https://stanfordnlp.github.io/stanza/new_language_charlm.html
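
For instance, a rough split along those lines might look like this (a sketch, not the official prep script; the file names are placeholders):

train_frac = 54 / 64  # 54 MB train, 10 MB dev, per the split suggested above

with open("ang_raw.txt", encoding="utf-8") as fin:
    lines = fin.readlines()

split = int(len(lines) * train_frac)
with open("ang.train.txt", "w", encoding="utf-8") as fout:
    fout.writelines(lines[:split])
with open("ang.dev.txt", "w", encoding="utf-8") as fout:
    fout.writelines(lines[split:])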

Another option to consider: there are a few methods for training a transformer from small amounts of data.

One is to start with a related language's transformer, then finetune on the smaller dataset for the target language. This group used that method for Ancient Greek with Modern Greek as the starting language, though those two are perhaps closer to each other than Middle or Modern English is to Old English:

https://huggingface.co/pranaydeeps/Ancient-Greek-BERT

There's also Gessler & Zeldes's MicroBERT technique of using a model with far fewer parameters:

https://arxiv.org/abs/2212.12510

My experience with GRC was that the Greek & finetune approach worked better than the smaller transformer model for downstream tasks, but again, that relies on having a reasonably closely related starting language.
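
As a very rough illustration of the finetune-from-a-related-language idea (not a recipe we've validated for ANG; the starting checkpoint and file path are placeholders), the Hugging Face masked-LM loop looks roughly like:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# start from some related-language checkpoint (placeholder name)
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# raw target-language text, one passage per line (placeholder path)
raw = load_dataset("text", data_files={"train": "ang.txt"})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="ang-mlm", num_train_epochs=3),
                  train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()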

Last update (I seem to have inadvertently found a sequence of keystrokes that hits the Post or Update button without meaning to)

Since you say that you're redoing the models with the current vectors, I'll hold off on posting ANG to HF until you send back the updated models. Looking forward to it!

dmetola commented 6 months ago

Hi! Thanks for your thorough response and suggestions!

First off, we already considered using some form of transformer, but there is no language close enough to Old English. Also, my collaborator is interested in building it this way first and then working with further data. Still, I'll take a look at the character model.

I am in the process of finishing the training of the depparser. So far, the POS tagger gets to 68%, and the depparser is at roughly 61%. Once they finish, I'll send them to you, together with the tokenizer and lemmatizer. I'll also try training the character model and do a small training run to see if there are improvements.

Thanks!

AngledLuffa commented 6 months ago

Alright, sounds good. I can also try the charlm on my side - should be pretty easy for me to fire it up considering my existing familiarity with the tool!

AngledLuffa commented 6 months ago

Oh wait, the file titled ang_ewt-ud-complete.txt wasn't the raw ANG data - that was the raw vectors file. No worries, but that certainly isn't suitable for a charlm.

How much raw data is there, roughly speaking?

dmetola commented 6 months ago

I'm attaching all the raw data I have available. If I'm not mistaken, that's 3 million words:

ang.txt

dan-zeman commented 6 months ago

> In general you can reach out to Dan Zeman or Joachim Nivre - I'll be happy to write an introduction email if you put me in touch with your collaborator. I can always just @ them here to see what they think (although I'm not sure it works across repos like that): @dan-zeman @jnivre

It does work :-) I confirm that we'll be more than happy to add Old English to the UD collection!

dmetola commented 6 months ago

Hi,

Please find attached the pt files for the tokenizer, lemmatizer, POS tagger, and depparser.

https://we.tl/t-luBVQZVHDr

I have a question:

Once the tagger is trained, when preparing the treebank for the depparser, I'm getting the following output:

2024-03-14 11:56:46 INFO: Loading data with batch size 250...
2024-03-14 11:56:47 INFO: Start evaluation...
2024-03-14 11:56:54 INFO: UPOS  XPOS    UFeats  AllTags
2024-03-14 11:56:54 INFO: 99.51 99.42   98.01   97.51
2024-03-14 11:56:54 INFO: POS Tagger score: ang_test 97.51
2024-03-14 11:56:54 INFO: Running tagger to retag /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.dev.gold.conllu to ./data/depparse/ang_test.dev.in.conllu
  Args: ['--wordvec_dir', './extern_data/wordvec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'predict', '--save_dir', 'saved_models/pos', '--save_name', 'ang_test_nocharlm_tagger.pt', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--eval_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.dev.gold.conllu', '--output_file', './data/depparse/ang_test.dev.in.conllu']
2024-03-14 11:56:54 INFO: Running tagger in predict mode
2024-03-14 11:56:54 INFO: Loading model from: saved_models/pos/ang_test_nocharlm_tagger.pt
2024-03-14 11:56:54 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
2024-03-14 11:56:54 INFO: Loading data with batch size 250...
2024-03-14 11:56:54 INFO: Start evaluation...
2024-03-14 11:56:55 INFO: UPOS  XPOS    UFeats  AllTags
2024-03-14 11:56:55 INFO: 85.10 84.88   72.42   68.58
2024-03-14 11:56:55 INFO: POS Tagger score: ang_test 68.58
2024-03-14 11:56:55 INFO: Running tagger to retag /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.test.gold.conllu to ./data/depparse/ang_test.test.in.conllu
  Args: ['--wordvec_dir', './extern_data/wordvec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'predict', '--save_dir', 'saved_models/pos', '--save_name', 'ang_test_nocharlm_tagger.pt', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--eval_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.test.gold.conllu', '--output_file', './data/depparse/ang_test.test.in.conllu']
2024-03-14 11:56:55 INFO: Running tagger in predict mode
2024-03-14 11:56:55 INFO: Loading model from: saved_models/pos/ang_test_nocharlm_tagger.pt
2024-03-14 11:56:55 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
2024-03-14 11:56:55 INFO: Loading data with batch size 250...
2024-03-14 11:56:55 INFO: Start evaluation...
2024-03-14 11:56:57 INFO: UPOS  XPOS    UFeats  AllTags
2024-03-14 11:56:57 INFO: 91.42 91.38   85.31   82.41
2024-03-14 11:56:57 INFO: POS Tagger score: ang_test 82.41

Which is the correct POS tagger score? After training the tagger, I got a score of 68.58. If the scores differ when preparing the next treebank, I can guess the same is going to happen with the depparser. Could you confirm that?

If that's the case, how should I report this? I have saved the terminal output of the training process for both models to generate learning curve graphs for visualization. If the scores at evaluation time differ from those during training, how should I reflect the difference?

Thanks!

PS, I'll contact Dan next week to ask for further information about format of the treebank, or any other information we may need to start working on that.

AngledLuffa commented 6 months ago

The three scores for prepare_depparse are the tag scores when retagging each of the train, dev, and test sections. The train section score is suitably high, after all... When you say the score you get for the tagger is ~68, is that for running the dev set? If you then run the test set, does it get around 82? If so, that sounds like things are working as intended.

I started running the charlm training. I don't have a ton of confidence that this amount of data will produce a usable charlm, but it doesn't hurt to try. We can also try with a lower number of parameters to compensate for the data size. I'll let you know when I have something.

What name should I give your dataset / models? I could call them "dario", or stick with your name of "test", but none of that seems ideal. Probably something based on whatever UD name you want to use will work best.

Glad that the UD connection is made!

dmetola commented 6 months ago

Hi,

If I'm understanding correctly, you are saying that the learning curve I'm tracking is for the dev set, and the rest of the scores are just evaluations against the three splits? I can go with that when writing up the process for their publication, if I've got that right.

Is there a way of replicating the evaluation process for the depparser? Is there a section in the documentation for that? If I'm adding those scores for the POS tagger, I guess I should add similar information about the depparser as well.

The 68 score is for the dev set; I haven't run the test set myself. From the extract I shared with you, the 82 is for the test set, so I guess that's what you are referring to.

About the naming convention, I'm not sure what my collaborator is going to name it, since the data is not mine; I'm just working on the models. You can go with "nerthus" for the moment, since that's the name of the research group I'm part of.

Thanks!

AngledLuffa commented 6 months ago

> Is there a way of replicating the evaluation process for the depparser?

Just to clarify, the scores it reported when running prepare_depparse are for the POS. If that's not well documented, I ought to clear that up...

You can evaluate specifically the dev set as part of run_pos.py, run_depparse.py, etc, with --score_dev. The test set, with --score_test.
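
For example (using the ang_test shorthand from this thread, and assuming it resolves the same way the full UD treebank names do):

python -m stanza.utils.training.run_pos ang_test --score_dev
python -m stanza.utils.training.run_pos ang_test --score_test
python -m stanza.utils.training.run_depparse ang_test --score_dev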

I'm not sure if I've fully answered your questions, but please let me know if you need more.

dmetola commented 6 months ago

Yes, that is what I was referring to. Can I also evaluate the train set? I tried --score_train, but it didn't work; maybe I made that flag up.

I'll make a note on the scores in the meantime

Thanks!

AngledLuffa commented 6 months ago

There's no --score_train... I could add such a thing, if you think it's relevant. You can work around it for now by doing

--score_dev --eval_file <path_to_train>

Awkwardly, the depparse also needs --gold_file <path_to_train>. I should change that so it no longer needs both.
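
For example, scoring the train split might look like this (the file paths are hypothetical - substitute wherever your prepared train files actually live):

python -m stanza.utils.training.run_pos ang_test --score_dev --eval_file data/pos/ang_test.train.in.conllu
python -m stanza.utils.training.run_depparse ang_test --score_dev --eval_file data/depparse/ang_test.train.in.conllu --gold_file data/depparse/ang_test.train.gold.conllu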

AngledLuffa commented 6 months ago

Alright, I upgraded run_depparse.py so it only needs one --eval_file argument to change the file used for scoring. That should make it easier to get scores on the training set, if you want them.

I also posted three versions of our character model. I put them on HF as the nerthus1024, nerthus512, and nerthus256 packages, with hidden dimensions 1024, 512, and 256 wide, respectively. You should be able to rerun the POS and depparse training using them with the --charlm nerthus1024 etc. options. If for some reason that doesn't work, please LMK and we'll figure it out.

I don't have particularly high hopes for them, to be honest... this would be the smallest charlm we've used so far. Still, it might be interesting to find out that it does actually help performance.
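
The retraining invocations would presumably look something like this (treebank shorthand as above; exactly how the package name gets resolved is something we can iron out if it misbehaves):

python -m stanza.utils.training.run_pos ang_test --charlm nerthus1024
python -m stanza.utils.training.run_depparse ang_test --charlm nerthus1024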

dmetola commented 6 months ago

Hi!

I have tried retraining the tagger and parser with the charlm model, the 1024 one - thanks for this!

I tried with just 500 steps; the previous versions of the tagger and parser were trained with 4000 steps. So far, with the 1024 charlm, the tagger improves by 2%, but the parser doesn't - it's 5% worse.

What I think is important is that efficiency is improved, since I don't need such long training sessions to achieve similar, or better, results.

Do you think training the tagger and parser with more steps - maybe not 4000, but 1000 or 2000 - is worth trying to improve the scores? The nocharlm depparser plateaued at step 3600 of 4000, while the nocharlm tagger plateaued at step 3100.

Another question: is the lemma model dependent on the charlm as well? It would be nice to retrain it if it is.

Thanks for your help!

AngledLuffa commented 6 months ago

I would definitely train the parser the full training cycle to see if it improves overall. I would hate to release a model which is both larger (and therefore slower) and also less accurate.

If you have more text available now, or if more becomes available in the future, I can try retraining those.

How long do those 500 steps take? Are you using a GPU?

dmetola commented 6 months ago

I'll train the full cycle for both, at least the same length I took in the previous version, and compare.

The tagger, for 500 steps, took 2 hours. The parser took 1.5 hours.

I'm not using a GPU, I don't have access to one at the moment.

At some point this year (I can't give dates, since I'm not the one working on the data) there will be more annotated data available, I think around 200-300k words. There is no more raw text available for this language, as far as I know; if there is, it shouldn't be a significant amount. I'm using the whole corpus of OE in existence, or nearly all of it.

dmetola commented 6 months ago

Hi,

I'm attaching the final versions of the models. I have included the lemma, POS, and depparser models trained with the charlm, so these are the latest versions available with the data I have at hand.

https://we.tl/t-YWaO3WQXm5

If you need anything else from me, please let me know. In the meantime, I'll contact UD for the next steps, regarding the treebank.

Thanks for your help and suggestions!

AngledLuffa commented 6 months ago

Do you have final numbers on the dependency parser? Just wondering if the charlm eventually helped it or not.

If & when you have the UD data released, I can try a couple of the smaller models on my side to see if it helps - I was thinking the charlm itself might overfit given the current amount of raw text training data.

dmetola commented 6 months ago

The charlm helped for the lemmatizer, tagger and dependency parser. Not too much, but it helped.

These are the results of the depparser with charlm:

UD_Old_English-TEST   UAS     LAS     CLAS    MLAS    BLEX
                      77.06   64.60   58.84   53.80   58.84

And these are the results without charlm:

UD_Old_English-TEST   UAS     LAS     CLAS    MLAS    BLEX
                      73.75   62.68   55.70   51.60   55.70

We will aim to have the UD data released, but I cannot say when, since that doesn't depend on me. I have already contacted Dan about it, so I'm waiting for his response, and then we'll start preparing everything.

Do you need the UD data to be released for the language to be added to the pipeline? Or is that independent?

Thanks!

AngledLuffa commented 6 months ago

Glad to hear it helped! If you include that in your writeup, the model used is the Flair character model:

https://aclanthology.org/C18-1139/

We would need the raw data for recreating the models, obviously. I think I should be able to post the ANG models without the data, though, today or tomorrow. Let me double check with my PI that posting trained models without the data works: @manning

dmetola commented 6 months ago

Good to hear that it's possible to post the models!

By raw data, you mean the non-annotated text? I think I sent the annotated train, dev, and test sets in my first comment in this thread, but in any case I'll send the conllu files, as well as the raw text, at the following link:

https://we.tl/t-uyd9zNQU1r

I'll let you know when the UD treebank is published. My guess is that we would need to include that article in the bibliography when referring to the ANG model? In any case, I'll take a look at the article itself.

Thanks!

AngledLuffa commented 6 months ago

It's up, and it should be downloadable with the current release (and of course the dev branch). LMK if it's not working in any way.
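
For anyone following along, a quick sanity check once the models are downloadable might look like this (a minimal sketch; the phrase is lifted from the sample below):

import stanza

stanza.download("ang")
nlp = stanza.Pipeline("ang", processors="tokenize,pos,lemma,depparse")
doc = nlp("Swa awriten is")
for word in doc.sentences[0].words:
    print(word.text, word.lemma, word.upos, word.deprel)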

Although, flipping through the data, I think people will likely not be happy with the lemma annotations:

1       Swa     swā ‘so as, consequently’       ADV     adverb  Uninflected=Yes 2       advmod  _       _
2       awriten āwrītan VERB    main-verb       Tense=Past|Uninflected=Yes|VerbForm=Part        0       root    _       _
3       is      bēon/wesan/sēon ‘to be’ AUX     auxiliary-verb  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   2       aux:pass        _       _
4       on      on ‘on, upon’ (PREP)    ADP     adposition      Uninflected=Yes 7       case    _       _
5       ðæs     se-sēo-ðæt (DEM)        DET     demonstrative-article   Case=Gen|Gender=Masc|Number=Sing|PronType=DemArt        6       det     _       _

The biggest downside is that this type of annotation won't work at all for unknown text. The lemmatizer is a seq2seq model and isn't really appropriate for producing the glosses and tags embedded in these lemma fields.

dmetola commented 6 months ago

Thanks, I have tried it and it works! When will it be visible on the list of available models? Just out of curiosity.

Thanks for spotting the issue with the lemmas. I wasn't too happy with that when I saw it, but I didn't work on the data myself. I'll let my collaborators know, so they can sort it out before publishing the treebank or adding it to UD. Should we take the guidelines at universaldependencies.org into consideration? My collaborator followed an article when annotating the data, but it wasn't from there, it was from somewhere else (I'm not entirely sure about this).

PS. I've tried the lemmatizer on a sentence to see if it works, and it does annotate the lemmas. I haven't checked whether they're all correct, but what it produced looks right. Still, I'll suggest changing that.

Thanks!

AngledLuffa commented 6 months ago

> PS. I've tried the lemmatizer on a sentence to see if it works, and it does annotate the lemmas. I haven't checked whether they're all correct, but what it produced looks right. Still, I'll suggest changing that.

This is probably because the model memorizes known lemmas. If you know of any ANG words which aren't in the training data, I wouldn't expect it to work too well.

When will it be visible on the list of available models? Just out of curiosity.

Good question. How about with the next release? There are a couple things to update between now and then, so probably in a couple weeks.

Leave this open until then?

dmetola commented 6 months ago

I still want to do some trials with the model. It makes sense that the lemmatizer won't work well with unknown words, but let's see if that's the case.

About leaving the thread open until the next release, that sounds good to me.

Thanks!

dmetola commented 5 months ago

Hi,

I just wanted to double check the state of this. I know I said that it worked - maybe because I had the processors on my machine - but I have tried on another laptop and I'm getting "unknown language request: ang".

Any ideas on why this is happening?

Thanks!

AngledLuffa commented 5 months ago

Is the dev branch installed?

There's one last upgrade I want to merge, after which I can make a new release, so probably this weekend. At that point it will be live for everyone


dmetola commented 5 months ago

It works on my laptop with the dev branch. On the other laptop I tried, it didn't; I cloned the repo and pulled the dev branch, without luck. I'll wait until it's live for everyone and check again.

Thanks for the update!

AngledLuffa commented 5 months ago

Want to check a test release?

https://test.pypi.org/project/stanza/1.8.2/

dmetola commented 5 months ago

Hi!

I have tested it in a directory outside where my models are stored, and it works.

The only thing is that I needed to make a small change in a Python file; I was getting the following error:

(nlp) dario@192 VSCode-Projects % python trial.py
Traceback (most recent call last):
  File "/Users/dario/VSCode-Projects/trial.py", line 1, in <module>
    import stanza
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/__init__.py", line 1, in <module>
    from stanza.pipeline.core import DownloadMethod, Pipeline
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/pipeline/core.py", line 17, in <module>
    from stanza.models.common.doc import Document
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/models/common/doc.py", line 14, in <module>
    import networkx as nx
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/__init__.py", line 84, in <module>
    import networkx.generators
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/generators/__init__.py", line 5, in <module>
    from networkx.generators.classic import *
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/generators/classic.py", line 21, in <module>
    from networkx.algorithms.bipartite.generators import complete_bipartite_graph
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/algorithms/__init__.py", line 12, in <module>
    from networkx.algorithms.dag import *
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 2, in <module>
    from fractions import gcd
ImportError: cannot import name 'gcd' from 'fractions' (/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/fractions.py)

It was sorted by changing line two in dag.py to: from math import gcd

Maybe there's a better way to deal with that issue - it seems fractions.gcd was removed in newer versions of Python - but in any case I thought you might want to take a look at that.

Otherwise, everything seems to work fine on my side. I have tried all four processors and they work without issues.

Thanks!

AngledLuffa commented 5 months ago

That's in one of the libraries we import, not our stuff. Maybe it's necessary to update networkx. What version of networkx are you using?
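
If updating does turn out to be the fix, it would presumably just be:

pip install --upgrade networkx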


dmetola commented 5 months ago

This is the version:

Name: networkx
Version: 2.0.dev20160901144005

AngledLuffa commented 5 months ago

Yeah, that's probably the issue. I have version 3.1, and they've produced releases since then as well.