Closed dmetola closed 3 months ago
I think there might be something missing from the WeTransfer - the embeddings file is only 615 bytes.
My first question would be, do you have any interest or capacity for turning this into a Universal Dependencies treebank? Generally speaking the treebanking effort gets some kind of mid-tier publication, which addresses your question of getting your names added to it. At that point, it's permanently (*) out there for everyone to use, including us updating models on a regular basis when there are updates to the data. If that's of interest, I can put you in touch with the relevant people or you can just look at the universaldependencies.org page to find the contact information.
*: occasionally the treebank validation requirements change, usually nothing too onerous, at which point there is a time limit of a few years to update the dataset to meet the new requirements
Hi,
I generated the embeddings with a 3 million-word corpus (that's what is available), so maybe that's why it is small. If that's not the case I'll take a look at it, but I was able to train the models that require it without further issues.
I know about publishing it as a treebank. The thing is that the dataset is not mine but one of my collaborators', so it's up to them whether to publish it or not. I already suggested publishing the dataset as a treebank, so I'll mention it again for them to consider. In the meantime, I'll take you up on your offer to put me in touch with the UD people, if that's OK.
Thanks!
Darío
Just to reiterate, the embeddings posted were <1K in total size, not three million words :) I believe the file was corrupted or was too large for WT to accept.
The contribute page has some explanation on how to get started:
https://universaldependencies.org/contribute.html
In general you can reach out to Dan Zeman or Joachim Nivre - I'll be happy to write an introduction email if you put me in touch with your collaborator. I can always just @ them here to see what they think (although I'm not sure it works across repos like that): @dan-zeman @jnivre
My understanding is that such a publication would probably not make it to ACL or EMNLP in this day and age unless there are novel techniques in building the dataset, but the publication would be perfectly suitable for a workshop at those venues or at a conference such as SyntaxFest. Perfectly respectable venues - at least I hope so, considering I recently published a treebanking tool at SyntaxFest! At any rate, that and posting the dataset on UD are the simplest ways I can think of to make a permanent record of your contributions, especially as part of a process independent from Stanza models.
In the short term, I will be happy to include these models in a new release of Stanza, but we'd need to coordinate on how to get the word embeddings across to us.
Here is a possible venue which includes low resource languages as a category, whose deadline gives you roughly three months to prepare:
Thanks for your response. I think there was an error, and I don't know why I have two files. I have found another embeddings file, which is 22MB; that should be the good one. Please confirm whether it is; if not, I'll need to sort that out. Here's the link
Thanks for the offer and the info. For now it would be nice if you could send them the intro email, copying me (dametola@gmail.com) and javier.martin@unirioja.es.
That way we can start working on the treebank as soon as they decide.
I'll keep that info of journals/conferences for future reference, in case they would like to present the project there.
Please let me know if there's anything I can do for this
Thanks!
Hey, so I was able to download the embeddings and see the matrix. Is this the expected shape?
torch.Size([57186, 100])
The only problem is that the words aren't included. The pretrain format saved by Stanza would align the words and the values. Are you able to send that, or are you able to send the missing word list?
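As an aside, the alignment described above is what the plain word2vec text format captures: a header line with the vocab size and dimension, then one line per word with the word followed by its values, which, as far as I know, Stanza's pretrain tooling can convert. A minimal sketch with made-up words and vectors:

```python
# Sketch of the word2vec-style text format that keeps each word aligned
# with its vector: header "vocab_size dim", then "word v1 v2 ..." lines.
# The words and values below are made-up placeholders.

def write_word2vec_txt(path, words, vectors):
    """Write words and their vectors; the two lists must be aligned."""
    assert len(words) == len(vectors)
    dim = len(vectors[0])
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {dim}\n")
        for word, vec in zip(words, vectors):
            f.write(word + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")

write_word2vec_txt("ang_vectors.txt",
                   ["swa", "awriten", "is"],
                   [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```

A matrix saved on its own (as in the file above) loses exactly this word-to-row mapping, which is why the word list is needed.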
Also, I see that the rest of the folder is the data files for ANG, not the models. Is the expectation that I will rebuild the models from that? Totally fine if that's the case.
In terms of other embeddings we could use, I found a couple of historic English transformers, although I'm not sure they are suitable for the particular dialect in use here:
https://huggingface.co/dbmdz/bert-base-historic-english-cased (they say it's not doing great at word prediction)
https://huggingface.co/bigscience-historical-texts/bert-base-blbooks-cased
https://huggingface.co/emanjavacas/MacBERTh
Perhaps those are not built with English from long enough ago. Would you take a look and tell me what you think?
Also, depending on how many tokens are in your collection (3M?), there may be some mileage in building a character model out of that. 3M is a bit on the small side, though... if you have more tokens than that, it would be likely to help.
Hi,
I think that is the correct shape, but I'm not entirely sure, since I'm not very experienced in pretraining embeddings. I'll double check and come back to you on that.
I have the models, but I have realized that I made a mistake when training them: I used the first embeddings file I sent you (the small one). I guess that if I use the latest one I sent, my scores should improve. Again, I'll try that today and come back with more info.
The embeddings you are suggesting won't work for this. They cover English from the year 1450 onwards, while this corpus runs from the 5th century to the 12th. Those periods are effectively different languages (think German and English). The corpus of Old English is only 3M words; that is all the surviving text available. That's why I'm very concerned with getting the embeddings file right, since I haven't been able to find anything similar, and it's something my collaborator is very keen on.
By word list, do you mean the list of tokens, or the corpus in raw? I have the latter, not the former. I'll train using the pretrain scripts from Stanza again, to see if I made a mistake and I can get the embeddings right. Is there a way I can check the format is correct before sending that back to you? Is there anything else I should know from the pretrain scripts in Stanza for embeddings?
Hi again,
I think I have sorted out the issue. I have regenerated ang_embeddings.pt from the txt file with the vectors. I'm attaching it.
In the meantime I'm retraining the POS tagger and depparser, so when I finish those I'll add the models. Could you confirm that the .pt file is correct? According to the pretrain script, the emb and vocab have been saved to the .pt file.
Again, thanks for your help in this
Thank you, that's exactly what I meant. It has both the words and the vectors themselves.
I have my doubts about whether 64MB is enough text to train a character model that will make a meaningful improvement, but I can try it and send that back to you if you like. You could also try going through the charlm instructions yourself - you'd need to change the part where you split the data to make it 54 MB training and 10 MB dev, for example.
https://stanfordnlp.github.io/stanza/new_language_charlm.html
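The split described above can be scripted; here's a hedged sketch (the byte budget follows the 54 MB example in this thread, and the function name is mine) that cuts at line boundaries so no line is split between the two files:

```python
# Hypothetical sketch: carve a raw text corpus into a large training
# portion and a smaller dev portion by byte budget, switching files only
# at line boundaries so no line is split in half.

TRAIN_BYTES = 54 * 1024 * 1024  # ~54 MB train, remainder goes to dev

def split_corpus(src, train_path, dev_path, train_bytes=TRAIN_BYTES):
    written = 0
    with open(src, encoding="utf-8") as f, \
         open(train_path, "w", encoding="utf-8") as train, \
         open(dev_path, "w", encoding="utf-8") as dev:
        for line in f:
            if written < train_bytes:
                train.write(line)
                written += len(line.encode("utf-8"))
            else:
                dev.write(line)
```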
Another option to consider is there are a few methods for training a transformer out of small amounts of data.
There's a method to start with a related language's transformer, then finetune on the smaller dataset for the target language. This group used that method for Ancient Greek, with Modern Greek as the starting language; perhaps those two are closer to each other than Middle or Modern English is to Old English:
https://huggingface.co/pranaydeeps/Ancient-Greek-BERT
There's also Gessler & Zeldes's MicroBERT technique of using a model with far fewer parameters:
https://arxiv.org/abs/2212.12510
My experience with GRC was that the Greek & finetune approach worked better than the smaller transformer model for downstream tasks, but again, that relies on having a reasonably closely related starting language.
Last update (I seem to have inadvertently found a sequence of keystrokes that hits the Post or Update button without meaning to)
Since you say that you're redoing the models with the current vectors, I'll hold off on posting ANG to HF until you send back the updated models. Looking forward to it!
Hi! Thanks for your thorough response and suggestions!
First off, we already considered using some form of transformer, but there is no language close enough to Old English. Also, my collaborator is interested in building it this way first, and then starting to work with further data. Still, I'll take a look at the character model.
I am in the process of finishing the training of the depparser. So far, the POS tagger gets to 68%, and the depparser is at roughly 61%. Once I finish them, I'll send those to you, together with the tokenizer and lemmatizer. I'll also try training the character model again and run a short training to see if there are improvements.
Thanks!
Alright, sounds good. I can also try the charlm on my side - should be pretty easy for me to fire it up considering my existing familiarity with the tool!
Oh wait, the file titled ang_ewt-ud-complete.txt wasn't the raw ANG data; that was the raw vectors file. No worries, but that certainly isn't suitable for a charlm.
How much raw data is there, roughly speaking?
I'm attaching all the raw data that I have available. If I'm not wrong, that's 3 million words.
In general you can reach out to Dan Zeman or Joachim Nivre - I'll be happy to write an introduction email if you put me in touch with your collaborator. I can always just @ them here to see what they think (although I'm not sure it works across repos like that): @dan-zeman @jnivre
It does work :-) I confirm that we'll be more than happy to add Old English to the UD collection!
Hi,
Please find attached the pt files for the tokenizer, lemmatizer, POS tagger, and depparser.
I have a question:
Once the tagger is trained, when preparing the treebank for the depparser I'm getting the following message:
2024-03-14 11:56:46 INFO: Loading data with batch size 250...
2024-03-14 11:56:47 INFO: Start evaluation...
2024-03-14 11:56:54 INFO: UPOS XPOS UFeats AllTags
2024-03-14 11:56:54 INFO: 99.51 99.42 98.01 97.51
2024-03-14 11:56:54 INFO: POS Tagger score: ang_test 97.51
2024-03-14 11:56:54 INFO: Running tagger to retag /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.dev.gold.conllu to ./data/depparse/ang_test.dev.in.conllu
Args: ['--wordvec_dir', './extern_data/wordvec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'predict', '--save_dir', 'saved_models/pos', '--save_name', 'ang_test_nocharlm_tagger.pt', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--eval_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.dev.gold.conllu', '--output_file', './data/depparse/ang_test.dev.in.conllu']
2024-03-14 11:56:54 INFO: Running tagger in predict mode
2024-03-14 11:56:54 INFO: Loading model from: saved_models/pos/ang_test_nocharlm_tagger.pt
2024-03-14 11:56:54 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
2024-03-14 11:56:54 INFO: Loading data with batch size 250...
2024-03-14 11:56:54 INFO: Start evaluation...
2024-03-14 11:56:55 INFO: UPOS XPOS UFeats AllTags
2024-03-14 11:56:55 INFO: 85.10 84.88 72.42 68.58
2024-03-14 11:56:55 INFO: POS Tagger score: ang_test 68.58
2024-03-14 11:56:55 INFO: Running tagger to retag /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.test.gold.conllu to ./data/depparse/ang_test.test.in.conllu
Args: ['--wordvec_dir', './extern_data/wordvec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'predict', '--save_dir', 'saved_models/pos', '--save_name', 'ang_test_nocharlm_tagger.pt', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--eval_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpe956v25x/ang_test.test.gold.conllu', '--output_file', './data/depparse/ang_test.test.in.conllu']
2024-03-14 11:56:55 INFO: Running tagger in predict mode
2024-03-14 11:56:55 INFO: Loading model from: saved_models/pos/ang_test_nocharlm_tagger.pt
2024-03-14 11:56:55 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
2024-03-14 11:56:55 INFO: Loading data with batch size 250...
2024-03-14 11:56:55 INFO: Start evaluation...
2024-03-14 11:56:57 INFO: UPOS XPOS UFeats AllTags
2024-03-14 11:56:57 INFO: 91.42 91.38 85.31 82.41
2024-03-14 11:56:57 INFO: POS Tagger score: ang_test 82.41
Which is the correct POS tagger score? After training the tagger, I get a score of 68.58. If the scores differ when preparing the next treebank, I guess the same is going to happen with the depparser. Could you confirm that?
If that's the case, how should I report this? I have saved the terminal output of the training process for both models to generate learning curve graphs. If the scores at evaluation time differ from those during training, how can I reflect the difference?
Thanks!
PS, I'll contact Dan next week to ask for further information about format of the treebank, or any other information we may need to start working on that.
The three scores for prepare_depparse are the tag scores when retagging each of the train, dev, and test sections. The train section score is suitably high, after all... When you say the score you get for the tagger is ~68, is that for running the dev set? If you then run the test set, does it get around 82? If so, that sounds like things are working as intended.
I started running the charlm training. I don't have a ton of confidence that this amount of data will produce a usable charlm, but it doesn't hurt to try. We can also try with a lower number of parameters to compensate for the data size. I'll let you know when I have something.
What name should I give your dataset / models? I could call them "dario", or stick with your name of "test", but none of that seems ideal. Probably something based on whatever UD name you want to use will work best
Glad that the UD connection is made!
Hi,
If I'm understanding correctly, you are saying that the learning curve I'm plotting is for the dev set, and the rest of the scores are just the evaluation against the three splits? I can go with that when writing up the process for their publication, if I'm correct in this.
Is there a way of replicating the evaluation process for the depparser? Is there any section in the documentation for that? If I'm adding those scores for the POS tagger, I guess I should add similar information about the depparser as well.
The 68 score is for the dev set; I haven't run the test set myself. From the extract I shared with you, the 82 is for the test set, so I guess that's what you are referring to.
About the naming convention, not sure what my collaborator is going to name it, since the data is not mine, I'm just working on the model. You can go with "nerthus" for the moment, since that's the name of the research group I'm part of.
Thanks!
Is there a way of replicating the evaluation process for the depparser?
Just to clarify, the scores it reported when running prepare_depparse are for the POS. If that's not well documented, I ought to clear that up...
You can evaluate specifically the dev set as part of run_pos.py, run_depparse.py, etc., with --score_dev; the test set, with --score_test.
I'm not sure if I've fully answered your questions, but please let me know if you need more.
Yes, that is what I was referring to. Can I also evaluate the train set? I tried with --score_train, but it didn't work. Maybe I made that up.
I'll make a note on the scores in the meantime
Thanks!
There's no --score_train... I could add such a thing, if you think it's relevant. You can work around it for now by doing --score_dev --eval_file <path_to_train>. Awkwardly, the depparse also needs --gold_file <path_to_train>. I should change that so it no longer needs both.
Alright, I upgraded run_depparse.py so it only needs one --eval_file argument to change the file used for scoring. That should make it easier to get scores on the training set, if you want them.
I also posted three versions of our character model. I put them on HF as the nerthus1024, nerthus512, and nerthus256 packages: models in which the hidden dimension is 1024, 512, and 256 wide, respectively. You should be able to rerun the POS and depparse training using them with the --charlm nerthus1024 etc. options. If for some reason that doesn't work, please LMK and we'll figure it out.
I don't have particularly high hopes for them, to be honest... this would be the smallest charlm we've used so far. Still, it might be interesting to find out that it does actually help performance.
Hi!
I have tried retraining the tagger and parser with the charlm model, the 1024 one, thanks for this!
I tried with just 500 steps; the previous versions of the tagger and parser were trained for 4000 steps. So far, with the 1024 charlm, the tagger improves by 2%, but the parser doesn't: it's 5% worse.
What I think is important is that efficiency is improved, since I don't need such long training sessions to achieve similar, or better, results.
Do you think that training the tagger and parser with more steps, maybe not 4000 but 1000 or 2000, is worth trying to improve the scores? The nocharlm_depparser plateaued at step 3600 of 4000, while the nocharlm_tagger plateaued at step 3100.
Another question: is the lemma model dependent on the charlm as well? It would be nice to retrain it if it is.
Thanks for your help!
I would definitely train the parser the full training cycle to see if it improves overall. I would hate to release a model which is both larger (and therefore slower) and also less accurate.
If you have more text available now, or at some point in the future, I can try retraining those.
How long do those 500 steps take? Are you using a GPU?
I'll train the full cycle for both, at least the same length I took in the previous version, and compare.
The tagger, for 500 steps, took 2 hours. The parser took 1.5 hours.
I'm not using a GPU, I don't have access to one at the moment.
At some point this year (I can't give dates since I'm not working on the data) there will be more annotated data available, I think around 200-300k words. There is no more raw text available for this language, as far as I know; if there is, it shouldn't be a significant amount. I'm using the whole existing corpus of OE, or almost all of it.
Hi,
I'm attaching the final versions of the models. I have included the lemma, pos, and depparser models trained with the charlm models, so these are the latest versions available, with the data I have at hand.
If you need anything else from me, please let me know. In the meantime, I'll contact UD for the next steps, regarding the treebank.
Thanks for your help and suggestions!
Do you have final numbers on the dependency parser? Just wondering if the charlm eventually helped it or not.
If & when you have the UD data released, I can try a couple of the smaller models on my side to see if it helps - I was thinking the charlm itself might overfit given the current amount of raw text training data.
The charlm helped for the lemmatizer, tagger and dependency parser. Not too much, but it helped.
These are the results of the depparser with charlm:
UD_Old_English-TEST:  UAS 77.06  LAS 64.60  CLAS 58.84  MLAS 53.80  BLEX 58.84
And these are the results without charlm:
UD_Old_English-TEST:  UAS 73.75  LAS 62.68  CLAS 55.70  MLAS 51.60  BLEX 55.70
We will aim to have the UD data released, but I cannot say when, since that doesn't depend on me. I have already contacted Dan about it, so I'm waiting for his response and then we'll start preparing everything.
Do you need the UD data to be released for the language to be added to the pipeline? Or is that independent?
Thanks!
Glad to hear it helped! If you include that in your writeup, the model used is the Flair character model
https://aclanthology.org/C18-1139/
We would need the raw data for recreating the models, obv. I think I should be able to post the ANG models without the data, though, today or tomorrow. Let me double check with my PI that posting trained models w/o the data works @manning
Good to hear that it's possible to post the models!
By raw data, you mean the non-annotated text? I think I sent the annotated train, dev, and test sets previously, in my first comment in this thread, but in any case I'll send the conllu files, as well as the raw text, in the following link:
I'll let you know when the UD treebank is published. My guess is that we would need to include that article in the bibliography when referring to the ANG model? In any case I'll take a look at the article itself.
Thanks!
It's up, and it should be downloadable with the current release (and of course the dev branch). LMK if it's not working in any way
Although, flipping through the data, I think people will likely not be happy about the lemma annotations
1 Swa swā ‘so as, consequently’ ADV adverb Uninflected=Yes 2 advmod _ _
2 awriten āwrītan VERB main-verb Tense=Past|Uninflected=Yes|VerbForm=Part 0 root _ _
3 is bēon/wesan/sēon ‘to be’ AUX auxiliary-verb Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 2 aux:pass _ _
4 on on ‘on, upon’ (PREP) ADP adposition Uninflected=Yes 7 case _ _
5 ðæs se-sēo-ðæt (DEM) DET demonstrative-article Case=Gen|Gender=Masc|Number=Sing|PronType=DemArt 6 det _ _
The biggest downside is this type of annotation won't work at all for unknown text. The lemmatizer is a seq2seq model and isn't really appropriate for figuring out the tags as part of the lemma.
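As an illustration of the cleanup that would be needed, here's a hypothetical sketch (the function and regexes are mine, not part of any existing tool) that strips the quoted glosses and parenthesized category tags from a lemma string, leaving only the lemma itself; the real fix should of course follow the UD guidelines:

```python
import re

def clean_lemma(lemma):
    """Drop gloss/tag residue from a LEMMA value, keeping only the lemma."""
    lemma = re.sub(r"‘[^’]*’", "", lemma)    # quoted glosses like ‘to be’
    lemma = re.sub(r"\([^)]*\)", "", lemma)  # category tags like (PREP)
    return lemma.strip()
```

For example, this would turn the lemma column of token 4 above into just the adposition itself, without the gloss and the (PREP) tag.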
Thanks, I have tried and it works! When will it be visible on the list of available models? Just out of curiosity.
Thanks for spotting the issue with the lemmas. I wasn't too happy with that when I saw it, but I didn't work on the data. I'll let my collaborators know about this, so that they can sort it out before publishing the treebank or adding it to UD. Should we take the guidelines at universaldependencies.org into consideration? My collaborator followed an article to annotate the data, but it wasn't from there, it was from somewhere else (I'm not entirely sure about this).
PS. I've tried the lemmatizer with a sentence to see if it works, and it does annotate the lemma. I haven't checked whether it's correct, but what is annotated looks correct. Still, I'll suggest changing that.
Thanks!
PS. I've tried the lemmatizer with a sentence to see if it works, and it does annotate the lemma. I haven't checked whether it's correct, but what is annotated looks correct. Still, I'll suggest changing that.
This is probably because the model memorizes known lemmas. If you know of any ANG words which aren't in the training data, I wouldn't expect it to work too well.
When will it be visible on the list of available models? Just out of curiosity.
Good question. How about with the next release? There are a couple things to update between now and then, so probably in a couple weeks.
Leave this open until then?
I still want to do some trials with the model; it makes sense that the lemmatizer won't work well with unknown words, but let's see if that's the case.
About leaving the thread open until the next release, that sounds good to me.
Thanks!
Hi,
I just wanted to double check the state of this. I know I said that it worked, maybe because I had the processors on my machine, but I have tried on another laptop and I'm getting "unknown language request: ang".
Any idea why that is happening?
Thanks!
Is the dev branch installed?
There's one last upgrade I want to merge, after which I can make a new release, so probably this weekend. At that point it will be live for everyone
On my laptop it works, with the dev branch. On the other laptop I tried, it didn't; I cloned the repo and pulled the dev branch, without luck. I'll wait until it's live for everyone to check again.
Thanks for the update!
Want to check a test release?
Hi!
I have tested it in a directory outside where my models are stored, and it works.
The only thing is that I needed to make a small change in a Python file; I was getting the following error:
(nlp) dario@192 VSCode-Projects % python trial.py
Traceback (most recent call last):
File "/Users/dario/VSCode-Projects/trial.py", line 1, in <module>
import stanza
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/__init__.py", line 1, in <module>
from stanza.pipeline.core import DownloadMethod, Pipeline
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/pipeline/core.py", line 17, in <module>
from stanza.models.common.doc import Document
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/models/common/doc.py", line 14, in <module>
import networkx as nx
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/__init__.py", line 84, in <module>
import networkx.generators
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/generators/__init__.py", line 5, in <module>
from networkx.generators.classic import *
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/generators/classic.py", line 21, in <module>
from networkx.algorithms.bipartite.generators import complete_bipartite_graph
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/algorithms/__init__.py", line 12, in <module>
from networkx.algorithms.dag import *
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 2, in <module>
from fractions import gcd
ImportError: cannot import name 'gcd' from 'fractions' (/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/fractions.py)
It was sorted by changing line two in dag.py to:
from math import gcd
Maybe there's a better way to deal with that issue; it seems fractions.gcd was removed in Python 3.9. In any case, I thought you might want to take a look at that.
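For context, the one-line edit works because fractions.gcd was removed in Python 3.9, while math.gcd has been available since 3.5. A sketch of the compatibility shim a library could ship instead of users patching installed files:

```python
# Compatibility shim: fractions.gcd was removed in Python 3.9;
# math.gcd (available since 3.5) is the drop-in replacement.
try:
    from fractions import gcd  # Python < 3.9
except ImportError:
    from math import gcd       # Python >= 3.9

print(gcd(12, 18))  # 6
```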
Otherwise, everything seems to work fine on my side. I have tried all four processors and they work without issues.
Thanks!
That's in one of the libraries we import, not our stuff. Maybe it's necessary to update networkx. What version of networkx are you using?
This is the version
Name: networkx Version: 2.0.dev20160901144005
Yeah, that's probably the issue. I have version 3.1 and they've produced versions since then as well
Hi,
I'd like to express my interest in getting Old English added as a new language for Stanza. Please find attached the link to the dataset, already split into train, test, and dev, along with the word vectors for it.
https://we.tl/t-DwhNCPQxEI
I have tested training the tokenizer, POS tagger, lemmatizer, and depparser.
Several of us are working on this project, so how does it work to have our names added to it? Do we need to add the dataset somewhere?
If you need anything else from me, please let me know.
Thanks for your help throughout this project, and for your work in general!