By default the NER model will use cased information for the charlm (as per your intuition) and use lowercasing for the embedding, since embeddings tend to not have case information.
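To illustrate the idea (this is not Stanza internals, just the distinction described above):

# Illustrative only -- not Stanza code. The charlm consumes the raw, cased
# character sequence; the word-embedding lookup uses the lowercased form,
# since most pretrained embeddings are uncased.
word = "Università"
charlm_input = list(word)     # ['U', 'n', 'i', 'v', ...]  (case kept)
embedding_key = word.lower()  # 'università'               (case dropped)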
BUT, unless there is a compelling reason why the charlm we provide isn't sufficient, you probably just want to download the default EN models and reuse that charlm.
I suppose the run_charlm.sh script is fine, but we also left a bunch of notes on how to run the python tool directly. The short summary is that the defaults generally work, but it can be helpful to get more frequent dev scores like this:
--eval_steps 100000
and sometimes the learning goes a bit haywire at the default learning rate. If you observe the dev scores spiking up, or even worse, going to NaN, you can lower the learning rate with
--lr0 10 or even --lr0 5
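Putting those flags together, a rough sketch of the invocation might look like the following (the module path and the data-file flag names are assumptions, so double-check them against `python -m stanza.models.charlm --help` for your Stanza version; only --eval_steps and --lr0 come from the advice above):

# Sketch only: flag names other than --eval_steps / --lr0 are assumptions.
import subprocess

subprocess.run([
    "python", "-m", "stanza.models.charlm",
    "--train_file", "it_corpus_dir",  # hypothetical: directory holding the training text
    "--eval_file", "it_dev.txt",      # hypothetical dev file
    "--eval_steps", "100000",         # more frequent dev scores
    "--lr0", "10",                    # lower initial learning rate if training goes haywire
], check=True)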
OK @AngledLuffa, thanks. So you suggest not touching the other default parameters in charlm.py and ner_tagger.py, right?
About the charlm: I'm looking for Italian. I haven't seen any charlm for that language in Stanza, so I wanted to take some plain text and train one on my own. If an Italian charlm already exists, that would be great!
@AngledLuffa another question regarding the training of the charlm. I'm trying to launch charlm.sh in a Docker container with 16GB of RAM allocated. I'm seeing the script get killed on this line:
lines = open(path).readlines() # reserve '\n'
data = [list(line) for line in lines]  # here the script is killed after loading about 30% of the lines
vocab = CharVocab(data, cutoff=cutoff)
I imagine it's because the creation of the list of characters is taking too much memory. Have you found any hack to solve this problem?
Correct, we currently don't have an Italian charlm or even an Italian NER model. It's on the list of things to do, though.
More memory or less data? I can take a look to see if it's possible to allocate less memory.
Actually yes! There is already a trick which should avoid this. Put the training file in a directory by itself and pass in the directory name, not the file name.
We'll change it so such hackery is not needed in the next version.
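Roughly, the workaround looks like this (file and directory names are made up):

# Workaround sketch: move the single training file into its own directory
# and point the charlm training script at the directory, not the file.
import os, shutil

os.makedirs("charlm_train_dir", exist_ok=True)
shutil.move("it_corpus.txt", os.path.join("charlm_train_dir", "it_corpus.txt"))
# then pass "charlm_train_dir" as the training path instead of "it_corpus.txt"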
Better yet, if you want to be a guinea pig, you can look at the next version now:
Wow! What's new in this version? Anyway thanks @AngledLuffa I'll try this method and see if it works!
Well, we retrained all the models in ud 2.8, which included a bunch of data fixes in the models and a bunch of improvements in our ways of reading the data. There's various other bug fixes and some efficiency improvements in the pipeline. Also, I just added a fix which should make it so the single data file and multiple data files use the same code path in the charlm :)
Is this working for you? If not, we can try to work through it. Also, we'd definitely be interested in hearing about viable Italian NER datasets which we can use to train and distribute models.
I'm making progress. Currently I'm fighting RAM problems (the plain texts are too large for charlm training), but I'll increase my RAM capacity. Next I'll move on to NER.
One hint that comes to mind: I was inspecting the vocabulary that charlm.py builds and noticed that many datasets can contain strange characters. For example, I have some Italian texts that contain Chinese, Arabic, etc. characters (probably left over from scraping or other unknown sources), obviously with a lower frequency than alphanumeric characters. Not only that: I also found hexadecimal escape characters and so on. I print here the character frequencies for an Italian text, adding a print of the counter at this point:
def build_vocab(path, cutoff=0):
    # Requires a large amount of memory, but only need to build once
    # here we need some trick to deal with excessively large files
    # for each file we accumulate the counter of characters, and
    # at the end we simply pass a list of chars to the vocab builder
    counter = Counter()
    if os.path.isdir(path):
        filenames = sorted(os.listdir(path))
    else:
        filenames = [path]
    for filename in filenames:
        lines = readlines(path + '/' + filename)
        for line in lines:
            counter.update(list(line))
    print(counter)  # print added by me
    # remove infrequent characters from vocab
    for k in list(counter.keys()):
        if counter[k] < cutoff:
            del counter[k]
    # a singleton list of all characters
    data = [sorted([x[0] for x in counter.most_common()])]
    vocab = CharVocab(data)  # skip cutoff argument because this has been dealt with
    return vocab
and the most frequent characters are these:
Counter({' ': 7681209538, 'i': 4551454294, 'e': 4549456069, 'a': 4238537950, 'o': 3673461662, 'n': 2866799737, 't': 2731902667, 'r': 2600313164, 'l': 2411084490, 's': 1956850370, 'c': 1681083743, 'd': 1452379257, 'u': 1160736023, 'p': 1116443657, 'm': 1030144651, 'g': 682786699, 'v': 586241491, ',': 457130661, 'z': 454814799, 'f': 403655586, 'h': 384044936, '.': 377223176, 'b': 353087237, '\n': 147245274, 'q': 143800889, 'S': 118540273, 'I': 118217353, 'C': 116475805, 'A': 111440149, '’': 104395912, '0': 95776805, 'à': 95516545, 'è': 91654978, 'P': 91392963, '1': 90574488, 'L': 89241371, "'": 76209116, 'M': 76024315, '2': 71550984, 'E': 67419510, 'T': 60758504, 'R': 60336513, 'D': 59810296, 'N': 56366787, ':': 51479131, ')': 49107911, 'G': 48166272, '-': 47093491, '(': 46722880, 'O': 45860493, 'B': 42938064, 'F': 40740813, 'V': 34614337, '3': 34541965, 'k': 33717787, 'ù': 33647590, '5': 32703922, 'U': 32202105, 'y': 30960743, '4': 28286291, '9': 27387565, '"': 26535985, '8': 24948190, 'w': 24881007, '6': 23979966, '“': 22882357, '”': 22372757, 'ò': 22334291, '7': 22194867, '/': 19307840, '!': 17459881, 'x': 17019626, ';': 16387798, 'H': 15506209, 'Q': 15326138, 'é': 15280821, '?': 14966577, 'ì': 13754672, '–': 10816460, 'W': 9019297, '…': 8292947, 'K': 7256676, 'Z': 6658745, '%': 6545873, 'J': 6418997, '»': 5384401, '«': 4861160, 'X': 4666950, 'j': 4635443, ']': 4453657, '[': 4432564, '°': 3929110, 'È': 3341311, 'Y': 3147401, '|': 3134096, '&': 2604805, '\t': 2576813, '#': 2359625, '_': 2172140, '+': 2052629, '*': 1895329, '‘': 1878549, '>': 1762312, '=': 1716578, '�': 1638962, '€': 1466127, '@': 960594, '\xad': 920302, '·': 812468, 'о': 721241, '}': 717992, '{': 717357, '<': 709897, '•': 676679, '\x92': 650005, 'а': 622357, 'е': 586218, 'и': 571733, '\\': 535236, 'á': 534133, 'н': 475810, '^': 473781, '—': 464684, 'т': 451349, '̀': 450494, '´': 446731, 'Ã': 419609, '`': 394870, 'р': 390003, 'с': 377038, '$': 363224, '®': 355264, 'í': 347783, '→': 325266, 'ó': 318881, 'в': 305175, 'ü': 288814, 'л': 287508, 'к': 270952, 'â': 254923, '\u200b': 244063, 'ö': 221242, 'м': 217895, 'д': 201734, 'ú': 189713, 'п': 182577, 'ä': 181229, 'ا': 177980, 'у': 170989, '™': 162521, '\ufeff': 156089, 'À': 155087, '¨': 152663, 'ç': 152638, '′': 149080, 'ы': 142476, 'я': 141928, '🙂': 141483, '²': 136941, '©': 132554, 'É': 119967, 'ь': 116641, 'г': 112931, '\x93': 111265, 'з': 110647, 'б': 110050, '×': 109331, 'α': 108837, '\x94': 107482, 'ñ': 106392, 'º': 106276, 'й': 104848, 'ª': 103637, '~': 101901, '›': 100480, 'ل': 97917, '😉': 90561, 'Â': 89647, '\x85': 89577, 'っ': 84191, 'ч': 79399, '″': 78309, 'Г': 77878, 'ã': 76042, '¬': 74738, 'ر': 74519, '\u2028': 73783, 'م': 72543, 'ο': 72028, 'š': 70102, 'č': 68969, 'ê': 66888, 'و': 66824, 'ن': 66626, ',': 65865, 'ë': 64865, 'τ': 62954, '\x80': 62022, 'ı': 60967, 'ι': 60478, 'х': 60151, 'ν': 59454, 'ж': 59129, '的': 59119, 'ε': 58647, 'ï': 58620, '½': 58359, 'ي': 57403, 'ø': 57091, '彩': 56459, 'ت': 55338, 'ô': 54972, 'ā': 54674, '、': 54048, '¹': 53711, '天': 53690, 'å': 52399, 'Ø': 51737, ...
And going forward, other strange characters appear. Since there is the cutoff parameter (set to 1000 by default), I thought of removing all these characters from the vocab by checking the highest frequency reached by non-Italian characters. However, I see that a symbol like © (which I would keep) has a lower frequency than д, so if I remove д I'll also remove © (this is only an example). So my question is: is it better to keep these characters (or emoji) in the vocabulary or not?
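To make the example concrete (the frequencies are taken from the counter above; the threshold is made up):

# Illustrative only: the cutoff drops characters purely by frequency,
# so with any threshold above ©'s count, © disappears before д does.
from collections import Counter

counter = Counter({'a': 4_238_537_950, 'д': 201_734, '©': 132_554, '😉': 90_561})
cutoff = 150_000  # made-up threshold
kept = sorted(ch for ch, freq in counter.items() if freq >= cutoff)
print(kept)  # ['a', 'д'] -- © (and the emoji) fall below the threshold, д survives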
To compare with other models, I loaded the English forward charlm 1billion.pt from stanza_resources and printed its vocabulary, and I see that Chinese and other strange characters occur in the English charlm vocab too.
Generally speaking, it seems to work fine for us without any special effort to remove those characters.
I don't know if it's relevant to the memory issues you're running into, but the change to reduce the amount of memory used when building the vocab is now part of the official 1.2.1 release.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.
Hi @AngledLuffa, I tried training both with the charlm and without it.
Without the charlm I was able to load the model after updating resources.json with the ner field in it.
After retrying with the charlm as well, it gives me an error:
pipeline = stanza.Pipeline("it", processors = "tokenize,ner", dir = "/path/to/stanza_resources")
2021-08-30 20:26:28 WARNING: Language it package default expects mwt, which has been added
2021-08-30 20:26:28 INFO: Loading these models for language: it (Italian):
========================
| Processor | Package |
------------------------
| tokenize | combined |
| mwt | combined |
| ner | wikiner |
========================
2021-08-30 20:26:28 INFO: Use device: cpu
2021-08-30 20:26:28 INFO: Loading: tokenize
2021-08-30 20:26:28 INFO: Loading: mwt
2021-08-30 20:26:28 INFO: Loading: ner
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 308, in _check_seekable
f.seek(f.tell())
AttributeError: 'NoneType' object has no attribute 'seek'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/core.py", line 130, in __init__
use_gpu=self.use_gpu)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/processor.py", line 155, in __init__
self._set_up_model(config, use_gpu)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/ner_processor.py", line 27, in _set_up_model
self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/trainer.py", line 40, in __init__
self.load(model_file, args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/trainer.py", line 131, in load
self.model = NERTagger(self.args, self.vocab)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/model.py", line 38, in __init__
add_unsaved_module('charmodel_forward', CharacterLanguageModel.load(args['charlm_forward_file'], finetune=False))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/common/char_model.py", line 135, in load
state = torch.load(filename, lambda storage, loc: storage)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
with _open_file_like(f, 'rb') as opened_file:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 235, in _open_file_like
return _open_buffer_reader(name_or_buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 220, in __init__
_check_seekable(buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 311, in _check_seekable
raise_err_msg(["seek", "tell"], e)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 304, in raise_err_msg
raise type(e)(msg)
AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
I suppose it is because I have to add the charlms to the Stanza config. I have added the forward and the backward charlms to the Italian folders, but how do I update resources.json?
I see that for English, for instance, it is written like this:
"forward_charlm": {
"1billion": {
"md5": "468b3377455fa0311565d46865f55afb"
},
"mimic": {
"md5": "3d3863e27afd67bf356354a728b0fb76"
},
"pubmed": {
"md5": "3f734806be2c2f62c82fbb01421f78ec"
}
},
What do I have to write in the md5 field for my charlms?
md5 is literally the md5sum of the model file. It's there to make it easier to verify that we have the right download. We actually do have IT charlms publicly available, fwiw. Of course, if your models wind up being better, that would be great.
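For example, one way to compute it (any md5 tool will do; this uses Python's hashlib, and the path is a placeholder):

# Compute the checksum that goes in the "md5" field of resources.json.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5_of("path/to/your_forward_charlm.pt"))  # hypothetical path to your charlm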
OK, perfect. Where can I download these Italian charlms? I thought they didn't exist yet.
OK, assuming that my forward_charlm is called fc.pt, my backward_charlm is called bc.pt and my NER tagger is called ner_tagger.pt, I managed to make it work by setting the new fields ner, forward_charlm and backward_charlm, taking inspiration from the English model:
{
    "it": {
        "forward_charlm": {
            "fc": {
                "md5": "1e078546a7431fb2a2a7296a6b45ff03"
            }
        },
        "backward_charlm": {
            "bc": {
                "md5": "3261357e1d6ef7bbd9230ab1e0a51361"
            }
        },
        "depparse": { ... },
        "lemma": { ... },
        "mwt": { ... },
        "pos": { ... },
        "ner": {
            "ner_tagger": {
                "md5": "8a55a3e9dc1fe4fdb37c655d1f5a4958",
                "dependencies": [
                    {
                        "model": "forward_charlm",
                        "package": "fc"
                    },
                    {
                        "model": "backward_charlm",
                        "package": "bc"
                    }
                ]
            }
        },
        "pretrain": { ... },
        "tokenize": { ... },
        "default_processors": {
            "tokenize": "combined",
            "mwt": "combined",
            "lemma": "combined",
            "pos": "combined",
            "depparse": "combined",
            "ner": "ner_tagger"
        },
        "default_dependencies": {
            "pos": [ ... ],
            "depparse": [ ... ],
            "ner": [
                {
                    "model": "forward_charlm",
                    "package": "fc"
                },
                {
                    "model": "backward_charlm",
                    "package": "bc"
                }
            ]
        },
        "default_md5": "41abf124c0ac3f645c04b66e1a8c049c",
        "lang_name": "Italian"
    },
    "italian": {
        "alias": "it"
    },
    "url": "http://nlp.stanford.edu/software/stanza"
}
It seems to work, but I don't know if I made any mistakes. Is it correct?
Should be fine. One thing to keep in mind is that downloading new models will clobber your resources changes. You can also encode these changes into the way you build the pipeline, though, and that way there won't be a conflict. For example,
Pipeline("it", ner_forward_charlm_path="...", ner_backward_charlm_path="...")
What NER dataset did you use?
@AngledLuffa many thanks for the hints. I was on version 1.2.0 and didn't see the new models. I have now noticed that an Italian NER model exists in 1.2.3 (along with the charlms). I used WikiNER (https://metatext.io/datasets/wikiner) for my training.
It seems to work well, but it makes some errors when words are capitalized incorrectly. In my use case the text can be slightly imperfect (containing capital letters on some common words).
I see that a sentence like:
text = "Pietro Rossi ha seguito i Corsi all'Università di Milano"
has these entities with my model:
[
    {
        "text": "Pietro Rossi",
        "type": "PER",
        "start_char": 0,
        "end_char": 12
    },
    {
        "text": "Corsi",
        "type": "LOC",
        "start_char": 26,
        "end_char": 31
    },
    {
        "text": "Università di Milano",
        "type": "LOC",
        "start_char": 36,
        "end_char": 56
    }
]
The word Corsi should have been corsi, and the model gets confused by that, while I have seen that your model doesn't make this error. I see this behavior repeated for all words beginning with an upper-case letter, so could it be that I trained the model too much and overfit it?
A couple of the features turn on/off checking capitalization. Did you change any of those features when training either the charlm or the NER?
No, I left all the parameters at their defaults. Indeed it seems to recognize capital letters... but almost too well, because nearly every word starting with an upper-case letter is tagged as an entity 😜
Hi, I have to train a new NER model on my own dataset.
I'm following the instructions in https://github.com/stanfordnlp/stanza-train to train a NER model, so first I need to train two charlms:
and then train the NER model:
My question is: have you found good parameters to pass to these scripts? The parameters are described in ner_tagger.py and in charlm.py, for example a good learning rate or options like --charlm, char_lowercase, etc. One very important question: is it recommended to use uncased or cased text for the charlm and the NER tagger? My intuition is that upper case helps to recognize entities, so cased. And I imagine that if the NER dataset is cased, the plain text for the charlm also has to be cased, and vice versa. Am I right?
If there is a standard set of training parameters you have used to train all the Stanza models, it would be good to follow your best practices. Thank you.