stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Hints for new NER training #717

Closed paulthemagno closed 3 years ago

paulthemagno commented 3 years ago

Hi, I need to train a new NER model on my own dataset.

I'm following the instructions at https://github.com/stanfordnlp/stanza-train to train a NER model, so first I train the 2 charlms:

bash scripts/run_charlm.sh English-TEST forward --epochs 2 --cutoff 0 --batch_size 2
bash scripts/run_charlm.sh English-TEST backward --epochs 2 --cutoff 0 --batch_size 2

and then train the NER model:

bash scripts/run_ner.sh English-TEST --max_steps 500 --word_emb_dim 5 --charlm --charlm_shorthand en_test --char_hidden_dim 1024

My question is: have you found, through experimentation, good parameters to pass to these scripts? The parameters are described in ner_tagger.py and in charlm.py. For example, a good learning rate, or flags like --charlm, --char_lowercase, etc.

# in ner_tagger.py
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--data_dir', type=str, default='data/ner', help='Root dir for saving models.')
parser.add_argument('--wordvec_dir', type=str, default='extern_data/word2vec', help='Directory of word vectors')
parser.add_argument('--wordvec_file', type=str, default='', help='File that contains word vectors')
parser.add_argument('--wordvec_pretrain_file', type=str, default=None, help='Exact name of the pretrain file to read')
parser.add_argument('--train_file', type=str, default=None, help='Input file for data loader.')
parser.add_argument('--eval_file', type=str, default=None, help='Input file for data loader.')

parser.add_argument('--mode', default='train', choices=['train', 'predict'])
parser.add_argument('--finetune', action='store_true', help='Load existing model during `train` mode from `save_dir` path')
parser.add_argument('--train_classifier_only', action='store_true',
                    help='In case of applying Transfer-learning approach and training only the classifier layer this will freeze gradient propagation for all other layers.')
parser.add_argument('--lang', type=str, help='Language')
parser.add_argument('--shorthand', type=str, help="Treebank shorthand")

parser.add_argument('--hidden_dim', type=int, default=256)
parser.add_argument('--char_hidden_dim', type=int, default=100)
parser.add_argument('--word_emb_dim', type=int, default=100)
parser.add_argument('--char_emb_dim', type=int, default=100)
parser.add_argument('--num_layers', type=int, default=1)
parser.add_argument('--char_num_layers', type=int, default=1)
parser.add_argument('--pretrain_max_vocab', type=int, default=100000)
parser.add_argument('--word_dropout', type=float, default=0)
parser.add_argument('--locked_dropout', type=float, default=0.0)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--rec_dropout', type=float, default=0, help="Word recurrent dropout")
parser.add_argument('--char_rec_dropout', type=float, default=0, help="Character recurrent dropout")
parser.add_argument('--char_dropout', type=float, default=0, help="Character-level language model dropout")
parser.add_argument('--no_char', dest='char', action='store_false', help="Turn off training a character model.")
parser.add_argument('--charlm', action='store_true', help="Turn on contextualized char embedding using pretrained character-level language model.")
parser.add_argument('--charlm_save_dir', type=str, default='saved_models/charlm', help="Root dir for pretrained character-level language model.")
parser.add_argument('--charlm_shorthand', type=str, default=None, help="Shorthand for character-level language model training corpus.")
parser.add_argument('--char_lowercase', dest='char_lowercase', action='store_true', help="Use lowercased characters in character model.")
parser.add_argument('--no_lowercase', dest='lowercase', action='store_false', help="Use cased word vectors.")
parser.add_argument('--no_emb_finetune', dest='emb_finetune', action='store_false', help="Turn off finetuning of the embedding matrix.")
parser.add_argument('--no_input_transform', dest='input_transform', action='store_false', help="Do not use input transformation layer before tagger lstm.")
parser.add_argument('--scheme', type=str, default='bioes', help="The tagging scheme to use: bio or bioes.")

parser.add_argument('--sample_train', type=float, default=1.0, help='Subsample training data.')
parser.add_argument('--optim', type=str, default='sgd', help='sgd, adagrad, adam or adamax.')
parser.add_argument('--lr', type=float, default=0.1, help='Learning rate.')
parser.add_argument('--min_lr', type=float, default=1e-4, help='Minimum learning rate to stop training.')
parser.add_argument('--momentum', type=float, default=0, help='Momentum for SGD.')
parser.add_argument('--lr_decay', type=float, default=0.5, help="LR decay rate.")
parser.add_argument('--patience', type=int, default=3, help="Patience for LR decay.")

parser.add_argument('--max_steps', type=int, default=200000)
parser.add_argument('--eval_interval', type=int, default=500)
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--max_grad_norm', type=float, default=5.0, help='Gradient clipping.')
parser.add_argument('--log_step', type=int, default=20, help='Print log every k steps.')
parser.add_argument('--save_dir', type=str, default='saved_models/ner', help='Root dir for saving models.')
parser.add_argument('--save_name', type=str, default=None, help="File name to save the model")

parser.add_argument('--seed', type=int, default=1234)
parser.add_argument('--cuda', type=bool, default=torch.cuda.is_available())
parser.add_argument('--cpu', action='store_true', help='Ignore CUDA.')

One very important question: is it recommended to use uncased or cased text for the charlm and the NER tagger? My intuition is that uppercase helps recognize entities, so cased. And I imagine that if the NER dataset is cased, the plain text for the charlm also has to be cased, and vice versa. Am I right?

If there is a standard set of training parameters you used to train all the Stanza models, it would be good to follow your best practices. Thank you.

AngledLuffa commented 3 years ago

By default the NER model will use cased information for the charlm (as per your intuition) and use lowercasing for the embedding, since embeddings tend to not have case information.

BUT, unless there is a compelling reason why the charlm we provide isn't sufficient, you probably just want to download the default EN models and reuse that charlm.

I suppose the run_charlm.sh script is fine, but we also left a bunch of notes on how to run the python tool directly. The short summary is that the defaults generally work, but it can be helpful to get more frequent dev scores like this:

--eval_steps 100000

and sometimes the learning goes a bit haywire at the default learning rate. If you observe the dev scores spiking up, or even worse, going to NaN, you can lower the learning rate with

--lr0 10 (or even --lr0 5)

paulthemagno commented 3 years ago

Ok @AngledLuffa, thanks. So you suggest not touching the other default parameters in charlm.py and ner_tagger.py, right?

About the charlm: I'm looking for an Italian one. I haven't seen any charlm for that language in Stanza, so I wanted to take some plain text and train my own. If an Italian charlm already exists, that would be great!

paulthemagno commented 3 years ago

@AngledLuffa another question regarding the training of the charlm. I'm trying to launch charlm.sh in a Docker container with 16 GB of RAM allocated, and I'm seeing the script get killed at this line:

lines = open(path).readlines() # reserve '\n'
data = [list(line) for line in lines]  # here the script is killed after allocating ~30% of the lines
vocab = CharVocab(data, cutoff=cutoff)

I imagine it's because building the list of characters takes too much memory. Have you found any workaround for this problem? A sketch of the kind of lower-memory approach I had in mind is below.
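
Something like this (only a sketch with a made-up path, not the actual charlm.py code): accumulate a character Counter line by line instead of materializing every line as a list of characters.

from collections import Counter

# Sketch only: count characters line by line instead of keeping every line
# as a list of characters in memory. The path is made up.
counter = Counter()
with open("data/charlm/it_wiki_train.txt") as fin:
    for line in fin:
        counter.update(line)  # a string is already an iterable of characters

# keep only characters above a frequency cutoff, e.g. 1000
vocab_chars = sorted(ch for ch, freq in counter.items() if freq >= 1000)
print(len(vocab_chars), "characters kept")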

AngledLuffa commented 3 years ago

Correct, we currently don't have an Italian charlm or even an Italian NER model. It's on the list of things to do, though.

More memory or less data? I can take a look to see if it's possible to allocate less memory.

AngledLuffa commented 3 years ago

Actually yes! There is already a trick which should avoid this. Put the training file in a directory by itself and pass in the directory name, not the file name.

We'll change it so such hackery is not needed in the next version.
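
For example (a minimal sketch; the paths are made up):

from pathlib import Path
import shutil

# Made-up paths -- adapt them to your own data layout.
train_file = Path("data/charlm/it_wiki_train.txt")
train_dir = Path("data/charlm/it_wiki_train")   # directory that will contain ONLY the training file
train_dir.mkdir(parents=True, exist_ok=True)
shutil.move(str(train_file), str(train_dir / train_file.name))
# then pass train_dir (the directory, not the file) as the training path to the charlm script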

AngledLuffa commented 3 years ago

Better yet, if you want to be a guinea pig, you can look at the next version now:

https://test.pypi.org/project/stanza/1.2.1rc0/

paulthemagno commented 3 years ago

Wow! What's new in this version? Anyway, thanks @AngledLuffa, I'll try this method and see if it works!

AngledLuffa commented 3 years ago

Well, we retrained all the models on UD 2.8, which included a bunch of data fixes and a bunch of improvements in the way we read the data. There are various other bug fixes and some efficiency improvements in the pipeline. Also, I just added a fix which should make the single-data-file and multiple-data-file cases use the same code path in the charlm :)

AngledLuffa commented 3 years ago

Is this working for you? If not, we can try to work through it. Also, we'd definitely be interested in hearing about viable Italian NER datasets which we can use to train and distribute models.

paulthemagno commented 3 years ago

I'm making progress. Currently I'm fighting RAM problems (the plain texts for charlm training are too large), but I'll increase my RAM capacity. Next I'll move on to NER.

One remark that comes to mind: I was looking at the vocabulary that charlm.py builds and I noticed that many datasets can contain strange characters. For example, I have some Italian texts that contain Chinese, Arabic, etc. characters (probably left over from scraping or other unknown sources), obviously with a lower frequency than alphanumeric characters. Not only that: I also found hexadecimal characters and so on. Below is a possible character frequency for an Italian text, obtained by printing the counter inside build_vocab like this:

def build_vocab(path, cutoff=0):
    # Requires a large amount of memory, but only need to build once

    # here we need some trick to deal with excessively large files
    # for each file we accumulate the counter of characters, and
    # at the end we simply pass a list of chars to the vocab builder
    counter = Counter()
    if os.path.isdir(path):
        filenames = sorted(os.listdir(path))
    else:
        filenames = [path]
    for filename in filenames:
        lines = readlines(path + '/' + filename)
        for line in lines:
            counter.update(list(line))
    print(counter)  # print added by me
    # remove infrequent characters from vocab
    for k in list(counter.keys()):
        if counter[k] < cutoff:
            del counter[k]
    # a singleton list of all characters
    data = [sorted([x[0] for x in counter.most_common()])]
    vocab = CharVocab(data) # skip cutoff argument because this has been dealt with
    return vocab

and the most frequent characters are these:

Counter({' ': 7681209538, 'i': 4551454294, 'e': 4549456069, 'a': 4238537950, 'o': 3673461662, 'n': 2866799737, 't': 2731902667, 'r': 2600313164, 'l': 2411084490, 's': 1956850370, 'c': 1681083743, 'd': 1452379257, 'u': 1160736023, 'p': 1116443657, 'm': 1030144651, 'g': 682786699, 'v': 586241491, ',': 457130661, 'z': 454814799, 'f': 403655586, 'h': 384044936, '.': 377223176, 'b': 353087237, '\n': 147245274, 'q': 143800889, 'S': 118540273, 'I': 118217353, 'C': 116475805, 'A': 111440149, '’': 104395912, '0': 95776805, 'à': 95516545, 'è': 91654978, 'P': 91392963, '1': 90574488, 'L': 89241371, "'": 76209116, 'M': 76024315, '2': 71550984, 'E': 67419510, 'T': 60758504, 'R': 60336513, 'D': 59810296, 'N': 56366787, ':': 51479131, ')': 49107911, 'G': 48166272, '-': 47093491, '(': 46722880, 'O': 45860493, 'B': 42938064, 'F': 40740813, 'V': 34614337, '3': 34541965, 'k': 33717787, 'ù': 33647590, '5': 32703922, 'U': 32202105, 'y': 30960743, '4': 28286291, '9': 27387565, '"': 26535985, '8': 24948190, 'w': 24881007, '6': 23979966, '“': 22882357, '”': 22372757, 'ò': 22334291, '7': 22194867, '/': 19307840, '!': 17459881, 'x': 17019626, ';': 16387798, 'H': 15506209, 'Q': 15326138, 'é': 15280821, '?': 14966577, 'ì': 13754672, '–': 10816460, 'W': 9019297, '…': 8292947, 'K': 7256676, 'Z': 6658745, '%': 6545873, 'J': 6418997, '»': 5384401, '«': 4861160, 'X': 4666950, 'j': 4635443, ']': 4453657, '[': 4432564, '°': 3929110, 'È': 3341311, 'Y': 3147401, '|': 3134096, '&': 2604805, '\t': 2576813, '#': 2359625, '_': 2172140, '+': 2052629, '*': 1895329, '‘': 1878549, '>': 1762312, '=': 1716578, '�': 1638962, '€': 1466127, '@': 960594, '\xad': 920302, '·': 812468, 'о': 721241, '}': 717992, '{': 717357, '<': 709897, '•': 676679, '\x92': 650005, 'а': 622357, 'е': 586218, 'и': 571733, '\\': 535236, 'á': 534133, 'н': 475810, '^': 473781, '—': 464684, 'т': 451349, '̀': 450494, '´': 446731, 'Ã': 419609, '`': 394870, 'р': 390003, 'с': 377038, '$': 363224, '®': 355264, 'í': 347783, '→': 325266, 'ó': 318881, 'в': 305175, 'ü': 288814, 'л': 287508, 'к': 270952, 'â': 254923, '\u200b': 244063, 'ö': 221242, 'м': 217895, 'д': 201734, 'ú': 189713, 'п': 182577, 'ä': 181229, 'ا': 177980, 'у': 170989, '™': 162521, '\ufeff': 156089, 'À': 155087, '¨': 152663, 'ç': 152638, '′': 149080, 'ы': 142476, 'я': 141928, '🙂': 141483, '²': 136941, '©': 132554, 'É': 119967, 'ь': 116641, 'г': 112931, '\x93': 111265, 'з': 110647, 'б': 110050, '×': 109331, 'α': 108837, '\x94': 107482, 'ñ': 106392, 'º': 106276, 'й': 104848, 'ª': 103637, '~': 101901, '›': 100480, 'ل': 97917, '😉': 90561, 'Â': 89647, '\x85': 89577, 'っ': 84191, 'ч': 79399, '″': 78309, 'Г': 77878, 'ã': 76042, '¬': 74738, 'ر': 74519, '\u2028': 73783, 'م': 72543, 'ο': 72028, 'š': 70102, 'č': 68969, 'ê': 66888, 'و': 66824, 'ن': 66626, ',': 65865, 'ë': 64865, 'τ': 62954, '\x80': 62022, 'ı': 60967, 'ι': 60478, 'х': 60151, 'ν': 59454, 'ж': 59129, '的': 59119, 'ε': 58647, 'ï': 58620, '½': 58359, 'ي': 57403, 'ø': 57091, '彩': 56459, 'ت': 55338, 'ô': 54972, 'ā': 54674, '、': 54048, '¹': 53711, '天': 53690, 'å': 52399, 'Ø': 51737, ...

And further down the list, other strange characters appear. Since there is the cutoff parameter (set to 1000 by default), I thought about removing all these characters from the vocab by checking the highest frequency among the non-Italian characters. However, I see that a symbol like © (which I would keep) has a lower frequency than д, so if I cut off д I would remove © too (this is only an example); a sketch of what I mean is below. So my question is: is it better to keep these characters (or emoji) in the vocabulary or not?
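
A hypothetical sketch of what I mean, combining the frequency cutoff with an explicit keep-list (the threshold and the keep-list are made up, and this is not part of charlm.py):

from collections import Counter

# Tiny made-up excerpt of the counter above, just to illustrate the problem
counter = Counter({'a': 4238537950, 'д': 201734, '©': 132554})

cutoff = 1_000_000        # hypothetical threshold that would drop both д and ©
always_keep = {'©'}       # explicit keep-list for symbols I'd like to preserve anyway

kept = sorted(ch for ch, freq in counter.items() if freq >= cutoff or ch in always_keep)
print(kept)  # ['a', '©'] -- д is dropped, © survives only because of the keep-list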

To compare with other models, I loaded the English forward charlm 1billion.pt from stanza_resources and printed its vocabulary, and I saw that Chinese and other strange characters occur in the English charlm vocab too.

AngledLuffa commented 3 years ago

Generally speaking, it seems to work fine for us without any special effort to remove those characters.

I don't know if it's relevant to the memory issues you're running into, but the change to reduce the amount of memory used when building the vocab is now part of the official 1.2.1 release.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically closed due to inactivity.

paulthemagno commented 3 years ago

Hi @AngledLuffa, I tried training both with the charlm and without it.

Without the charlm I was able to load the model after updating resources.json with the ner field in it.

After retrying with the charlm as well, I get an error:

pipeline = stanza.Pipeline("it", processors = "tokenize,ner", dir = "/path/to/stanza_resources")
2021-08-30 20:26:28 WARNING: Language it package default expects mwt, which has been added
2021-08-30 20:26:28 INFO: Loading these models for language: it (Italian):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| mwt       | combined |
| ner       | wikiner  |
========================

2021-08-30 20:26:28 INFO: Use device: cpu
2021-08-30 20:26:28 INFO: Loading: tokenize
2021-08-30 20:26:28 INFO: Loading: mwt
2021-08-30 20:26:28 INFO: Loading: ner
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 308, in _check_seekable
    f.seek(f.tell())
AttributeError: 'NoneType' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/core.py", line 130, in __init__
    use_gpu=self.use_gpu)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/processor.py", line 155, in __init__
    self._set_up_model(config, use_gpu)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/pipeline/ner_processor.py", line 27, in _set_up_model
    self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/trainer.py", line 40, in __init__
    self.load(model_file, args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/trainer.py", line 131, in load
    self.model = NERTagger(self.args, self.vocab)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/ner/model.py", line 38, in __init__
    add_unsaved_module('charmodel_forward', CharacterLanguageModel.load(args['charlm_forward_file'], finetune=False))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/stanza/models/common/char_model.py", line 135, in load
    state = torch.load(filename, lambda storage, loc: storage)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

I suppose it's because I have to add the charlms to the Stanza config. I have already added the forward and backward charlms to the Italian folders, but how do I update resources.json?

I see that for English, for instance, it is written this way:

"forward_charlm": {
      "1billion": {
        "md5": "468b3377455fa0311565d46865f55afb"
      },
      "mimic": {
        "md5": "3d3863e27afd67bf356354a728b0fb76"
      },
      "pubmed": {
        "md5": "3f734806be2c2f62c82fbb01421f78ec"
      }
    },

What do I have to write in the md5 field for my charlms?

AngledLuffa commented 3 years ago

md5 is literally the md5sum of the model files. It's there to make it easier to verify that we have the right download. We actually do have IT charlms publicly available, fwiw. Of course, if your models wind up being better, that would be great.
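
For example, a quick way to compute that value (a minimal sketch; the file paths are just examples and may differ from yours):

import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example paths -- point these at your own saved charlm / NER model files.
print(file_md5("saved_models/charlm/fc.pt"))
print(file_md5("saved_models/ner/ner_tagger.pt"))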

paulthemagno commented 3 years ago

Ok, perfect. Where can I download these Italian charlms? I thought they didn't exist yet.

Anyway, assuming that my forward_charlm is called fc.pt, my backward_charlm is called bc.pt, and my NER tagger is called ner_tagger.pt, I managed to make it work by setting the new fields ner, forward_charlm, and backward_charlm, taking inspiration from the English model:

{
  "it": {
    "forward_charlm": {
      "fc": {
        "md5": "1e078546a7431fb2a2a7296a6b45ff03"
      }
    },
    "backward_charlm": {
      "bc": {
        "md5": "3261357e1d6ef7bbd9230ab1e0a51361"
      }
    },
    "depparse": { ... },
    "lemma": { ... },
    "mwt": { ... },
    "pos": { .. },
    "ner": {
      "ner_tagger": {
        "md5": "8a55a3e9dc1fe4fdb37c655d1f5a4958",
        "dependencies": [
          {
            "model": "forward_charlm",
            "package": "fc"
          },
          {
            "model": "backward_charlm",
            "package": "bc"
          }
        ]
      }
    },
    "pretrain": { ... },
    "tokenize": {...},
    "default_processors": {
      "tokenize": "combined",
      "mwt": "combined",
      "lemma": "combined",
      "pos": "combined",
      "depparse": "combined",
      "ner": "ner_tagger"
    },
    "default_dependencies": {
      "pos": [ ... ],
      "depparse": [ ... ],
      "ner": [
        {
          "model": "forward_charlm",
          "package": "fc"
        },
        {
          "model": "backward_charlm",
          "package": "bc"
        }
      ]
    },
    "default_md5": "41abf124c0ac3f645c04b66e1a8c049c",
    "lang_name": "Italian"
  },
"italian": {
  "alias": "it"
},
"url": "http://nlp.stanford.edu/software/stanza"
}

It seems to work, but I don't know if I made any mistakes. Is it correct?

AngledLuffa commented 3 years ago

Should be fine. One thing to keep in mind is that downloading new models will clobber your resources changes. You can also encode these changes into the way you build the pipeline, though, and that way there won't be a conflict. For example,

Pipeline("it", ner_forward_charlm_path="...", ner_backward_charlm_path="...")

What NER dataset did you use?

paulthemagno commented 3 years ago

@AngledLuffa many thanks for the hints. I was on version 1.2.0 and didn't see the new models. I have now noticed that an Italian NER model exists in 1.2.3 (along with the charlm). I used WikiNER https://metatext.io/datasets/wikiner for my training.

It seems to work well, but it makes some errors when words have the wrong capitalization. In my use case the text can be slightly imperfect (with capital letters on some common words).

I see that a sentence like:

text = "Pietro Rossi ha seguito i Corsi all'Università di Milano"

has these entities with my model:

[{
   "text": "Pietro Rossi",
   "type": "PER",
   "start_char": 0,
   "end_char": 12
 },
 {
   "text": "Corsi",
   "type": "LOC",
   "start_char": 26,
   "end_char": 31
 },
 {
   "text": "Università di Milano",
   "type": "LOC",
   "start_char": 36,
   "end_char": 56
 }]

The word Corsi should have been corsi, and my model gets confused by that, while I have seen that your model doesn't make this error. This behavior is repeated for all words beginning with upper case, so could it be that I trained the model too much and overfit it?

AngledLuffa commented 3 years ago

A couple of the features turn on/off checking capitalization. Did you change any of those features when training either the charlm or the NER?

paulthemagno commented 3 years ago

A couple of the features turn on/off checking capitalization. Did you change any of those features when training either the charlm or the NER?

No, I left all the parameters at their defaults. Indeed it seems to recognize capital letters... but almost too well, because nearly every word starting with an upper-case letter is tagged as an entity 😜