segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

EN model training in Google Colab #29

Closed. UncleLiaoNN closed this issue 1 year ago.

UncleLiaoNN commented 3 years ago

Hello, using Google Colab I was able to train a model for the Russian language. But when I start training a model for the English language with trainer.fit(model), it floods the output with hundreds of messages:

[W108] The rule-based lemmatizer did not find POS annotation for the token ')'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Marjorie'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'But'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'and'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Daw'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'It'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token '('. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'made'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.

and the connection to the runtime is lost.

Do you have an idea how I could get rid of these output messages?

Link to Colab: https://colab.research.google.com/drive/1xjrD1ZvkzLypbuYaywkf7yHy9UAjYq-g?usp=sharing (the model is slightly changed to take int8 input)
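For anyone hitting the same flood: the [W108] lines are spaCy warnings raised through Python's warnings module, so a possible stopgap (a sketch, not verified in this Colab) is to filter them out, or to drop the lemmatizer from the pipeline if the labeling step doesn't need token.lemma_:

```python
import warnings

# Option 1: silence just the W108 warning
# (the message argument is a regex matched against the warning's prefix)
warnings.filterwarnings("ignore", message=r"\[W108\]")

# Option 2: exclude the lemmatizer so the warning is never raised
# (only safe if nothing downstream reads token.lemma_)
import spacy
nlp = spacy.load("en_core_web_sm", exclude=["lemmatizer"])
```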

bminixhofer commented 3 years ago

IIRC I got some of these messages too. A quick fix is downgrading to spaCy < 3 (you'll only have to change nlp.add_pipe("sentencizer") to nlp.add_pipe(nlp.create_pipe("sentencizer")) somewhere in labeler.py). I'll see if I can reproduce this.
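A minimal sketch of the two APIs side by side (assuming a blank pipeline; the actual pipeline construction in labeler.py may differ):

```python
# Pin the older release first, e.g.: pip install "spacy<3"
import spacy

nlp = spacy.blank("en")  # placeholder; labeler.py may load a different pipeline

if spacy.__version__.startswith("2."):
    # spaCy 2.x API: create the component object, then add it
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
else:
    # spaCy 3.x API: add_pipe takes the registered component name
    nlp.add_pipe("sentencizer")
```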

UncleLiaoNN commented 3 years ago

@bminixhofer ok, thanks for the workaround, now there are no more annoying messages

but now it says that 1 epoch with the same parameters takes 15 minutes instead of ~2 hours for the original setup :) will see the result soon