segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Language wishlist #11

bminixhofer closed this issue 1 year ago

bminixhofer commented 4 years ago

A list of languages currently considered for training and adding to the Repo:

I'll see if I can train models for languages on this list. If you want to speed it up, just train it yourself following https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb :)

adrien-jacquot commented 4 years ago

Hello! Can you add French to the list?

Thanks :)

SutirthaChakraborty commented 4 years ago

Can you please add Turkish?

bminixhofer commented 4 years ago

Sure!

SutirthaChakraborty commented 4 years ago

Thanks! Will you add it today or later?

bminixhofer commented 4 years ago

I added it to the list. I can't promise when I'll get around to training it.

SutirthaChakraborty commented 4 years ago

I have trained the models with your training code (attached: drive-download-20200912T101135Z-001.zip). Could you take a look and update the repo today, please?

Or can you tell me how I can use the model with this code?

import json
import wave

import vosk
from tqdm import tqdm

# download_model() and get_model_path() are the poster's own helpers (not shown here).
def recognize_speech(wav_path, lang="en", buffer_size=4000):
    download_model(lang)
    vosk.SetLogLevel(-1)

    wav_file = wave.open(wav_path, "rb")
    recognizer = vosk.KaldiRecognizer(
        vosk.Model("{}/{}".format(get_model_path(), lang)),
        wav_file.getframerate())

    # Collect one dict per recognized word, with its start and end time.
    words = []
    for index in tqdm(range(0, wav_file.getnframes(), buffer_size)):
        frames = wav_file.readframes(buffer_size)
        if recognizer.AcceptWaveform(frames):
            result = json.loads(recognizer.Result())
            if len(result["text"]) > 0:
                for token in result["result"]:
                    words.append({
                        "start": token["start"],
                        "end": token["end"],
                        "text": token["word"],
                    })

    print(words)
    return words

senemaktas commented 4 years ago

Hey, good match! I have been looking for Turkish support for a very long time.

bminixhofer commented 4 years ago

Thanks for training the model! Your exported TorchScript model was somehow broken. Which PyTorch version do you use?

I managed to recover it using the weights from the ONNX graph. This should work: torchscript_cpu_model.zip

(rename it from .zip to .pt, I'm too lazy to upload it externally right now)

Load it like this:

import torch
import nnsplit

# Load the recovered TorchScript model and run it on the CPU.
splitter = nnsplit.NNSplit(torch.jit.load("torchscript_cpu_model.pt"), "cpu")
for sentence in splitter.split(["Bu bir cümle Bu ikinci bir cümle."])[0]:
    print(str(sentence))

which prints:

Bu bir cümle
Bu ikinci bir cümle.

(sorry if this is broken Turkish)

I'll properly add it to the repository later!

Also, I'm happy to see some more interest in the library now, I'll move forward with some changes I had thought about (configurable speed / accuracy tradeoff at inference, more robust training).
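For the question earlier in the thread about combining the splitter with the speech recognition code, here is a minimal sketch. It assumes the word dicts have the "start", "end" and "text" keys built by recognize_speech, and that the splitter returns sentences whose whitespace-separated tokens match the recognized words in order. The helper names words_to_text and sentence_timings are made up for this illustration and are not part of nnsplit or vosk.

```python
def words_to_text(words):
    """Join recognized tokens into one unpunctuated string for the splitter."""
    return " ".join(word["text"] for word in words)


def sentence_timings(words, sentences):
    """Attach start/end times to sentences by counting whitespace-separated tokens.

    Assumes the sentences, taken in order, contain exactly the tokens in `words`.
    """
    timings = []
    index = 0
    for sentence in sentences:
        count = len(sentence.split())
        chunk = words[index:index + count]
        if not chunk:
            continue
        timings.append({
            "text": sentence.strip(),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        })
        index += count
    return timings


# Hypothetical recognizer output in the shape built by recognize_speech.
words = [
    {"start": 0.0, "end": 0.3, "text": "bu"},
    {"start": 0.3, "end": 0.6, "text": "bir"},
    {"start": 0.6, "end": 1.0, "text": "cümle"},
    {"start": 1.2, "end": 1.4, "text": "bu"},
    {"start": 1.4, "end": 1.9, "text": "ikinci"},
    {"start": 1.9, "end": 2.1, "text": "bir"},
    {"start": 2.1, "end": 2.6, "text": "cümle"},
]
text = words_to_text(words)

# With nnsplit installed, the sentences would come from the splitter, e.g.
#   sentences = [str(s) for s in splitter.split([text])[0]]
# Here they are hard-coded so the sketch runs without a model.
sentences = ["bu bir cümle ", "bu ikinci bir cümle"]

for item in sentence_timings(words, sentences):
    print(item)
```

Counting tokens keeps the mapping independent of character offsets, but it breaks if the splitter changes or drops tokens, so treat it as a starting point only.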

SutirthaChakraborty commented 4 years ago

Thanks a ton. Great work.

aguang-xyz commented 4 years ago

Could you please also add Simplified Chinese? Thanks a lot.

dmenig commented 3 years ago

Very interested in a French model as well!

bminixhofer commented 3 years ago

I was a bit busy lately. I'm now working on training and evaluating all models currently in the list (Norwegian, French, Swedish, Turkish, Simplified Chinese). Will be done tomorrow.

I also improved the model a bit, it's now faster and more accurate through a downsampling trick (downsample -> LSTM -> upsample) so I'm retraining English and German as well.

bminixhofer commented 3 years ago

I trained all the models and released them as Release 0.5.0.

You can now do:

import nnsplit

print(nnsplit.__version__) # should be 0.5.0-post0

# english
nnsplit.NNSplit.load("en")
# german
nnsplit.NNSplit.load("de")
# turkish
nnsplit.NNSplit.load("tr")
# french
nnsplit.NNSplit.load("fr")
# norwegian
nnsplit.NNSplit.load("no")
# swedish
nnsplit.NNSplit.load("sv")
# chinese
nnsplit.NNSplit.load("zh")

Training went well, metrics are in the README. I'll have to retrain the chinese model though: Chinese punctuation (e. g. ) is not in string.punctuation so it wasn't getting removed.
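The string.punctuation point can be checked directly. This is a small sketch, not the training code: the full-width marks below are chosen here for illustration (the character in the original comment was lost), and strip_punctuation is a hypothetical helper.

```python
import string

# Full-width marks picked for illustration; not taken from the thread.
chinese_marks = "。！？，"

# string.punctuation only covers ASCII punctuation, so each check prints False.
for mark in chinese_marks:
    print(mark, mark in string.punctuation)


def strip_punctuation(text, punctuation):
    """Remove every character of `text` that appears in `punctuation`."""
    return "".join(ch for ch in text if ch not in punctuation)


sample = "你好。Hello."
ascii_only = strip_punctuation(sample, string.punctuation)
extended = strip_punctuation(sample, string.punctuation + chinese_marks)
print(ascii_only)
print(extended)
```

With only string.punctuation the Chinese full stop survives, which matches the behaviour described above. Extending the punctuation set removes it.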

Also, as I mentioned, I made some improvements to the model architecture so it's quite a bit more accurate now.

There's also now #20 as a tracking issue for problems with these models.

EmilStenstrom commented 3 years ago

@bminixhofer Awesome! This is super helpful, thank you for putting effort into helping random people on the internet! :D

bminixhofer commented 3 years ago

You're welcome!

bminixhofer commented 3 years ago

@aguang-xyz As of 0.5.2 I retrained the Chinese model with fixed punctuation removal. It should now work properly for text without punctuation. Metrics are still not very good but consistently better than Spacy.

marlon-br commented 3 years ago

Hi, could you please add Russian?

egorsmkv commented 3 years ago

And for Ukrainian if this is not so hard. Anyway, I will try to build a model by myself using information from the notebook.

bminixhofer commented 3 years ago

Hi, sure!

I added them to the list. I'll give training them a go as well starting with Russian.

bminixhofer commented 3 years ago

@egorsmkv I noticed there is some code hardcoded to use a compound splitter at the moment, it needs some small changes in model.py to remove that. I'll fix it so you can train a model.

bminixhofer commented 3 years ago

The train.ipynb notebook is up to date now and the compound splitter issue fixed so training a model should work now.

I trained a model for Russian already and it looks good, I'll train another one for a bit longer and a Ukrainian model over night.

bminixhofer commented 3 years ago

Russian and Ukrainian are now trained & integrated in the Repo. Would be great if you could do a quick sanity check @egorsmkv @marlon-br i.e. check if they split text without errors correctly, don't split on abbreviations and split text with some missing punctuation and case correctly using the demo:

https://bminixhofer.github.io/nnsplit/#demo

since I speak neither of these languages. Metrics also look good:

https://bminixhofer.github.io/nnsplit/#metrics

marlon-br commented 3 years ago

@bminixhofer I tried the Russian sample text. After I removed one comma (between воплощение and построенная), it started to split the sentence into two sentences. In this case the second sentence doesn't make sense, because it depends entirely on the first part of the original sentence.

bminixhofer commented 3 years ago

Thanks for checking! There seems to have been a problem with commas (,) also being removed as punctuation during training. This might also impact some other languages, so I'm retraining the affected models.

marlon-br commented 3 years ago

@bminixhofer I think it would be a great synergy if you added all languages that are supported by vosk: https://alphacephei.com/vosk/models. A lot of people would be very interested in getting sentence boundaries etc. for texts from ASR.

egorsmkv commented 3 years ago

Just tested with Ukrainian sentences, looks really good! Thank you, Ben!

bminixhofer commented 3 years ago

Great. I fixed the punctuation issue and I'm retraining the models for 10 epochs to match the other models. The release will be on Monday at the latest.

marlon-br commented 3 years ago

@bminixhofer Did you change something in the code or training script to fix the punctuation? I started to train a Russian model on my own and would like to understand if I should fix something too.

bminixhofer commented 3 years ago

Yes, I just pushed the commit. The nnsplit training procedure is just:

The problem was that in the step "with some probability, remove punctuation at the end of the sentences", all chars in string.punctuation were considered punctuation. Now, this is configurable with an argument to the SpacySentenceTokenizer. I use .?! for Russian.

The underlying problem is that SpaCy makes some mistakes e. g. splits after a comma in some cases. This is not solved by the update but not removing commas should be an improvement.
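The effect of the configurable punctuation set can be sketched as follows. This is not the nnsplit training code: the function name, its parameters and the example fragment are assumptions made for illustration, and the actual argument name on SpacySentenceTokenizer is not given in this thread.

```python
import random
import string


def maybe_strip_end_punctuation(sentence, punctuation=".?!", probability=0.5, rng=random):
    """With the given probability, drop trailing characters that are in `punctuation`.

    Hypothetical helper: a stand-in for the "remove punctuation at the end of
    the sentences" training step, with the punctuation set made configurable.
    """
    stripped = sentence.rstrip()
    if stripped and stripped[-1] in punctuation and rng.random() < probability:
        return stripped.rstrip(punctuation)
    return sentence


# A fragment ending in a comma, as produced when the sentencizer wrongly
# splits after a comma.
fragment = "Это воплощение,"

# Old behaviour: every char in string.punctuation counts, so the comma is removed.
print(maybe_strip_end_punctuation(fragment, string.punctuation, probability=1.0))

# New behaviour with ".?!": the comma stays, so a wrong split is not reinforced.
print(maybe_strip_end_punctuation(fragment, ".?!", probability=1.0))
```

Keeping the comma does not undo the wrong split itself, which is why the update is described as an improvement and not a fix.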

marlon-br commented 3 years ago

@bminixhofer I see, thanks!

How long does it generally take to train for 10 epochs?

bminixhofer commented 3 years ago

The bottleneck is often the SpaCy sentencizer. On my machine with an RTX 2080TI it takes ~ 2 hours for Ukrainian and ~ 10 hours for Russian.

Also, I should've said train with 10M samples. One epoch in the train.ipynb is set to use 500k samples and set to 1M by default in the Python scripts. SpaCy leaks memory when running in parallel across multiple cores. This is reset after each epoch. So you have to set the samples per epoch to something you don't run out of memory with :)
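The relation between the 10M sample target and the per-epoch settings is simple arithmetic. The helper below is a hypothetical illustration, not part of the training scripts.

```python
def epochs_needed(total_samples, samples_per_epoch):
    """Number of epochs needed to see at least `total_samples` samples."""
    # Integer ceiling division, so a partial final epoch still counts.
    return -(-total_samples // samples_per_epoch)


TARGET = 10_000_000  # 10M samples, as mentioned above

notebook_epochs = epochs_needed(TARGET, 500_000)   # train.ipynb setting
script_epochs = epochs_needed(TARGET, 1_000_000)   # Python script default
print(notebook_epochs, script_epochs)
```

A smaller samples-per-epoch value only changes how often the leaked memory is reset. The total amount of training data stays the same.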

marlon-br commented 3 years ago

@bminixhofer I ran the Russian model training on Google Colab Pro with a V100 and it takes 2 hours for one epoch, so I expect about 20 hours for 10 epochs. That is twice as long as on your setup. But a V100 is faster than a 2080, so I wonder why it takes longer.

bminixhofer commented 3 years ago

It could very well be CPU bound. I use an i5 8600k.

The bottleneck is often the SpaCy sentencizer.

bminixhofer commented 3 years ago

Ukrainian and Russian updated models are now in the Repo. Both are significantly better now but in Russian there is still the same issue with the comma in the example text, I don't think there is anything I can do about that.

@marlon-br If you're training models you might be interested in https://wandb.ai/bminixhofer/nnsplit where I track the experiments e. g. https://wandb.ai/bminixhofer/nnsplit/runs/3poigs9a is the latest Russian run.

marlon-br commented 3 years ago

@bminixhofer AFAIK commas in Russian (and similar languages) are tricky even for native speakers. They carry much more importance and meaning than in other languages :)

marlon-br commented 3 years ago

It would be great if you could add the following languages: Catalan, Dutch, Farsi, Italian, Portuguese, Spanish and Vietnamese. Thanks in advance!

bminixhofer commented 3 years ago

Sure, I added them to the list. I am currently focusing on nlprule so this may take some time, I appreciate PRs :) Ideally models should be trained on 10M samples but less is ok too.

marlon-br commented 3 years ago

@bminixhofer nlprule looks very interesting because, you know, I use sentence splitting after ASR. Since ASR is not perfect and spoken language differs from the language of Wikipedia, sentence boundary detection is also not perfect. For example, the text after ASR looks like this: "hey guys i'm gabby wallace and this is a go natural english lesson i got a great question from a viewer about pronunciation you know one of the most difficult sounds an english but also one of the most common sounds is that are sound and i love teaching the sound because it kind of sounds funny i was think it sounds like a pirate right or can you imagine me with a little pirate high in a hook yeah or maybe well that's exactly what it is it's a pirate sound that's what i call it anyway so we're we're to work on our pirates sounds today one particular word the question that my view or us was how do you say gee i r l girl girl is a really common word right woman girl girls a young woman okay so this is a very common word we need to know how to say it especially if you are a girl you need to be able to say i'm a girl or hey girls only girls club i don't know when i was a teenager or not a teenager maybe more a kid we used to have girls only clubs okay anyway i'm getting off the point year let's talk about pronunciation"

I think nlprule could improve the results a bit :)

conanchen commented 2 years ago

Could you also add Traditional Chinese, in addition to Simplified Chinese?

bminixhofer commented 2 years ago

Hi, feel free to keep requests coming here (so I know what to prioritize when I circle back to this library). However, I am currently not training any new models. You can train models yourself here: https://github.com/bminixhofer/nnsplit/blob/main/train/train.ipynb.

sabilmakbar commented 1 year ago

Hi, I'd like to request support for Indonesian in this model (the nearest supported language to Indonesian is English, but the results are quite unsatisfactory for now). While I'll try to train the model myself (following your last comment), I'd like to keep a language request open for Indonesian (ISO 639-1 code: "id") until I have the time to replicate the training process and make a PR if the result is decent.

Thanks!

bminixhofer commented 1 year ago

Indonesian is now supported! Along with 84 other languages (all languages mentioned in this thread unless I missed anything).

I am not looking to expand language support further for now, so I am closing this issue.