segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

More language support. #5

Closed aguang-xyz closed 4 years ago

aguang-xyz commented 4 years ago

Hi, many thanks for your project.

In the README, it says:

> Alternatively, you can also load your own model.

Where can I find models for languages other than English and German? Or could you tell me, step by step, how to train my own model for other languages? I'm happy to contribute by providing more models.

Thank you, Guangrui Wang

bminixhofer commented 4 years ago

Hi! Thanks for your interest!

So first off, I've been working on a major rewrite, so not much of the README is up to date with the current code in the repository - apologies for that.

NNSplit needs some "teacher" tokenizer to learn from. The latest release uses SoMaJo, which only supports English and German. I have switched this to spaCy, so NNSplit can now work with any language spaCy supports.
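To make the idea concrete, here's a rough sketch (not the actual training code in this repo, and using spaCy v3's `add_pipe` syntax and French purely as an example) of how a spaCy pipeline can act as the teacher by labelling sentence boundaries in raw text:

```python
# Rough sketch (not the repo's actual training script) of using spaCy as
# a "teacher": it labels sentence boundaries in raw text, and a model
# like NNSplit can then be trained to predict those boundaries even once
# punctuation/casing cues are removed.
import spacy

nlp = spacy.blank("fr")       # any language spaCy has a blank pipeline for
nlp.add_pipe("sentencizer")   # rule-based sentence boundary detection

text = "Bonjour tout le monde. Comment ça va ? Très bien, merci."
doc = nlp(text)

# Per-character labels: 1 at the last character of each sentence, 0 elsewhere.
labels = [0] * len(text)
for sent in doc.sents:
    labels[sent.end_char - 1] = 1

print([sent.text for sent in doc.sents])
print(labels)
```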

The problem is that it isn't released yet and there are still some things missing (e.g. proper evaluation).

If you want to try it now, I can do a prerelease and polish some of the training scripts so you can train your own model. There are some possibly unstable, not properly tested parts though, and docs are mostly missing.

Do you plan to use NNSplit from Rust, Python, or JavaScript?

Alternatively you can wait a couple of weeks and I'll try to finish up the rewrite :)

aguang-xyz commented 4 years ago

Hi! Thanks for your reply.

The reason I'm interested in your project is that I'm working on an automatic captioning program. Briefly: after speech recognition, I have a sequence of words without punctuation, and your library is exactly what I need to split those words into sentences. I've already integrated your library to support English captions, so if more languages are supported in nnsplit, my program can easily support more languages too.

My program is written in Python, so I would really appreciate it if you could publish a prerelease on PyPI.

I also have a little experience with PyTorch, so I'm happy to learn from your code if you can share your rewrite branch :)

bminixhofer commented 4 years ago

Ok, good to hear! I'm working on it. Will probably take some time though because I'm quite busy at the moment.

adrien-jacquot commented 4 years ago

Hello! Just to mention that I have a similar interest: a French model (I'll be using NNSplit from Python). Looking forward to the new release :)

bminixhofer commented 4 years ago

I finally made some good progress on this!

https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb is a Colab notebook which walks through training a custom model, installing the latest nightly NNSplit version, and using it for inference from Python.

The nightly version is reasonably stable, and I don't expect the public API to change anymore. It is also 10-20 times faster than the previous version thanks to the new Rust backend.
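For reference, inference from Python boils down to roughly this (a sketch following the pattern in the notebook; the notebook has the authoritative install command for the nightly wheel, and the exact load call may still differ slightly until the release is final):

```python
# Minimal inference sketch for the rewritten Python bindings, following
# the pattern shown in train.ipynb. Install the nightly wheel as
# described in the notebook first.
from nnsplit import NNSplit

# Load a packaged model by language code (the notebook also shows how to
# load a model you trained yourself).
splitter = NNSplit.load("en")

splits = splitter.split(["This is a test This is another test."])[0]

# Each split can be iterated into smaller units or turned into a string.
for sentence in splits:
    print(str(sentence))
```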

I opted for building nightly wheels from master instead of doing a pre-release since it isn't quite ready yet. Still missing is:

So it will probably take some more time until the release ;)

aguang-xyz commented 4 years ago

Good to hear about your progress! I'd be glad to learn from you.

bminixhofer commented 4 years ago

Great! You can actually go ahead and try the notebook now if you want. It shows training on a custom dataset and loading the model for inference.

bminixhofer commented 4 years ago

I just released the new version!

I also finished the evaluation but I'm not 100% satisfied with the results yet, I'll retrain the model next week when I have access to better hardware (and maybe see if I can improve the architecture a bit).

Feel free to train models for other languages :)

EmilStenstrom commented 4 years ago

Not sure if this is the correct issue to ask for this, but pre-trained models for more languages would be fantastic! It's 100x easier to just use a pre-trained model compared to training one from scratch...

bminixhofer commented 4 years ago

Yes it is! I'll soon™ have access to better hardware again. I'll try improving the models then and can also train models for other languages if there is demand.

Is there a specific language you would need a model for? I can add a wish list for languages at the top of this issue, but I won't just train models for every language I can think of if there's no one who will use them.

Also: The train.ipynb notebook linked above walks you through training a new model. Feel free to contribute any models trained this way!

EmilStenstrom commented 4 years ago

I'm currently exploring different libraries to build my project on, so having pre-trained models available here is of course a plus for choosing nnsplit. I understand that providing pre-trained models for many languages is a bit of a chicken-and-egg problem: if they exist, you increase the chance that they'll be used, but before you know they will be used, it's not worth the effort :)

I'm looking for support for Swedish and Norwegian, so I'll add them to the "wishlist" you suggest! :)

Thanks for all your hard work!

bminixhofer commented 4 years ago

Thanks, great. I created a separate issue (#11) since I realized I didn't open this one and added the languages there.