jkwill87 / mnamer

media file renaming and organizing tool
https://pypi.org/project/mnamer
MIT License
750 stars, 63 forks

Guessing subtitle language based on subtitle text #287

Open big-eater opened 6 months ago

big-eater commented 6 months ago

I saw the existing suggestion in https://github.com/jkwill87/mnamer/issues/130 and also think that something like that would be a great addition.

I have tried implementing it and will create a pull request. If you think that it fits in this project but don't like something about how it's implemented, or feel that something is missing (e.g. tests and usage documentation), let me know. And if you think "ah, great idea, but I want just this or that part", or it inspires you to do something similar, feel free to take any part of it. I don't care about getting credit for the implementation; I would just be happy that it's available.

I tried a few different text language guessers. Many of them are not easy to install on some platforms. Notably, I don't have access to a Windows machine, so I have not tested the installation on Windows. For that reason, I think it would be good to have a few different options.

I tried CLDv3 (suggested in https://github.com/jkwill87/mnamer/issues/130) but gave up on it, because it doesn't currently work with Python 3.11 / 3.12.

These are the language guessers that I integrated: langid, lingua, fasttext, and langdetect.

To try it out, a version can be installed like this.

Install all guessers:

pip install mnamer[guess_all]@git+https://git@github.com/big-eater/mnamer.git@subtitle-text-guesser

Or, install one or more of guess_langid, guess_lingua, guess_fasttext, guess_langdetect, for example:

pip install mnamer[guess_langid]@git+https://git@github.com/big-eater/mnamer.git@subtitle-text-guesser
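Since each guesser is an optional extra, it can help to check which ones are actually importable before picking one. The following is only a minimal sketch and is not taken from the pull request: the `GUESSER_MODULES` mapping and the `available_guessers` helper are hypothetical, and the sketch assumes that each library's import name matches the guesser name.

```python
import importlib.util

# Hypothetical mapping from guesser name to importable module name.
# Assumption: the import names match the guesser names used in the extras.
GUESSER_MODULES = {
    "langid": "langid",
    "lingua": "lingua",
    "fasttext": "fasttext",
    "langdetect": "langdetect",
}


def available_guessers(find_spec=importlib.util.find_spec):
    """Return the guesser names whose module can be located, without importing it."""
    found = []
    for name, module in GUESSER_MODULES.items():
        try:
            if find_spec(module) is not None:
                found.append(name)
        except (ImportError, ValueError):
            # Treat a module that cannot be inspected as unavailable.
            continue
    return found


if __name__ == "__main__":
    print(available_guessers())
```

The `find_spec` parameter exists only so the check can be exercised without installing any of the libraries.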

To use it, specify a guesser when running mnamer:

mnamer --test --batch --subtitle-lang-guesser=langid /path/to/files

mnamer will only try to guess the language from the text in the subtitle file if it was unable to determine the language from the file name.
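The file-name-first behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull request: the function names are hypothetical, the file name check is reduced to a two-letter code before the extension (for example `movie.en.srt`), and `text_guesser` stands in for whichever guesser was selected.

```python
import re
from pathlib import Path
from typing import Callable, Optional

# Hypothetical pattern: a two-letter code right before the extension.
# The real file name parsing in mnamer is more involved than this.
_LANG_IN_NAME = re.compile(r"\.([a-z]{2})\.[^.]+$", re.IGNORECASE)


def language_from_name(path: Path) -> Optional[str]:
    """Return the language code embedded in the file name, if any."""
    match = _LANG_IN_NAME.search(path.name)
    return match.group(1).lower() if match else None


def subtitle_language(
    path: Path, text_guesser: Optional[Callable[[str], Optional[str]]] = None
) -> Optional[str]:
    """Prefer the file name; fall back to guessing from the subtitle text."""
    lang = language_from_name(path)
    if lang is not None or text_guesser is None:
        # Either the name was enough, or no guesser was requested.
        return lang
    text = path.read_text(encoding="utf-8", errors="ignore")
    return text_guesser(text)
```

Passing the guesser as a plain callable keeps the fallback independent of which library is installed; without a guesser, the function returns only what the file name provides.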