pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Transformer models for Language Detection #137

Closed: ArtanisTheOne closed this issue 1 year ago

ArtanisTheOne commented 1 year ago

I've been experimenting with language detection for a few months because a translation project of mine requires accurate detection: if the wrong language is detected, text goes down an incorrect pipeline and the person who requested the translation gets nonsense back. That led me to language detection libraries like lingua, but balancing accuracy with latency is incredibly hard, as you're well aware. Lingua is amazing, and I thank the maintainers/developers for it, but in many cases the detection latency makes it unusable, especially in a production environment where people expect results instantly (the downside of the internet, I guess). To solve this for myself, I fine-tuned mT5 (I've only used the small version so far), a pre-trained model from Google that saw 101 languages during its unsupervised pretraining phase. It's still training right now, but early results (a day in) are similar to lingua's low-accuracy mode (using lingua's three classes of test sets). I still need to conduct proper testing by incorporating the model's execution into your accuracy reporter (thanks for that, btw).
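For context, here's a minimal sketch of how language ID can be framed as a text-to-text task with the Hugging Face Trainer. This is an illustrative reconstruction rather than my exact training script; the toy `train_pairs` data, `output_dir`, and hyperparameters are all placeholders:

```python
# Sketch: language ID as text-to-text with mT5. Illustrative only; the toy
# train_pairs data, output_dir, and hyperparameters are placeholders.
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

train_pairs = [("Hallo Welt", "de"), ("Hello world", "en")]  # toy examples

def encode(text, code):
    # Input: the raw text. Target: the ISO language code as a short sequence.
    features = tokenizer(text, truncation=True, max_length=128)
    features["labels"] = tokenizer(text_target=code).input_ids
    return features

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="./mt5-langid",
                                  per_device_train_batch_size=8),
    train_dataset=[encode(t, c) for t, c in train_pairs],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```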

Once fine-tuned with the Hugging Face Trainer API, the model can be converted to CTranslate2, a library that provides outstanding support for Transformer inference and that I already use for my translation projects. That makes fast inference possible on CPU where a GPU isn't accessible (optimized CPU throughput ends up similar to unoptimized GPU throughput). Thanks to CTranslate2's efficiency, latency is low for what you'd expect of a large machine-learning model pipeline (a tradeoff the README mentions). And it can use CPU or GPU, so those with a GPU can speed up detections even more. I need to conduct further testing on throughput and accuracy (training is still running, so I can't take accurate throughput measurements yet).
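To make the CTranslate2 part concrete, the conversion and batched inference would look roughly like this (the paths and the fine-tuned checkpoint name are placeholders; the token-level input format follows the CTranslate2 docs):

```python
# Sketch: CPU inference with CTranslate2. "./mt5-langid" and "./mt5-langid-ct2"
# are placeholder paths for the fine-tuned model and its converted form.
#
# One-time conversion (shell):
#   ct2-transformers-converter --model ./mt5-langid --output_dir ./mt5-langid-ct2
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("./mt5-langid")
translator = ctranslate2.Translator("./mt5-langid-ct2", device="cpu",
                                    intra_threads=4, inter_threads=2)

texts = ["Bonjour tout le monde", "Guten Morgen"]
# CTranslate2 consumes token strings rather than token ids.
batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
results = translator.translate_batch(batch, max_decoding_length=4)
codes = [tokenizer.convert_tokens_to_string(r.hypotheses[0]) for r in results]
print(codes)  # e.g. ["fr", "de"]
```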

To sum up:

Pros

  1. Faster detection
  2. Efficient detection batching
  3. Ability to suppress detections of specific languages (suppress_sequences in the translate_batch method; see the sketch after this list)
  4. Selection of GPU or CPU (as well as intra_threads and inter_threads if required)
  5. Low memory usage (295 MB model file on disk after CTranslate2 conversion)
  6. Transformer neural model, possibly able to pick up on nuances of language that statistical n-gram models may not
  7. Utilization of a pre-trained Transformer model that has seen vast amounts of data across its 101 pretraining languages
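Regarding pros 3 and 4, here's roughly how suppression and device selection look in CTranslate2 (this continues the snippet above; the banned codes are arbitrary examples):

```python
# Sketch (continues the snippet above): suppress specific language codes and
# opt into GPU execution. The banned codes here are arbitrary examples.
import ctranslate2

banned = [tokenizer.tokenize(code) for code in ["de", "fr"]]
gpu_translator = ctranslate2.Translator("./mt5-langid-ct2", device="cuda")
results = gpu_translator.translate_batch(batch,
                                         suppress_sequences=banned,
                                         max_decoding_length=4)
```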

Cons

  1. No detection confidence scores out of the box (the only option is score_batch, which returns per-token log scores rather than probabilities summing to 1, and some limited testing of mine ran into issues with it)
  2. One unified model: fine-tuning or adding languages means fine-tuning the entire model (so the fine-tuning data has to include all languages to prevent catastrophic forgetting)
  3. Possibly heavier use of computer resources (it's a 300M-parameter model, so it does need some resources)
  4. A ghost of a chance that the model outputs sequences (inferences) that aren't a language code, a con of seq2seq versus classification models, I suppose; I need to investigate this further, but it did not affect accuracy results at all (see the sketch after this list for a simple guard)
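For con 4, a simple guard is to validate the decoded string against the known label set and treat anything else as an unreliable detection. A sketch, reusing the tokenizer and translator from the snippets above; VALID_CODES is a stand-in for the full set of fine-tuned codes:

```python
# Sketch: reject decoder outputs that aren't a known language code (con 4).
# VALID_CODES is a stand-in for the full set of fine-tuned codes.
VALID_CODES = {"en", "de", "fr", "es"}

def detect(text):
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    result = translator.translate_batch([tokens], max_decoding_length=4)[0]
    code = tokenizer.convert_tokens_to_string(result.hypotheses[0]).strip()
    return code if code in VALID_CODES else None  # None = unreliable output
```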

Neutral (couldn't decide whether it's a con or a pro)

  1. Relatively low training time: my model covers ~97 languages (~9.7M training examples in total) and showed really competitive results at the 20-hour mark on an RTX 3090

Let me know if there's any interest in the results or the model; I just thought this was something worth sharing.

References: mT5 paper · CTranslate2 docs