https://github.com/pemistahl/lingua can be used, with the top 10 most spoken languages (to keep memory usage down and accuracy high)
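For illustration, a minimal sketch of what that could look like with lingua's builder API (the language list here is just a guess at a top-10 pick):

```kotlin
import com.github.pemistahl.lingua.api.Language.*
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun main() {
    // Illustrative top-10 subset; the real list would have to match the
    // languages most commonly seen on the bug tracker.
    val detector = LanguageDetectorBuilder
        .fromLanguages(
            ENGLISH, SPANISH, PORTUGUESE, FRENCH, GERMAN,
            RUSSIAN, CHINESE, JAPANESE, KOREAN, ITALIAN
        )
        .build()

    println(detector.detectLanguageOf("El juego se bloquea al abrir un cofre")) // SPANISH
}
```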
Thanks for that link! That seems like the way to go. I'm not sure yet what exactly Arisa should do when it detects a non-English or an unknown language. I guess Arisa could post a message translated into the detected language if it detects a known non-English language. But what if it's not one in our list? Just assume that it's not English, or leave it open for a human to check?
I would say if it's more than 90% confident, to just close as invalid
Do note we have to include English for that
and not translate, that's not useful 😓
I think it would be useful for the closure comment to be included in both the detected language and English, if we can reliably do such translation.
The other thing to be mindful of is tickets that have foreign characters (e.g. font issues including MC-148898), and translation issues on the WEB project (where they're valid, unlike with MC). Possibly that can just be solved by only resolving them on first creation, and if they're reopened then they won't be resolved again (since I don't think it's likely that a user will edit a ticket that was originally in English to be in some other language).
Yes, that's something I'd need to check, I'm not sure if lingua handles loanwords correctly. If it doesn't, I'd probably need to split the text up into multiple different parts, analyze them separately and only trigger the bot if more than 80% or so of these parts are not in English according to lingua. But as far as I can tell, lingua doesn't simply check which characters occur in the text; it seems to take a more sophisticated approach.
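A rough sketch of that splitting idea (the chunking regex and the 80% threshold are placeholders to tune):

```kotlin
import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetector

// Split on blank lines and sentence boundaries, classify each chunk,
// and only flag the ticket if enough chunks are non-English.
fun isMostlyNonEnglish(text: String, detector: LanguageDetector, threshold: Double = 0.8): Boolean {
    val chunks = text
        .split(Regex("""\n{2,}|(?<=[.!?])\s+"""))
        .filter { it.isNotBlank() }
    if (chunks.isEmpty()) return false
    val nonEnglish = chunks.count { detector.detectLanguageOf(it) != Language.ENGLISH }
    return nonEnglish.toDouble() / chunks.size >= threshold
}
```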
As for the resolution comments, I too think that it would be good to include the comment in English as well. For some languages we can do the translations ourselves (Chinese, Dutch, German, Spanish, and maybe French and Japanese too), and for others we could also just ask the Crowdin proofreaders; it shouldn't be too big of a task to translate. And if a language doesn't have the message translated, we can simply use only the English one.
About machine translation: yeah, I looked a bit into it, and from what I can tell it's not worth the effort. We'd probably need to pay to use the translation API once we cross a certain number of requests, and apart from that there's no guarantee that the translation will actually be useful, especially with regard to bugs. (And tickets in other languages are usually not valid anyway.)
Directly resolving a report might be a little bit dismissive, especially for people who have not used Mojira much before, so we should probably at least try to keep the false positive rate low. As pokechu pointed out, there are cases where reports which are partially in a different language are valid:
Especially when the reporter tries to explain why a translation is incorrect and proposes the correct translation, the module might easily flag it incorrectly as non-English. Therefore 80% might be a little bit low. Though we will probably find a good value anyway after having it run for some time.
Also there have been cases where the reporter translated their issue when they were informed that non-English reports are not allowed. Should they be instructed to create a separate report then?
Perhaps the bot could still listen for updates after it resolved the ticket for being in the incorrect language, and then apply the language check again (and if the ticket is in English then, reopen it). But that might be out of scope for this issue and we can always add this later as well.
From lingua:
Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text.
I don't know if these rule-based approaches kick in if there is only a small number of characters from a different character set. It's probably best to test it on some of the cases and example tickets that were mentioned here.
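A quick experiment along those lines could look like this (the ticket text is made up, modelled on the MC-148898 font case, and the detector is one built as in the earlier sketch):

```kotlin
import com.github.pemistahl.lingua.api.LanguageDetector

// An otherwise English ticket containing a single CJK character; if the
// rule-based engine overreacts, this would not come back as ENGLISH.
fun checkFontTicket(detector: LanguageDetector) {
    val fontIssue = "The character 鎷 is rendered with the wrong width in chat, see the attached screenshot."
    println(detector.detectLanguageOf(fontIssue))
}
```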
Perhaps the bot could still listen for updates after it resolved the ticket for being in the incorrect language, and then apply the language check again (and if the ticket is in English then, reopen it). But that might be out of scope for this issue and we can always add this later as well.
That should be done in a different module.
Also worth considering are instances where the reporter includes both a foreign language and an English translation upon creation, such as https://bugs.mojang.com/browse/MCPE-47915
Okay, so I looked into this for a bit, and lingua only returns a single language that it thinks is most likely correct. It does not provide a certainty value or a list of possible languages. Thus it probably is not really fit for what we need right now. There's a feature request about this in the lingua repository already though, and it's scheduled for the next release of lingua (0.7.0): lingua#11. However, lingua seems to be the simplest to use of all libraries that I have investigated.
There are other libraries that can be used for language detection as well. In particular, I looked a bit into Apache Tika and Apache OpenNLP. Both of these are general language processing libraries however. They do not focus on language detection, and thus include a lot of functionality that we don't need. They're probably more difficult to set up as well. But, both of these are still actively developed.
Another one I took a look at is shuyo/language-detection. It seems like it does exactly what we want; however, it has not been developed since 2014. Personally, from a quick look, I don't like how it is programmed; it seems like you need a new detector object for every string you want to detect the language of.
There's also optimaize/language-detector. Its last commit was in 2016. According to lingua's readme, it is rather inaccurate.
I haven't actually tested any of the libraries to see how accurate they are yet. I also don't know how intensive the libraries are when it comes to memory usage.
We now have multiple options:

- Use lingua and just trust its output blindly
- Use shuyo/language-detection and deal with it not being updated anymore
- Use optimaize/language-detector and deal with its inaccuracy
- Wait for lingua to be updated in order to continue progress on this issue
- Implement simple language detection ourselves (i.e. determine the Latin vs. non-Latin character ratio)
- Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.

So, in your opinion, how should we proceed?
Implement simple language detection ourselves (i.e. determine the Latin vs. non-Latin character ratio)

This wouldn't detect Spanish reports, and they are quite frequent
Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.
Indeed, I would prefer this
Use lingua and just trust its output blindly
I would go simple for now. And make sure that in the test we log the output (the percentage that lingua thinks matches).
Also, what is the CPU and memory usage of lingua? I would like to keep it under 100 MB 😓
Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.
This would also be my preference.
We could add a comment along the lines of MEQS_ARISA_LANGUAGE <language it detected> with the confidence percentage!
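A hypothetical sketch of how such a comment could be built once confidence values are exposed (computeLanguageConfidenceValues is the API that lingua#11 is expected to add, so this won't compile against 0.6.x and is speculative):

```kotlin
import com.github.pemistahl.lingua.api.LanguageDetector

// Hypothetical helper; depends on the confidence-values API from lingua#11.
fun buildTestComment(detector: LanguageDetector, text: String): String {
    val (language, confidence) = detector
        .computeLanguageConfidenceValues(text)
        .entries.first() // map is sorted by descending confidence
        .toPair()
    return "MEQS_ARISA_LANGUAGE $language (${"%.1f".format(confidence * 100)}%)"
}
```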
If I understood correctly, that's not possible until the update:
Okay, so I looked into this for a bit, and lingua only returns a single language that it thinks is most likely correct. It does not provide a certainty value or a list of possible languages. Thus it probably is not really fit for what we need right now. There's a feature request about this in the lingua repository already though, and it's scheduled for the next release of lingua (0.7.0): lingua#11. However, lingua seems to be the simplest to use of all libraries that I have investigated.
The following workaround might solve this, though it will not be great performance-wise:

- Run once with all languages
- If the detected language is not English, run a second time with only English and the language from step 1 and set a minimum relative distance
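A minimal sketch of that two-pass idea against lingua's builder API (the 0.2 minimum relative distance is an arbitrary placeholder):

```kotlin
import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

// Returns the detected language if the text is confidently non-English,
// or null otherwise.
fun detectConfidentlyNonEnglish(text: String): Language? {
    // Pass 1: all languages enabled. (fromAllLanguages() per the current
    // lingua docs; older releases name this fromAllBuiltInLanguages().)
    val candidate = LanguageDetectorBuilder.fromAllLanguages().build().detectLanguageOf(text)
    if (candidate == Language.ENGLISH || candidate == Language.UNKNOWN) return null

    // Pass 2: English vs. the candidate only, with a minimum relative distance.
    val second = LanguageDetectorBuilder
        .fromLanguages(Language.ENGLISH, candidate)
        .withMinimumRelativeDistance(0.2)
        .build()
        .detectLanguageOf(text)

    // ENGLISH or UNKNOWN (too close to call) both mean: don't resolve the ticket.
    return second.takeIf { it == candidate }
}
```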
As long as we run it only once per ticket (only on creation), then I don't really mind the performance. (Just keep the RAM usage under 100 MB if possible please)
The following workaround might solve this, though it will not be great performance-wise:
- Run once with all languages
- If the detected language is not English, run a second time with only English and the language from step 1 and set a minimum relative distance
Yes, that might work, and from what I can tell it should also be possible to use minimum relative distance without needing to run the language detection for a second time.
However, I've played around a bit with lingua's language detection without using minimum distance. It seems to be fairly accurate, as long as it doesn't encounter a language that isn't in its list of languages, which I don't think would be solved by using that workaround. Therefore, if it isn't too much of a burden on the RAM, it might be best to enable as many languages as possible. Speaking of which...
As long as we run it only once per ticket (only on creation), then I don't really mind the performance. (Just keep the RAM usage under 100 MB if possible please)
Unfortunately I don't have a reliable way of testing how much RAM the bot needs for the language detection. I think this would probably need to be part of the test run. Most language detection libraries probably are not very conservative when it comes to memory usage.
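Failing a proper profiler run, a crude in-process approximation could at least give a ballpark figure (JVM heap numbers, so take them with a grain of salt):

```kotlin
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun usedHeapMb(): Long {
    val rt = Runtime.getRuntime()
    System.gc() // only a hint to the JVM, so treat the numbers as approximate
    return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

fun main() {
    val before = usedHeapMb()
    val detector = LanguageDetectorBuilder.fromAllLanguages().build()
    detector.detectLanguageOf("warm-up call so the language models actually get loaded")
    val after = usedHeapMb()
    println("Detector heap footprint: roughly ${after - before} MB")
}
```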
I have finished working on a prototype for this that could be activated as a first test run of this module, I'll post the PR in a moment.
it should also be possible to use minimum relative distance without needing to run the language detection for a second time.
That might cause false negatives for non-English languages which are similar. I don't have a good example, but let's say language A has 95%, language B 92% and English 10%. Then the minimum relative distance would apply to language A and B and therefore would return "UNKNOWN" even though it is very likely not English. Therefore it appears a second run is needed.
However, I've played around a bit with lingua's language detection without using minimum distance. It seems to be fairly accurate
Have you considered the case where half of the report is in English and the other half is in another language though? In that case it might be language A 50% and English 49%, or similar. Then a minimum relative distance might make sense.
Edit: This is only speculation, maybe these are not actually issues.
Ah, that makes sense, yeah.
Yes, I have a test case for that. But I've now added a few more test tickets where the distribution was roughly half English, half not English, and for those my current version did not work. I'll check whether your workaround mitigates this.
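For illustration, the kind of half-and-half test input in question (made-up ticket text, reusing a detector built as above):

```kotlin
import com.github.pemistahl.lingua.api.LanguageDetector

// Roughly half English, half Spanish; without a minimum relative distance
// the result may flip between ENGLISH and SPANISH on inputs like this.
fun halfAndHalfTest(detector: LanguageDetector) {
    val ticket = """
        The game crashes when I open a chest.
        El juego se bloquea cuando abro un cofre.
    """.trimIndent()
    println(detector.detectLanguageOf(ticket))
}
```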
If you have IntelliJ, it has a profiler. You can see how much each package is using in terms of CPU and RAM.
In theory we could also fork lingua and make the probability accessible (since lingua has that information internally), though that might make maintaining it difficult.
Yes, that's also something that I thought about. But we would need to do major memory utilization optimizations to the library as well, and that's just not worth it.
For context, my version of the module in #104 used lingua, but the module used up so much memory that even the GitHub pipeline failed.
We've now implemented a test version of the module that uses the Dandelion online translation API instead (#106).
This may be difficult to implement, but a great feature would be if Arisa would automatically detect if a ticket has been reported in a language other than English. For languages which don't use a Latin script this is easy to do (just check whether 90% or more of the characters in the ticket are non-ASCII), but there also exist Java libraries that provide the option to determine what language a text is from.
Either Arisa would just close these tickets as Invalid (easier to implement) or it would use machine translation in order to translate the ticket (difficult to implement).
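A sketch of that non-Latin-script shortcut (the 90% threshold is from the description above; the treatment of whitespace is an assumption):

```kotlin
// Flags a ticket when at least 90% of its non-whitespace characters
// fall outside the ASCII range.
fun isMostlyNonAscii(text: String, threshold: Double = 0.9): Boolean {
    val chars = text.filterNot { it.isWhitespace() }
    if (chars.isEmpty()) return false
    val nonAscii = chars.count { it.code > 127 } // Char.code requires Kotlin 1.5+
    return nonAscii.toDouble() / chars.length >= threshold
}
```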