Automatically detect tickets in other languages

violine1101 commented 4 years ago

This may be difficult to implement, but a great feature would be if Arisa would automatically detect if a ticket has been reported in a language other than English. For languages which don't use a Latin script this is easy to do (just check whether 90% or more of the characters in the ticket are non-ASCII), but there also exist Java libraries that can determine which language a text is written in.
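For illustration, the non-ASCII check described above could look roughly like the following. This is a minimal sketch and not code from Arisa: the class and method names are hypothetical, and ignoring whitespace when counting characters is an assumption made here so that spaces and line breaks do not dilute the ratio.

```java
// Hypothetical helper, not part of Arisa: flags text that is mostly non-ASCII.
final class NonAsciiHeuristic {
    private NonAsciiHeuristic() {}

    /** Fraction of non-whitespace code points outside the ASCII range (0.0 for empty text). */
    static double nonAsciiRatio(String text) {
        // Assumption: whitespace is ignored so it does not count as "ASCII text".
        long total = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
        if (total == 0) {
            return 0.0;
        }
        long nonAscii = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp) && cp > 127)
                .count();
        return (double) nonAscii / total;
    }

    /** True if at least `threshold` (e.g. 0.9) of the counted characters are non-ASCII. */
    static boolean looksNonLatin(String text, double threshold) {
        return nonAsciiRatio(text) >= threshold;
    }
}
```

Calling `NonAsciiHeuristic.looksNonLatin(description, 0.9)` would flag a ticket written entirely in, say, Chinese characters, but it would let through languages that are written mostly with ASCII letters, which is why a dedicated detection library is still worth considering.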

Either Arisa would just close these tickets as Invalid (easier to implement) or it would use machine translation in order to translate the ticket (difficult to implement).

urielsalis commented 4 years ago

https://github.com/pemistahl/lingua can be used, limited to the 10 most widely spoken languages (to keep memory usage low and accuracy high)

violine1101 commented 4 years ago

Thanks for that link! That seems like the way to go. I'm not sure yet what exactly Arisa should do when it detects a non-English or an unknown language. I guess Arisa could post a message that has been translated to the detected language if a known non-English language is detected. But what if it's not one in our list? Just assume that it's not English or leave it open for a human to check?

urielsalis commented 4 years ago

I would say if it's more than 90% confident, just close as invalid

urielsalis commented 4 years ago

Do note we have to include English for that

urielsalis commented 4 years ago

and not translate, that's not useful 😓

Pokechu22 commented 4 years ago

I think it would be useful for the closure comment to be included in both the detected language and English, if we can reliably do such translation.

The other thing to be mindful of is tickets that have foreign characters (e.g. font issues including MC-148898), and translation issues on the WEB project (where they're valid, unlike with MC). Possibly that can just be solved by only resolving them on first creation, and if they're reopened then they won't be resolved again (since I don't think it's likely that a user will edit a ticket that was originally in English to be in some other language).

violine1101 commented 4 years ago

Yes, that's something I'd need to check, I'm not sure if lingua handles loanwords correctly. If it doesn't, I'd probably need to split the text up into multiple parts, analyze them separately and only trigger the bot if more than 80% or so of these parts are not in English according to lingua. But as far as I can tell, lingua doesn't simply check which characters the text contains, it seems to take a more sophisticated approach.
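The "split into parts and count" idea could be sketched as follows. This is only an illustration under stated assumptions: the detector is passed in as a plain function that returns a language name such as `"ENGLISH"` (a stand-in for whatever library ends up being used, not lingua's actual API), the text is split on line breaks, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: analyze each part of a ticket separately.
final class ChunkedLanguageCheck {
    private ChunkedLanguageCheck() {}

    /** Share of non-empty lines that the given detector does not classify as English. */
    static double nonEnglishShare(String text, Function<String, String> detect) {
        // Assumption: one "part" is one non-empty line of the ticket text.
        List<String> parts = Arrays.stream(text.split("\\R+"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
        if (parts.isEmpty()) {
            return 0.0;
        }
        long nonEnglish = parts.stream()
                .filter(p -> !"ENGLISH".equals(detect.apply(p)))
                .count();
        return (double) nonEnglish / parts.size();
    }

    /** Trigger only if more than `threshold` (e.g. 0.8) of the parts are not English. */
    static boolean shouldTrigger(String text, Function<String, String> detect, double threshold) {
        return nonEnglishShare(text, detect) > threshold;
    }
}
```

With this shape, a mostly English report that quotes a single foreign-language line stays well below the threshold, while a report written entirely in another language exceeds it.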

As for the resolution comments, I too think that it would be good to include the comment in English as well. For some languages we can do the translations ourselves (Chinese, Dutch, German, Spanish, and maybe French and Japanese too), and for others we could also just ask the Crowdin proofreaders, it shouldn't be too big of a task to translate. And if a language doesn't have the message translated, we can simply use only the English one.

About machine translation, yeah, I looked a bit into it and from what I can tell it's not worth the effort. We'd probably need to pay to use the translation API when we cross a certain amount of requests, and apart from that there's no guarantee that the translation will actually be useful, especially with regards to bugs. (And tickets in other languages are usually not valid anyway)

Marcono1234 commented 4 years ago

Directly resolving a report might be a little bit dismissive especially for people who have not used Mojira much before, so we should probably at least try to keep the false positive rate low. As pokechu pointed out there are cases where reports which are partially in a different language are valid:

- WEB translation issues
- MC font issues (with non-English characters)
- MC splash text issues
- MC translation issues which are not resolvable on Crowdin

Especially when the reporter tries to explain why a translation is incorrect and proposes the correct translation, the module might easily flag it incorrectly as non-English. Therefore 80% might be a little bit low. Though we will probably find a good value anyway after having it run for some time.

Also there have been cases where the reporter translated their issue when they were informed that non-English reports are not allowed. Should they be instructed to create a separate report then?

violine1101 commented 4 years ago

Perhaps the bot could still listen for updates after it resolved the ticket for being in the incorrect language, and then apply the language check again (and if the ticket is in English then, reopen it). But that might be out of scope for this issue and we can always add this later as well.

NeunEinser commented 4 years ago

From lingua:

> Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text.

I don't know if these rule-based approaches kick in if there is only a small number of characters from a different character set. It's probably best to test it on some of the cases and example tickets that were mentioned here.

> Perhaps the bot could still listen for updates after it resolved the ticket for being in the incorrect language, and then apply the language check again (and if the ticket is in English then, reopen it). But that might be out of scope for this issue and we can always add this later as well.

That should be done in a different module.

NeunEinser commented 4 years ago

Also worth considering are instances where the reporter includes both a foreign language and an English translation upon creation, such as https://bugs.mojang.com/browse/MCPE-47915

violine1101 commented 4 years ago

Okay, so I looked into this for a bit, and lingua only returns a single language that it thinks is most likely correct. It does not provide a certainty value or a list of possible languages. Thus it probably is not really fit for what we need right now. There's a feature request about this in the lingua repository already though, and it's scheduled for the next release of lingua (0.7.0): lingua#11. However, lingua seems to be the simplest to use of all libraries that I have investigated.

There are other libraries that can be used for language detection as well. In particular, I looked a bit into Apache Tika and Apache OpenNLP. Both of these are general language processing libraries however. They do not focus on language detection, and thus include a lot of functionality that we don't need. They're probably more difficult to set up as well. But, both of these are still actively developed.

Another one I took a look at is shuyo/language-detection. It seems like it does exactly what we want, however it has not been developed since 2014. Personally, from a quick look I don't like how it is programmed, it seems like you need a new detector object for every string you want to detect the language of.

There's also optimaize/language-detector. Its last commit was in 2016. According to lingua's readme, it is rather inaccurate.

I haven't actually tested any of the libraries to see how accurate they are yet. I also don't know how intensive the libraries are when it comes to memory usage.

We now have multiple options:

- Use lingua and just trust its output blindly
- Use Tika or OpenNLP and deal with their complexity
- Use shuyo/language-detection and deal with it not being updated anymore
- Use optimaize/language-detector and deal with its inaccuracy
- Wait for lingua to be updated in order to continue progress on this issue
- Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.
- Look for another language detection library
- Implement simple language detection ourselves (i.e. determine Latin vs non-Latin character ratio)

So, in your opinion, how should we proceed?

urielsalis commented 4 years ago

> Implement simple language detection ourselves (i.e. determine Latin vs non-Latin character ratio)

This wouldn't detect Spanish reports, and they are quite frequent

> Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.

Indeed I would prefer this

> Use lingua and just trust its output blindly

I would keep it simple for now. And make sure that during the test we log the confidence (%) that lingua assigns to its match.

Also, what is the CPU and memory usage of lingua? I would like to keep it under 100 MB 😓

NeunEinser commented 4 years ago

> Use lingua but only to test how accurate it is. It'd only leave a hidden comment that tells us what it would have done. After the testing phase we know how accurate lingua is and can decide in what way we want to proceed. Of course this option could also be used for any of the other libraries, it certainly wouldn't hurt to have a testing phase before the module goes into effect.

This would also be my preference. We could add a comment along the lines of MEQS_ARISA_LANGUAGE <language it detected>

urielsalis commented 4 years ago

with the confidence percentage!

NeunEinser commented 4 years ago

If I understood correctly that's not possible till the update:

> Okay, so I looked into this for a bit, and lingua only returns a single language that it thinks is most likely correct. It does not provide a certainty value or a list of possible languages. Thus it probably is not really fit for what we need right now. There's a feature request about this in the lingua repository already though, and it's scheduled for the next release of lingua (0.7.0): lingua#11. However, lingua seems to be the simplest to use of all libraries that I have investigated.

Marcono1234 commented 4 years ago

The following workaround might solve this, though it will not be great performance-wise:

1. Run once with all languages
2. If the detected language is not English, run a second time with only English and the language from step 1 and set a minimum relative distance
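The two steps above could be wired together roughly as shown below. This is a sketch under assumptions, not Arisa's implementation and not lingua's API: both detectors are passed in as plain functions (hypothetical stand-ins for a detector built over all languages and one restricted to English plus one candidate with a minimum relative distance), and they are assumed to return language names such as `"ENGLISH"` or `"UNKNOWN"`.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch of the two-pass workaround.
final class TwoPassLanguageCheck {
    static final String ENGLISH = "ENGLISH";
    static final String UNKNOWN = "UNKNOWN";

    private TwoPassLanguageCheck() {}

    /**
     * @param detectAmongAll       step 1: detector over all enabled languages
     * @param detectAgainstEnglish step 2: (text, candidate) -> result of a detector restricted
     *                             to English and the candidate, configured with a minimum
     *                             relative distance; assumed to return UNKNOWN when the two
     *                             languages score too close to each other
     * @return true only if both passes agree on the same non-English language
     */
    static boolean isConfidentlyNonEnglish(
            String text,
            Function<String, String> detectAmongAll,
            BiFunction<String, String, String> detectAgainstEnglish) {
        String candidate = detectAmongAll.apply(text);
        if (ENGLISH.equals(candidate) || UNKNOWN.equals(candidate)) {
            return false; // nothing to confirm, leave the ticket alone
        }
        String confirmed = detectAgainstEnglish.apply(text, candidate);
        return candidate.equals(confirmed);
    }
}
```

The second pass costs an extra detection run, but it only happens for tickets whose first result was not English.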

urielsalis commented 4 years ago

As long as we run it only once per ticket (only on creation), then I don't really mind the performance (just keep the RAM usage under 100 MB if possible please)

violine1101 commented 4 years ago

> The following workaround might solve this, though it will not be great performance-wise:
>
> 1. Run once with all languages
> 2. If the detected language is not English, run a second time with only English and the language from step 1 and set a minimum relative distance

Yes, that might work, and from what I can tell it should also be possible to use minimum relative distance without needing to run the language detection for a second time.

However, I've played around a bit with lingua's language detections without using minimum distance. It seems to be fairly accurate, as long as it doesn't encounter a language that isn't in its list of languages, which I don't think would be solved by using that workaround. Therefore, if it isn't too much of a burden on the RAM, it might be best to enable as many languages as possible. Speaking of which...

> As long as we run it only once per ticket (only on creation), then I don't really mind the performance (just keep the RAM usage under 100 MB if possible please)

Unfortunately I don't have a reliable way of testing how much RAM the bot needs for the language detection. I think this would probably need to be part of the test run. Most language detection libraries probably are not very conservative when it comes to memory usage.

I have finished working on a prototype for this that could be activated as a first test run of this module, I'll post the PR in a moment.

Marcono1234 commented 4 years ago

> it should also be possible to use minimum relative distance without needing to run the language detection for a second time.

That might cause false negatives for non-English languages which are similar. I don't have a good example, but let's say language A has 95%, language B 92% and English 10%. Then the minimum relative distance would apply to language A and B and therefore would return "UNKNOWN" even though it is very likely not English. Therefore it appears a second run is needed.
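The numbers in this example can be played through with a small model. The following sketch is an assumption-laden simplification: it treats the "relative distance" as the plain difference between the two highest scores, which is not necessarily how lingua computes it, and the class name, the language labels and the 0.1 threshold are made up for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical model of a "minimum relative distance" rule, not lingua's actual algorithm.
final class RelativeDistanceDemo {
    private RelativeDistanceDemo() {}

    /** Returns the top-scoring language, or "UNKNOWN" if the runner-up is closer than minDistance. */
    static String pick(Map<String, Double> scores, double minDistance) {
        List<Map.Entry<String, Double>> ranked = scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toList());
        if (ranked.isEmpty()) {
            return "UNKNOWN";
        }
        if (ranked.size() == 1) {
            return ranked.get(0).getKey();
        }
        // Assumption: distance is simply the gap between the best and second-best score.
        double gap = ranked.get(0).getValue() - ranked.get(1).getValue();
        return gap >= minDistance ? ranked.get(0).getKey() : "UNKNOWN";
    }
}
```

Under this model, a single run over `{A: 0.95, B: 0.92, ENGLISH: 0.10}` with a minimum distance of 0.1 yields `UNKNOWN`, because A and B are only 0.03 apart. A second run over just `{A: 0.95, ENGLISH: 0.10}` yields `A`, since the gap to English is 0.85.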

> However, I've played around a bit with lingua's language detections without using minimum distance. It seems to be fairly accurate

Have you considered the case where half of the report is in English and the other half is in another language though? In that case it might be language A 50% and English 49%, or similar. Then a minimum relative distance might make sense.

Edit: This is only speculation, maybe these are not actually issues.

violine1101 commented 4 years ago

Ah, that makes sense, yeah.

Yes, I have a test case for that. But I've now added a few more test tickets where the distribution was roughly half English, half not English, and for those my current version did not work. I'll check whether your workaround mitigates this.

urielsalis commented 4 years ago

If you have IntelliJ, it has a profiler. You can see how much CPU and RAM each package is using

Marcono1234 commented 4 years ago

In theory we could also fork lingua and make the probability accessible (since lingua has that information internally), though that might make maintaining it difficult.

violine1101 commented 4 years ago

Yes, that's also something that I thought about. But we would need to do major memory utilization optimizations to the library as well, and that's just not worth it.

For context, my version of the module in #104 used lingua, but it used up way too much memory, so much that even the GitHub pipeline failed.

We've now implemented a test version of the module that uses the Dandelion online translation API instead (#106).

mojira / arisa-kt

Automatically detect tickets in other languages #60