Cedric-Boucher opened 11 months ago

I had multiple entire conversations with a friend entirely in French, yet the analytics for our message history show only English on the recognized-languages page. I'm curious: how does the language detection work? Maybe it struggles to differentiate between French and English? It's really odd, though: a fully French conversation consists mostly of non-English French words, including many accented words (which don't appear in any English word), so I thought it wouldn't be very difficult for even a simple algorithm to detect the difference.
Language identification is done using Facebook's fastText on groups of messages in the MessageProcessor. The final calculations (including whether the identified languages are reliable enough) are done in LanguageStats, which produces the aggregate data actually included in the report.
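For reference, here's a rough sketch of that flow; identifyLanguage() is a hypothetical stand-in for the fastText model, and the names and shapes here are my assumptions, not the actual chat-analytics API:

```ts
// Hypothetical stand-in for the fastText language-identification model:
// returns a language code and the model's confidence for a block of text.
interface LangPrediction {
    lang: string;       // e.g. "en", "fr"
    confidence: number; // 0..1
}
declare function identifyLanguage(text: string): LangPrediction;

// MessageProcessor-style step: classify a whole group of messages at once,
// because a single short message rarely carries enough signal on its own.
function classifyGroup(messages: string[]): LangPrediction {
    return identifyLanguage(messages.join(" "));
}

// LanguageStats-style step: count messages per detected language, then keep
// only languages above a prevalence cutoff (3% in the version discussed here).
function aggregate(groups: string[][], cutoff = 0.03): Map<string, number> {
    const counts = new Map<string, number>();
    let total = 0;
    for (const group of groups) {
        const { lang } = classifyGroup(group);
        counts.set(lang, (counts.get(lang) ?? 0) + group.length);
        total += group.length;
    }
    for (const [lang, count] of counts) {
        if (count / total < cutoff) counts.delete(lang);
    }
    return counts;
}
```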
The MessageProcessor tries to identify the language based on large groups of messages within a certain interval (and I believe all from a single author) to improve accuracy. I haven't figured out where these "intervals" are defined, but my guess is that while you had entire conversations in French, the intervals included more than one of those conversations, and enough of each interval was in English that French ended up below the 3% threshold. I'll try to figure out exactly how intervals work, though!
Oh, thanks for the info! Maybe our conversations were too short or something. I'd say the longest one was maybe half an hour of constant messaging, but surrounded by English conversations. Technically, French is a very small portion of our total messages; maybe that affects things too.
Oh, it looks like a new interval is only opened if an existing interval isn't already open (source), and an interval is only closed at the end of an input file (source), so in most cases an interval will span the entirety of a channel's existence. So it's not that your conversations are too short; it's that the intervals are too long.
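To make that concrete, here is a simplified sketch of my reading of the open/close logic (the class and method names are made up, not the real code):

```ts
// Simplified model of the interval behaviour described above: an interval is
// opened only when none is open, and closed only when an input file ends,
// so a single interval usually ends up covering the whole channel.
interface Interval {
    start: number; // timestamp of the first message in the interval
    end?: number;  // set when the interval is closed
}

class IntervalTracker {
    private open?: Interval;
    readonly closed: Interval[] = [];

    onMessage(timestamp: number): void {
        // A new interval is only opened if no interval is currently open.
        if (!this.open) this.open = { start: timestamp };
    }

    onEndOfFile(timestamp: number): void {
        // Intervals are only closed at the end of an input file.
        if (this.open) {
            this.closed.push({ ...this.open, end: timestamp });
            this.open = undefined;
        }
    }
}
```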
Perhaps chat-analytics should detect a sudden change in language and group those messages separately (at least in the context of the languages tab)? So, if the language identifier is very confident that, say, 10 adjacent messages are in one language but the 10 following messages are confidently identified as being in another language, they should be counted separately toward that 3% cutoff. Or maybe it could be based on the pre-existing conversation-detection code used for the interaction tab, instead of on entire message groups?
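A very rough sketch of that idea (hypothetical; it works per message rather than per block of 10, and it doesn't reuse the existing conversation-detection code):

```ts
// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Start a new group whenever the model is confident the language has changed.
function splitByLanguage(messages: string[], minConfidence = 0.9): string[][] {
    const groups: string[][] = [];
    let currentLang: string | undefined;
    for (const message of messages) {
        const { lang, confidence } = identifyLanguage(message);
        const confidentSwitch = confidence >= minConfidence && lang !== currentLang;
        if (groups.length === 0 || confidentSwitch) {
            groups.push([]);
            if (confidence >= minConfidence) currentLang = lang;
        }
        groups[groups.length - 1].push(message);
    }
    return groups;
}
```

Each resulting group would then be classified and counted on its own, so a long French stretch wouldn't be diluted by the surrounding English.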
so, if I understood what you're saying correctly:
Is this correct? I would've thought that language detection would be on a per-message basis, based on some amount of common sense and the fact that there is an analytic for "number of messages where language detection was unreliable (because the message was too short)". I would also have thought that it could detect multiple languages in a chat, given the existence of the "number of languages used" analytic.
I think your idea of grouping messages seems good. One note, though: I have had a few conversations that were mostly English but with some French messages in between, or even single messages that used both languages (Frenglish, basically). I understand that word-by-word language detection doesn't make sense (many spellings exist in multiple languages), but I would think that detecting the language of each entire message would be a good idea (assuming processing time is reasonable).
I believe that's pretty much how it works, yes. Except that it's not based on chats; it's based on files, and I believe a chat can be uploaded via multiple files if it is too large. The "number of languages used" analytic would still make some sense with this behaviour, especially when you consider that you can also upload multiple chats at once. Although I do agree it should be able to pick up multiple languages from a single chat, and I believe mlomb is bilingual, so I would have thought this is something they would have picked up on if it were a mistake. I'll keep looking for a bit just to make sure I'm not misunderstanding it!
I'm not sure I agree about language detection being per-message, though. I don't think this is an issue of processing time (although it could take a bit longer and make the report a bit larger), but rather of accuracy. A single message usually doesn't provide enough information to accurately predict the language. Although perhaps it could be based on a threshold: if the message is long enough, it can form its own language group; otherwise, the detector loosens its definition of a conversation (up to a limit) until it has enough data to work with. I can definitely imagine this being quite complex to code, though.
Interesting that it works like that. I'll try exporting the chat in multiple small files and see if that changes anything.
Currently trying to export the chat with a separate file for each message; that's a lot of files!
That's fair. Maybe it could use a single message when it's long (a paragraph-type message) and group many short messages together when they're below that threshold.
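Something like this, perhaps (a minimal sketch of the word-count threshold idea; the 20-word threshold and function name are just placeholders):

```ts
// Build detection units: a message long enough on its own becomes its own
// unit, while shorter messages are pooled until the pool reaches the threshold.
function buildDetectionUnits(messages: string[], minWords = 20): string[][] {
    const units: string[][] = [];
    let pool: string[] = [];
    let poolWords = 0;

    const wordCount = (text: string) => text.split(/\s+/).filter(Boolean).length;

    for (const message of messages) {
        const words = wordCount(message);
        if (words >= minWords) {
            units.push([message]); // long enough to classify on its own
        } else {
            pool.push(message);
            poolWords += words;
            if (poolWords >= minWords) {
                units.push(pool);
                pool = [];
                poolWords = 0;
            }
        }
    }
    if (pool.length > 0) units.push(pool); // leftover short messages
    return units;
}
```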
I will read this with more time another day, but I want to clarify that messages are grouped per file and per author, in case different authors intercalate languages.
i.e.:
A: hello
A: world
A: !
B: hola
B: mundo
A: yay!
["hello", "world", "!"] → 3 x english
["hola", "mundo"] → 2 x spanish
["yay!"] → 1 x english
oh, I see, interesting. Thanks for the info!
I tried exporting the chat history in groups of 10 messages per file. This way it should definitely be able to see that a whole message group was in French, in theory. The results in the analytics were the same as before: no French detected. Odd.
Can you give me a group of 10 messages so I can check?
Sure. However, I noticed just now that when processing only the files that contain a significant number of French messages, chat-analytics DOES detect French, but when I give it all of the nearly 5000 files, it doesn't detect any French at all. Weird. Giving it two files, one fully English and one fully French, it detects both languages.
In the code below, before the return, I added console.log(code, "|", line); and got:
fr | tu pars bientôt
fr | bientôt oui nous avons regardé la voiture ça va it goes
en | ah
fr | la voiture ça va spécifiquement
fr | est-ce que tu sais c'est quoi qui fait le bruit
fr | non tout est normal pour nous peut-etre un bushing mais maintenant je suis plus rassuré pour le voyage
en | ok
Which seems ok
Can you test that line with all the files and look for something strange?
Sorry for the extremely long delay in getting back to you on this.
I have finally done this. I don't see anything strange (other than the few rare short messages that get incorrect/random languages).
Here's a snippet.
I have also tried adding a console.log before the return here https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/MessageProcessor.ts#L50-L57 to see if it was the message groups that were getting the wrong language. But I see index 41 (English) and 48 (French) in the correct places, and most groups are only 1 message long anyway.
I see a whole bunch of language indices show up in langCounts (database builder) once all the messages have been processed. The highest number for this specific DM is for English (121792) and the second highest number is for French (1528), which are the only two languages we have used. The highest number for a false positive language is 868 for language index 34.
Since the count of French messages is less than 3% of the count of English messages (and therefore also less than 3% of the total message count), French does not show up as a language in our chat even though it has been used a lot: it falls under the 3% cutoff.
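As a quick check with the counts above: 1528 / 121792 ≈ 1.25%, and dividing by the full message total (which also includes the smaller detected languages) only pushes that lower, so French lands well below the 3% cutoff.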
I'm not sure how we would resolve this. Maybe no change is necessary and I should just accept that French isn't detected for my chat?
I wish we could reduce that 3% cutoff, but considering that it thinks we have sent over 800 messages in German (neither of us knows German) and 3 messages in Bosnian (there have actually been a few hundred messages sent in Bosnian), it seems that the language detection model is not accurate enough for that.
If I set the 3% cutoff to 0.1% instead, I get the following:
Note that we have only used English, French, Bosnian (very little), and Japanese (very little) in our conversation.
I have a suggestion: what if, to reduce the noise in the language detection, we skip language detection on messages / message groups that contain only 1 or 2 words? These are usually the messages that get wildly wrong languages like German. Maybe then the 3% cutoff could feasibly be lowered.
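Roughly what I mean, as a sketch (the function names and the minWords value are placeholders):

```ts
// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Count messages per language, but skip units that are too short to classify
// reliably (e.g. one- or two-word messages).
function countLanguages(units: string[][], minWords = 3): Map<string, number> {
    const counts = new Map<string, number>();
    for (const unit of units) {
        const text = unit.join(" ");
        const words = text.split(/\s+/).filter(Boolean).length;
        if (words < minWords) continue; // too short, don't let it add noise
        const { lang } = identifyLanguage(text);
        counts.set(lang, (counts.get(lang) ?? 0) + unit.length);
    }
    return counts;
}
```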
lol why does it think this is German?
I'm noticing that many messages containing "im" (uncapitalized I, no apostrophe) are incorrectly classified as German with very high confidence by the language model...
I have gotten the results for my ~130K-message conversation to this point, and they are very accurate now!
I did this by setting the minimum language model confidence level to 70% (from 0%), setting the minimum word count for language detection to 6 words (from 0), and setting the language threshold (for displaying it at all) to 0.1% (from 3%).
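For clarity, here are the three knobs expressed as a sketch (the constant names are made up, not necessarily what the actual code or the PR uses):

```ts
const MIN_MODEL_CONFIDENCE = 0.7;         // was effectively 0: discard low-confidence predictions
const MIN_WORDS_FOR_DETECTION = 6;        // was 0: skip units shorter than this
const LANGUAGE_DISPLAY_THRESHOLD = 0.001; // was 0.03: show languages above 0.1% of messages

// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Returns the language to count for a detection unit, or undefined if the
// unit is too short or the model isn't confident enough.
function languageToCount(unitText: string): string | undefined {
    const words = unitText.split(/\s+/).filter(Boolean).length;
    if (words < MIN_WORDS_FOR_DETECTION) return undefined;
    const { lang, confidence } = identifyLanguage(unitText);
    return confidence >= MIN_MODEL_CONFIDENCE ? lang : undefined;
}
```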
I created pull request #110.