mlomb / chat-analytics

Generate interactive, beautiful and insightful chat analysis reports
https://chatanalytics.app
GNU Affero General Public License v3.0
708 stars, 51 forks

Language tab doesn't seem to detect french #97

Open Cedric-Boucher opened 11 months ago

Cedric-Boucher commented 11 months ago

I had multiple conversations with a friend entirely in French, yet the analytics for our message history show only English on the recognized languages page. How does the language detection work? Maybe it struggles to differentiate between French and English? It's really odd though: a fully French conversation consists mostly of French words that don't exist in English, many of them with accents (which are not in any English words), so I thought it wouldn't be very difficult for even a simple algorithm to tell the difference.

hopperelec commented 11 months ago

Language identification is done using Facebook's fastText for a group of messages in the MessageProcessor. The final calculations (including whether the identified languages are reliable enough) are done in LanguageStats, which produces the aggregate data actually included in the report.

The MessageProcessor tries to identify the language from large groups of messages within a certain interval (all from a single author, I believe) to improve accuracy. I haven't figured out where these "intervals" are defined, but my guess is that, while you had entire conversations in French, the intervals included more than one of these conversations and more than 3% of these conversations were in English. I'll try to figure out exactly how intervals work though!

Cedric-Boucher commented 11 months ago

oh, thanks for the info! Maybe our conversations were too short or something. I'd say the longest one was maybe half an hour of constant messaging, but surrounded by English conversations. Technically French is a very small portion of our total messages, maybe that affects things too.

hopperelec commented 11 months ago

Oh, it looks like a new interval is only opened if an existing interval isn't already open (source), and an interval is only closed at the end of an input file (source) so, in most cases, an interval will span the entirety of a channel's existence. So, it's not that your conversations are too short, it's that intervals are too long.

Perhaps chat-analytics should detect a sudden change in language and group those messages separately (at least in the context of the languages tab)? So, if the language identifier is very confident about like 10 adjacent messages being in one language but then the 10 following messages are confidently identified as being in another language, they should be marked for that 3% accuracy separately. Or maybe it could be based on the pre-existing conversation-detecting code used for the interaction tab instead of entire message groups?
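A minimal sketch of the "detect a sudden change in language" idea, to make the suggestion concrete. Everything here is hypothetical: the `Prediction` and `Run` types, the `splitByLanguage` name and the `minConfidence` parameter are illustrative and are not part of chat-analytics. It assumes one language prediction per message is already available.

```typescript
// Hypothetical shape: one prediction per message (not the project's real types).
interface Prediction {
    lang: string;       // predicted language code
    confidence: number; // model confidence in [0, 1]
}

interface Run {
    lang: string;
    start: number; // index of the first message in the run
    end: number;   // index of the last message in the run (inclusive)
}

// Split a sequence of predictions into runs of adjacent messages.
// A new run starts only when a message is confidently predicted to be in a
// different language; low-confidence messages stay in the current run.
function splitByLanguage(preds: Prediction[], minConfidence: number): Run[] {
    const runs: Run[] = [];
    let current: Run | null = null;
    for (let i = 0; i < preds.length; i++) {
        const p = preds[i];
        if (current === null) {
            // The first message always opens a run, whatever its confidence.
            current = { lang: p.lang, start: i, end: i };
        } else if (p.confidence < minConfidence || p.lang === current.lang) {
            current.end = i;
        } else {
            runs.push(current);
            current = { lang: p.lang, start: i, end: i };
        }
    }
    if (current !== null) runs.push(current);
    return runs;
}
```

Each run could then be checked against the reliability threshold on its own instead of the whole interval. A real version would probably also require a minimum run length (like the 10 messages mentioned above) before switching.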

Cedric-Boucher commented 11 months ago

so, if I understood what you're saying correctly:

Is this correct? I would've thought that language detection would be on a per-message basis (based on some amount of common sense and the fact that there is an analytic for "number of messages where language detection was unreliable (because message was too short)"). I would also have thought that it would be able to detect multiple languages in a chat, based on the existence of the analytic "number of languages used".

I think your idea of grouping messages seems good. One note though, I have had a few conversations that were mostly English but with some French messages in between, or maybe even single messages that used both languages (frenglish basically). I understand that word-for-word language detection doesn't make sense (many word spellings exist in multiple languages), but I would think that searching for the language in one entire message would be a good idea (assuming processing time is reasonable)

hopperelec commented 11 months ago

I believe that's pretty much how it works, yes. Except that, it's not based on chats, it's based on files, and I believe a chat can be uploaded via multiple files if the chat is too large. The "number of languages used" would still make a bit of sense with this behaviour, especially when you consider you can also upload multiple chats at once. Although I do agree it should be able to pick up multiple languages from a single chat, and I believe mlomb is bilingual so I would have thought this is something they would pick up on if this was a mistake. I'll keep looking for a bit just to make sure I'm not misunderstanding it!

I'm not sure I agree about language detection being per-message, though. I don't think this is an issue of processing time (although it could take a bit longer and make the report a bit larger), but of accuracy. A single message usually doesn't provide enough information to accurately predict the language. Although perhaps it could be based on a threshold: if the message is long enough then it can be contained in its own language group, otherwise it loosens its definition of a conversation (up to a limit) until it has enough data to work with. I can definitely imagine this being quite complex to code, though.

Cedric-Boucher commented 11 months ago

interesting that it works like that. I'll try exporting the chat in multiple small files and see if that changes anything.

currently trying exporting chat in separate files for each message, that's a lot of files!

that's fair. maybe it could use a single message if it's a long message (like a paragraph type of message) and group many short messages together when they're shorter than that threshold.
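A rough sketch of that threshold idea, assuming messages arrive as plain strings in order. The function names and the `minWords` / `maxGroupSize` parameters are made up for illustration and do not exist in chat-analytics.

```typescript
// Count whitespace-separated words in a message.
function countWords(text: string): number {
    const trimmed = text.trim();
    return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Long messages become their own group; short messages are buffered together
// until the buffer holds at least `minWords` words or `maxGroupSize` messages.
function groupForDetection(messages: string[], minWords: number, maxGroupSize: number): string[][] {
    const groups: string[][] = [];
    let buffer: string[] = [];
    let bufferWords = 0;

    const flush = (): void => {
        if (buffer.length > 0) {
            groups.push(buffer);
            buffer = [];
            bufferWords = 0;
        }
    };

    for (const msg of messages) {
        const words = countWords(msg);
        if (words >= minWords) {
            // Long enough on its own: close any pending short-message group first.
            flush();
            groups.push([msg]);
            continue;
        }
        buffer.push(msg);
        bufferWords += words;
        if (bufferWords >= minWords || buffer.length >= maxGroupSize) flush();
    }
    flush(); // leftover short messages form a final (possibly small) group
    return groups;
}
```

The cap on group size is the "up to a limit" part: without it, a long stretch of one-word replies would keep growing a single group.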

mlomb commented 11 months ago

I'll read this more carefully another day, but I want to clarify that messages are grouped per file and per author, in case authors alternate between languages.

https://github.com/mlomb/chat-analytics/blob/484d32ec4f3065306b33c63a16b1da2dc17197c5/pipeline/process/ChannelMessages.ts#L156-L170

For example:

A: hello
A: world
A: !
B: hola
B: mundo
A: yay!

["hello", "world", "!"] → 3 x english
["hola", "mundo"] → 2 x spanish
["yay!"] → 1 x english
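A rough TypeScript sketch of the grouping illustrated above. This is not the code at the linked lines; the `Message` type and the `groupByConsecutiveAuthor` name are made up for illustration, and it only shows the per-author part (grouping per file is left out).

```typescript
// Hypothetical message shape, for illustration only.
interface Message {
    author: string;
    text: string;
}

// Put consecutive messages from the same author into one group.
// A message from a different author starts a new group.
function groupByConsecutiveAuthor(messages: Message[]): Message[][] {
    const groups: Message[][] = [];
    for (const msg of messages) {
        const last = groups[groups.length - 1];
        if (last !== undefined && last[0].author === msg.author) {
            last.push(msg);
        } else {
            groups.push([msg]);
        }
    }
    return groups;
}
```

Applied to the six messages in the example, this gives three groups of sizes 3, 2 and 1, and each group would then get a single language label that counts once per message in it.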

Cedric-Boucher commented 11 months ago

oh, I see, interesting. Thanks for the info!

Cedric-Boucher commented 11 months ago

I tried exporting the chat history in groups of 10 messages per file. This way it should, in theory, definitely be able to see that a whole message group was in French. The results in the analytics were the same as before: no French detected. Odd.

mlomb commented 11 months ago

Can you give me a group of 10 messages so I can check?

Cedric-Boucher commented 11 months ago

sure, however I noticed just now that when processing only files that contain a significant amount of French messages, chatanalytics DOES detect French, but when I give it all of the nearly 5000 files, it doesn't detect any French at all? Weird. Giving it two files, one fully English and one fully French, it does detect both languages.

two files

all files, including the two files from above

Direct Messages - REDACTED [REDACTED] [part 4726].json

mlomb commented 11 months ago

In the code below,

https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/nlp/FastTextModel.ts#L79-L87

Before the return, I added console.log(code, "|", line); and get:

fr | tu pars bientôt
fr | bientôt oui nous avons regardé la voiture ça va it goes
en | ah
fr | la voiture ça va spécifiquement
fr | est-ce que tu sais c'est quoi qui fait le bruit
fr | non tout est normal pour nous peut-etre un bushing mais maintenant je suis plus rassuré pour le voyage
en | ok

Which seems ok


Can you test that line with all the files and look for something strange?

Cedric-Boucher commented 6 months ago

Sorry for the extremely long delay in getting back to you on this.

I have finally done this. I don't see anything strange (other than the rare short messages that get incorrect/random languages).

Here's a snippet.

image

Cedric-Boucher commented 6 months ago

I have also tried adding a console.log before the return here https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/MessageProcessor.ts#L50-L57 to see whether it was the message groups that were getting the wrong language. But I see index 41 (English) and 48 (French) in the correct places, and most groups are 1 message long anyway.

Cedric-Boucher commented 6 months ago

I see a whole bunch of language indices show up in langCounts (database builder) once all the messages have been processed. The highest number for this specific DM is for English (121792) and the second highest number is for French (1528), which are the only two languages we have used. The highest number for a false positive language is 868 for language index 34.

Cedric-Boucher commented 6 months ago

The count for French messages is less than 3% of the count for English messages (and therefore also less than 3% of the total message count). So French does not show up as a language in our chat, even though it has been used a lot, because it falls under the 3% cutoff.
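To make the arithmetic concrete, here is a small sketch using only the three counts quoted above (English 121792, French 1528, language index 34 at 868). The `languagesAboveCutoff` function is hypothetical, and dividing by the total count is an assumption about how the cutoff is applied, not something taken from the project's code. The real total is larger because other language indices also have counts, which only makes French's share smaller.

```typescript
// Return the languages whose share of all counted messages reaches the cutoff.
// Assumption: the cutoff is compared against count / total.
function languagesAboveCutoff(counts: Record<string, number>, cutoff: number): string[] {
    const langs = Object.keys(counts);
    let total = 0;
    for (const lang of langs) total += counts[lang];
    return langs.filter((lang) => total > 0 && counts[lang] / total >= cutoff);
}

// Counts quoted in this thread; "index34" stands for the false-positive language.
const quotedCounts: Record<string, number> = {
    english: 121792,
    french: 1528,
    index34: 868,
};

// French is about 1.2% of these messages, so a 3% cutoff hides it.
const shownAt3Percent = languagesAboveCutoff(quotedCounts, 0.03);

// Lowering the cutoff to 0.1% brings French back, but also the false positive.
const shownAtTenthPercent = languagesAboveCutoff(quotedCounts, 0.001);
```

With these numbers, only English passes at 3%, while all three entries pass at 0.1%, which is why lowering the cutoff alone is not enough.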

I'm not sure how we would resolve this. Maybe no change is necessary and I just accept that French isn't detected for my chat?

Cedric-Boucher commented 6 months ago

I wish we could reduce that 3% cutoff, but it thinks we have sent over 800 messages in German (neither of us knows German) and only 3 messages in Bosnian (there have actually been a few hundred messages sent in Bosnian), so the language detection model doesn't seem accurate enough for this.

Cedric-Boucher commented 6 months ago

If I set the 3% cutoff to 0.1% instead, I get the following:

image

Note that we have only used English, French, Bosnian (very little) and Japanese (very little) in our conversation.

Cedric-Boucher commented 6 months ago

I have a suggestion: to reduce the noise in the language detection, what if we skip language detection on messages / message groups that contain only 1 or 2 words? These are usually the ones that get wildly wrong languages like German. Then the 3% cutoff could perhaps be lowered.

Cedric-Boucher commented 6 months ago

image

lol why does it think this is German?

Cedric-Boucher commented 6 months ago

I'm noticing that many messages that have "im" (uncapitalized I, no apostrophe), are incorrectly classified as German with a very high confidence by the language model...

Cedric-Boucher commented 6 months ago

I have gotten the results for my ~130K message conversation data to this point, which is very accurate now!

image

I did this by setting the minimum language model confidence level to 70% (from 0%), the minimum word count for language detection to 6 words (from 0), and the language threshold (for displaying it at all) to 0.1% (from 3%).
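Roughly, the three settings combine like the sketch below. This is an illustration only, not the code from the pull request: the `Detection` and `Filters` types and the `reliableLanguages` function are hypothetical, and treating the word and confidence minimums as inclusive is an assumption.

```typescript
// Hypothetical shape: one detection result per message or message group.
interface Detection {
    text: string;
    lang: string;
    confidence: number; // model confidence in [0, 1]
}

interface Filters {
    minConfidence: number;    // e.g. 0.7
    minWords: number;         // e.g. 6
    displayThreshold: number; // e.g. 0.001 (0.1%)
}

function wordCount(text: string): number {
    const trimmed = text.trim();
    return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Count only detections that are confident and long enough, then keep the
// languages whose share of those counted detections reaches the threshold.
function reliableLanguages(detections: Detection[], f: Filters): string[] {
    const counts: Record<string, number> = {};
    let total = 0;
    for (const d of detections) {
        if (d.confidence < f.minConfidence) continue; // model not sure enough
        if (wordCount(d.text) < f.minWords) continue;  // too short to trust
        counts[d.lang] = (counts[d.lang] || 0) + 1;
        total += 1;
    }
    return Object.keys(counts).filter((lang) => counts[lang] / total >= f.displayThreshold);
}
```

The point of filtering before counting is that short, confidently wrong results (such as "im" being read as German) never reach the totals, so the display threshold can be much lower without letting noise through.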

Cedric-Boucher commented 6 months ago

I created a pull request #110