Cedric-Boucher opened 11 months ago

I had multiple entire conversations with a friend entirely in French, yet the analytics for our message history show only English on the recognized-languages page. I'm curious: how does the language detection work? Maybe it struggles to differentiate between French and English? It's really odd, though: a fully French conversation consists mostly of non-English French words, including many accented words (which don't appear in any English word), so I thought it wouldn't be very difficult for even a simple algorithm to detect the difference.
Language identification is done using Facebook's fastText on groups of messages in the MessageProcessor. The final calculations (including whether the identified languages are reliable enough) are done in LanguageStats, which produces the aggregate data actually included in the report.
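For reference, here's a rough sketch of that flow; identifyLanguage() is a hypothetical stand-in for the fastText model, and the names and shapes here are my assumptions, not the actual chat-analytics API:

```ts
// Hypothetical stand-in for the fastText language-identification model:
// returns a language code and the model's confidence for a block of text.
interface LangPrediction {
    lang: string;       // e.g. "en", "fr"
    confidence: number; // 0..1
}
declare function identifyLanguage(text: string): LangPrediction;

// MessageProcessor-style step: classify a whole group of messages at once,
// because a single short message rarely carries enough signal on its own.
function classifyGroup(messages: string[]): LangPrediction {
    return identifyLanguage(messages.join(" "));
}

// LanguageStats-style step: count messages per detected language, then keep
// only languages above a prevalence cutoff (3% in the version discussed here).
function aggregate(groups: string[][], cutoff = 0.03): Map<string, number> {
    const counts = new Map<string, number>();
    let total = 0;
    for (const group of groups) {
        const { lang } = classifyGroup(group);
        counts.set(lang, (counts.get(lang) ?? 0) + group.length);
        total += group.length;
    }
    for (const [lang, count] of counts) {
        if (count / total < cutoff) counts.delete(lang);
    }
    return counts;
}
```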
The MessageProcessor tries to identify the language based on large groups of messages within a certain interval (and I believe all from a single author) to improve accuracy. I haven't figured out where these "intervals" are defined, but my guess is that while you had entire conversations in French, the intervals included more than one of those conversations, and enough of each interval was in English that French ended up below the 3% threshold. I'll try to figure out exactly how intervals work, though!
Oh, thanks for the info! Maybe our conversations were too short or something. I'd say the longest one was maybe half an hour of constant messaging, but surrounded by English conversations. Technically, French is a very small portion of our total messages; maybe that affects things too.
Oh, it looks like a new interval is only opened if an existing interval isn't already open (source), and an interval is only closed at the end of an input file (source), so in most cases an interval will span the entirety of a channel's existence. So it's not that your conversations are too short; it's that the intervals are too long.
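To make that concrete, here is a simplified sketch of my reading of the open/close logic (the class and method names are made up, not the real code):

```ts
// Simplified model of the interval behaviour described above: an interval is
// opened only when none is open, and closed only when an input file ends,
// so a single interval usually ends up covering the whole channel.
interface Interval {
    start: number; // timestamp of the first message in the interval
    end?: number;  // set when the interval is closed
}

class IntervalTracker {
    private open?: Interval;
    readonly closed: Interval[] = [];

    onMessage(timestamp: number): void {
        // A new interval is only opened if no interval is currently open.
        if (!this.open) this.open = { start: timestamp };
    }

    onEndOfFile(timestamp: number): void {
        // Intervals are only closed at the end of an input file.
        if (this.open) {
            this.closed.push({ ...this.open, end: timestamp });
            this.open = undefined;
        }
    }
}
```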
Perhaps chat-analytics should detect a sudden change in language and group those messages separately (at least in the context of the languages tab)? So, if the language identifier is very confident that, say, 10 adjacent messages are in one language but the 10 following messages are confidently identified as being in another language, they should be counted separately toward that 3% cutoff. Or maybe it could be based on the pre-existing conversation-detection code used for the interaction tab, instead of on entire message groups?
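A very rough sketch of that idea (hypothetical; it works per message rather than per block of 10, and it doesn't reuse the existing conversation-detection code):

```ts
// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Start a new group whenever the model is confident the language has changed.
function splitByLanguage(messages: string[], minConfidence = 0.9): string[][] {
    const groups: string[][] = [];
    let currentLang: string | undefined;
    for (const message of messages) {
        const { lang, confidence } = identifyLanguage(message);
        const confidentSwitch = confidence >= minConfidence && lang !== currentLang;
        if (groups.length === 0 || confidentSwitch) {
            groups.push([]);
            if (confidence >= minConfidence) currentLang = lang;
        }
        groups[groups.length - 1].push(message);
    }
    return groups;
}
```

Each resulting group would then be classified and counted on its own, so a long French stretch wouldn't be diluted by the surrounding English.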
so, if I understood what you're saying correctly:
Is this correct? I would've thought that language detection would be on a per-message basis, based on some amount of common sense and the fact that there is an analytic for "number of messages where language detection was unreliable (because the message was too short)". I would also have thought that it could detect multiple languages in a chat, given the existence of the "number of languages used" analytic.
I think your idea of grouping messages seems good. One note, though: I have had a few conversations that were mostly English but with some French messages in between, or even single messages that used both languages (Frenglish, basically). I understand that word-by-word language detection doesn't make sense (many spellings exist in multiple languages), but I would think that detecting the language of each entire message would be a good idea (assuming processing time is reasonable).
I believe that's pretty much how it works, yes. Except that it's not based on chats; it's based on files, and I believe a chat can be uploaded via multiple files if it is too large. The "number of languages used" analytic would still make some sense with this behaviour, especially when you consider that you can also upload multiple chats at once. Although I do agree it should be able to pick up multiple languages from a single chat, and I believe mlomb is bilingual, so I would have thought this is something they would have picked up on if it were a mistake. I'll keep looking for a bit just to make sure I'm not misunderstanding it!
I'm not sure I agree about language detection being per-message, though. I don't think this is an issue of processing time (although it could take a bit longer and make the report a bit larger), but rather of accuracy. A single message usually doesn't provide enough information to accurately predict the language. Although perhaps it could be based on a threshold: if the message is long enough, it can form its own language group; otherwise, the detector loosens its definition of a conversation (up to a limit) until it has enough data to work with. I can definitely imagine this being quite complex to code, though.
Interesting that it works like that. I'll try exporting the chat in multiple small files and see if that changes anything.
Currently trying to export the chat with a separate file for each message; that's a lot of files!
That's fair. Maybe it could use a single message when it's long (a paragraph-type message) and group many short messages together when they're below that threshold.
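Something like this, perhaps (a minimal sketch of the word-count threshold idea; the 20-word threshold and function name are just placeholders):

```ts
// Build detection units: a message long enough on its own becomes its own
// unit, while shorter messages are pooled until the pool reaches the threshold.
function buildDetectionUnits(messages: string[], minWords = 20): string[][] {
    const units: string[][] = [];
    let pool: string[] = [];
    let poolWords = 0;

    const wordCount = (text: string) => text.split(/\s+/).filter(Boolean).length;

    for (const message of messages) {
        const words = wordCount(message);
        if (words >= minWords) {
            units.push([message]); // long enough to classify on its own
        } else {
            pool.push(message);
            poolWords += words;
            if (poolWords >= minWords) {
                units.push(pool);
                pool = [];
                poolWords = 0;
            }
        }
    }
    if (pool.length > 0) units.push(pool); // leftover short messages
    return units;
}
```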
I will read this with more time another day, but I want to clarify that messages are grouped per file and per author, in case different authors intercalate languages.
i.e.:
A: hello
A: world
A: !
B: hola
B: mundo
A: yay!
["hello", "world", "!"] → 3 x english
["hola", "mundo"] → 2 x spanish
["yay!"] → 1 x english
oh, I see, interesting. Thanks for the info!
I tried exporting the chat history in groups of 10 messages per file. This way it should definitely be able to see that a whole message group was in French, in theory. The results in the analytics were the same as before: no French detected. Odd.
Can you give me a group of 10 messages so I can check?
Sure. However, I noticed just now that when processing only the files that contain a significant number of French messages, chat-analytics DOES detect French, but when I give it all of the nearly 5000 files, it doesn't detect any French at all. Weird. Giving it two files, one fully English and one fully French, it detects both languages.
In the code below, before the return, I added console.log(code, "|", line); and got:
fr | tu pars bientôt
fr | bientôt oui nous avons regardé la voiture ça va it goes
en | ah
fr | la voiture ça va spécifiquement
fr | est-ce que tu sais c'est quoi qui fait le bruit
fr | non tout est normal pour nous peut-etre un bushing mais maintenant je suis plus rassuré pour le voyage
en | ok
Which seems ok
Can you test that line with all the files and look for something strange?
Sorry for the extremely long delay in getting back to you on this.
I have finally done this. I don't see anything strange (other than the few rare short messages that get incorrect/random languages).
Here's a snippet.
I have also tried adding a console.log before the return here https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/MessageProcessor.ts#L50-L57 to see if it was the message groups that were getting the wrong language. But I see index 41 (English) and 48 (French) in the correct places, and most groups are only 1 message long anyway.
I see a whole bunch of language indices show up in langCounts (database builder) once all the messages have been processed. The highest number for this specific DM is for English (121792) and the second highest number is for French (1528), which are the only two languages we have used. The highest number for a false positive language is 868 for language index 34.
Since the count of French messages is less than 3% of the count of English messages (and therefore also less than 3% of the total message count), French does not show up as a language in our chat even though it has been used a lot: it falls under the 3% cutoff.
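As a quick check with the counts above: 1528 / 121792 ≈ 1.25%, and dividing by the full message total (which also includes the smaller detected languages) only pushes that lower, so French lands well below the 3% cutoff.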
I'm not sure how we would resolve this. Maybe no change is necessary and I should just accept that French isn't detected for my chat?
I wish we could reduce that 3% cutoff, but considering that it thinks we have sent over 800 messages in German (neither of us knows German) and 3 messages in Bosnian (there have actually been a few hundred messages sent in Bosnian), it seems that the language detection model is not accurate enough for that.
If I set the 3% cutoff to 0.1% instead, I get the following:
Note that we have only used English, French, Bosnian (very little), and Japanese (very little) in our conversation.
I have a suggestion: what if, to reduce the noise in the language detection, we skip language detection on messages / message groups that contain only 1 or 2 words? These are usually the messages that get wildly wrong languages like German. Maybe then the 3% cutoff could feasibly be lowered.
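Roughly what I mean, as a sketch (the function names and the minWords value are placeholders):

```ts
// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Count messages per language, but skip units that are too short to classify
// reliably (e.g. one- or two-word messages).
function countLanguages(units: string[][], minWords = 3): Map<string, number> {
    const counts = new Map<string, number>();
    for (const unit of units) {
        const text = unit.join(" ");
        const words = text.split(/\s+/).filter(Boolean).length;
        if (words < minWords) continue; // too short, don't let it add noise
        const { lang } = identifyLanguage(text);
        counts.set(lang, (counts.get(lang) ?? 0) + unit.length);
    }
    return counts;
}
```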
lol why does it think this is German?
I'm noticing that many messages containing "im" (uncapitalized I, no apostrophe) are incorrectly classified as German with very high confidence by the language model...
I have gotten the results for my ~130K-message conversation to this point, and they are very accurate now!
I did this by setting the minimum language model confidence level to 70% (from 0%), setting the minimum word count for language detection to 6 words (from 0), and setting the language threshold (for displaying it at all) to 0.1% (from 3%).
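For clarity, here are the three knobs expressed as a sketch (the constant names are made up, not necessarily what the actual code or the PR uses):

```ts
const MIN_MODEL_CONFIDENCE = 0.7;         // was effectively 0: discard low-confidence predictions
const MIN_WORDS_FOR_DETECTION = 6;        // was 0: skip units shorter than this
const LANGUAGE_DISPLAY_THRESHOLD = 0.001; // was 0.03: show languages above 0.1% of messages

// Hypothetical stand-in for the language model.
declare function identifyLanguage(text: string): { lang: string; confidence: number };

// Returns the language to count for a detection unit, or undefined if the
// unit is too short or the model isn't confident enough.
function languageToCount(unitText: string): string | undefined {
    const words = unitText.split(/\s+/).filter(Boolean).length;
    if (words < MIN_WORDS_FOR_DETECTION) return undefined;
    const { lang, confidence } = identifyLanguage(unitText);
    return confidence >= MIN_MODEL_CONFIDENCE ? lang : undefined;
}
```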
I created pull request #110.