joweich / chat-miner

Parsers and visualizations for chats
MIT License
566 stars 57 forks

Fixed incorrect parsing for WhatsApp messages starting with the Unicode left-to-right mark (U+200E) #42

Closed joweich closed 1 year ago

joweich commented 1 year ago

For some reason, lines containing media files start with the Unicode left-to-right mark (U+200E) in WhatsApp chat exports. Before this fix, those lines were not recognized as new messages and were simply appended to the previous message. Because the appended lines still contained their author prefix, this also caused the wordcloud viz to show the authors' names prominently. This fix detects U+200E characters, strips them, and reconstructs the message accordingly.
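
To illustrate the idea, here is a minimal sketch, not the actual code from this PR. The `HEADER` pattern and the `split_messages` helper are hypothetical and assume one particular export style (`dd.mm.yy, hh:mm - Author: text`); real exports vary by locale and platform.

```python
import re

LRM = "\u200e"  # Unicode LEFT-TO-RIGHT MARK

# Hypothetical header pattern for one export style:
# "dd.mm.yy, hh:mm - Author: text"
HEADER = re.compile(
    r"^(\d{1,2}\.\d{1,2}\.\d{2,4}, \d{1,2}:\d{2}) - ([^:]+): (.*)$"
)


def split_messages(lines):
    """Group raw export lines into (timestamp, author, text) tuples."""
    messages = []
    for raw in lines:
        # Strip U+200E first; otherwise a leading mark keeps the header
        # pattern from matching and the line looks like a continuation.
        line = raw.replace(LRM, "").rstrip("\n")
        match = HEADER.match(line)
        if match:
            messages.append(list(match.groups()))
        elif messages:
            # No header found: treat the line as part of the previous message.
            messages[-1][2] += "\n" + line
    return [tuple(m) for m in messages]


lines = [
    "12.03.21, 10:15 - Alice: Hello",
    "and more",
    "\u200e12.03.21, 10:16 - Bob: \u200eIMG-0001.jpg (file attached)",
]
result = split_messages(lines)
```

Without the `replace(LRM, "")` call, the third line would fail the header match and be appended to Alice's message, author prefix included, which is the behavior described above.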