langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.38k stars 14.77k forks

WhatsAppChatLoader fails to load 24 hours time format chats #2457

Closed itortouch closed 1 year ago

itortouch commented 1 year ago

During an experiment I tried to load some personal WhatsApp conversations into a vectorstore, but loading failed. Below is an example dataset and code in which half of the lines are parsed and half are dropped:

Dataset (whatsapp_chat.txt):

19/10/16, 13:24 - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 - Aitor Mira: Si
19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 PM - Aitor Mira: Si

Code:

from langchain.document_loaders import WhatsAppChatLoader
loader = WhatsAppChatLoader("../data/whatsapp_chat.txt")
docs = loader.load()

Returns:

[Document(page_content='Aitor Mira on 19/10/16, 13:24 PM: Buenas Andrea!\n\nAitor Mira on 19/10/16, 13:24 PM: Si\n\n', metadata={'source': '..\\data\\whatsapp_chat.txt'})]

Because of a bug in the regex match pattern, any line without AM or PM after the hour:minutes is not matched. As a result, the first two lines of whatsapp_chat.txt are ignored and only the last two are matched.

Here is the buggy regex: r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"

Here is the fixed regex, which parses both 12-hour and 24-hour time formats: r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
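For anyone who wants to check the difference without installing langchain, here is a minimal sketch that runs both patterns over the four sample lines using only the standard `re` module. The `parse` helper and the constant names are hypothetical and only for illustration; the sketch assumes each line is matched from its start and does not reproduce the loader's actual code.

```python
import re

# The two patterns quoted in this issue, copied verbatim.
BUGGY = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"
)
FIXED = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
)

# The four lines from whatsapp_chat.txt above.
SAMPLE = [
    "19/10/16, 13:24 - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 - Aitor Mira: Si",
    "19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 PM - Aitor Mira: Si",
]


def parse(pattern, lines):
    """Hypothetical helper: return (date, sender, text) for each matching line."""
    rows = []
    for line in lines:
        match = pattern.match(line)
        if match:
            rows.append(match.groups())
    return rows


buggy_rows = parse(BUGGY, SAMPLE)
fixed_rows = parse(FIXED, SAMPLE)

print(f"buggy pattern matched {len(buggy_rows)} of {len(SAMPLE)} lines")
print(f"fixed pattern matched {len(fixed_rows)} of {len(SAMPLE)} lines")
```

With the sample data the buggy pattern should match only the two lines carrying PM, while the fixed pattern matches all four, because the AM/PM suffix (including its leading space) is now an optional group.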

itortouch commented 1 year ago

PR solution ready:

https://github.com/hwchase17/langchain/pull/2458