During an experiment I tried to load some personal WhatsApp conversations into a vectorstore, but loading was failing. Below is an example dataset and code where half of the lines load and half fail:
Dataset (whatsapp_chat.txt):
19/10/16, 13:24 - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 - Aitor Mira: Si
19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 PM - Aitor Mira: Si
Code:
from langchain.document_loaders import WhatsAppChatLoader
loader = WhatsAppChatLoader("../data/whatsapp_chat.txt")
docs = loader.load()
Returns:
[Document(page_content='Aitor Mira on 19/10/16, 13:24 PM: Buenas Andrea!\n\nAitor Mira on 19/10/16, 13:24 PM: Si\n\n', metadata={'source': '..\\data\\whatsapp_chat.txt'})]
What's happening is that, due to a bug in the regex match pattern, any line without AM or PM after the hour:minutes is not matched. As a result, the first two lines of whatsapp_chat.txt are ignored and only the last two are matched.
Here is the buggy regex:
r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"
Here is the fixed regex, which parses both 12-hour and 24-hour time formats:
r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
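As a quick check outside LangChain, the two patterns can be run against the sample lines with plain `re`. This is only a sketch: `SAMPLE_LINES` and `count_matches` are names made up for illustration, and it exercises the regexes alone, not the loader's own parsing code.

```python
import re

# Pattern quoted above as buggy: the " AM"/" PM" suffix is mandatory.
BUGGY = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"
)
# Proposed pattern: the " AM"/" PM" suffix is optional.
FIXED = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
)

# The four lines from whatsapp_chat.txt shown above.
SAMPLE_LINES = [
    "19/10/16, 13:24 - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 - Aitor Mira: Si",
    "19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 PM - Aitor Mira: Si",
]


def count_matches(pattern, lines):
    """Hypothetical helper: count the lines the pattern matches from the start."""
    return sum(1 for line in lines if pattern.match(line.strip()))


buggy_count = count_matches(BUGGY, SAMPLE_LINES)
fixed_count = count_matches(FIXED, SAMPLE_LINES)
print(buggy_count, fixed_count)  # prints: 2 4

# The fixed pattern still captures timestamp, sender and message separately.
first = FIXED.match(SAMPLE_LINES[0]).groups()
print(first)
```

With the buggy pattern only the two lines carrying `PM` match, which lines up with the two messages in the returned `Document`; with the fixed pattern all four lines match.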