langchain-ai / langchain

WhatsAppChatLoader doesn't work on chats exported from WhatsApp #4153

Closed ruravi closed 1 year ago

ruravi commented 1 year ago

System Info

langchain 0.0.158, macOS (M1), Python 3.11

Who can help?

@ey

Reproduction

  1. Use the 'Export Chat' feature in WhatsApp.
  2. Observe that the exported .txt file uses this format:
    [11/8/21, 9:41:32 AM] User name: Message text

The regular expression used by WhatsAppChatLoader fails to parse this format.

Expected behavior

Parsing should succeed for this export format. Currently it fails.

hp0404 commented 1 year ago

It also doesn't work with the Ukrainian date format, e.g.:

[05.05.23, 15:45:46] User: text

I used the following input formats:

[05.05.23, 15:48:11] James: Hi here
[11/8/21, 9:41:32 AM] User name: Message 123
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes

Here is a new regex that seems to work with all of these (it has to be compiled with re.VERBOSE because of the inline comments):

message_line_regex = r"""
    \[?                      # Optional opening square bracket
    (                        # Start of group 1
        \d{1,2}              # Match 1-2 digits for the day (or the month in month-first exports)
        [\/.]                # Match a forward slash or period as the date separator
        \d{1,2}              # Match 1-2 digits for the month (or the day in month-first exports)
        [\/.]                # Match a forward slash or period as the date separator
        \d{2,4}              # Match 2-4 digits for the year
        ,\s                  # Match a comma and a space
        \d{1,2}              # Match 1-2 digits for the hour
        :\d{2}               # Match 2 digits for the minutes
        (?:                  # Optional group for seconds
            :\d{2}           # Match 2 digits for the seconds
        )?                   # Make seconds group optional
        (?:[ _](?:AM|PM))?   # Optional space or underscore and AM/PM suffix for 12-hour format
    )                        # End of group 1
    \]?                      # Optional closing square bracket
    [\s-]*                   # Match any number of spaces or hyphens
    ([\w\s]+)                # Match and capture one or more word characters or spaces as group 2 (the sender)
    [:]+                     # Match one or more colons
    \s                       # Match a single space
    (.+)                     # Match and capture one or more of any character as group 3 (the message content)
"""
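
For anyone who wants to check the pattern quickly, here is a minimal sketch that runs it against the four sample lines above. It is not the loader's actual code: `parse_line` is a hypothetical helper written just for this check, and the pattern is a condensed copy of the regex in this comment. It assumes the pattern is compiled with `re.VERBOSE`, which the inline comments and whitespace require.

```python
import re

# Condensed copy of the pattern proposed above. re.VERBOSE is required
# because the pattern contains layout whitespace and comments.
MESSAGE_LINE = re.compile(
    r"""
    \[?                                   # optional opening bracket
    (                                     # group 1: timestamp
        \d{1,2}[\/.]\d{1,2}[\/.]\d{2,4}   # date with / or . separators
        ,\s
        \d{1,2}:\d{2}(?::\d{2})?          # time, seconds optional
        (?:[ _](?:AM|PM))?                # optional AM/PM suffix
    )
    \]?                                   # optional closing bracket
    [\s-]*                                # spaces or hyphens before the sender
    ([\w\s]+)                             # group 2: sender
    [:]+
    \s
    (.+)                                  # group 3: message text
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Hypothetical helper: return (timestamp, sender, text) or None."""
    match = MESSAGE_LINE.match(line.strip())
    if match is None:
        return None
    timestamp, sender, text = match.groups()
    return timestamp, sender.strip(), text


samples = [
    "[05.05.23, 15:48:11] James: Hi here",
    "[11/8/21, 9:41:32 AM] User name: Message 123",
    "1/23/23, 3:19 AM - User 2: Bye!",
    "1/23/23, 3:22_AM - User 1: And let me know if anything changes",
]

for sample in samples:
    print(parse_line(sample))
```

Each sample should come back as a three-item tuple, and a line that does not start with a timestamp returns `None`. That last case matters for multi-line messages, whose continuation lines carry no timestamp or sender.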

I can make a PR, but should I test any other formats first?