langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

WhatsappChatLoader doesn't ignore deleted messages and omitted media #6838

Closed augtheo closed 1 year ago

augtheo commented 1 year ago

System Info

langchain git+https://github.com/hwchase17/langchain@8392ca602c03d3ae660d05981154f17ee0ad438e, Archcraft x86_64, Python 3.11.3

Who can help?

@eyurtsev @dev2049

Reproduction

  1. Export a WhatsApp conversation that contains media and deleted messages.
  2. The exported chat contains placeholder lines for deleted messages and for media omitted during the export. For example: 6/29/23, 12:16 am - User 4: This message was deleted and 4/20/23, 9:42 am - User 3: <Media omitted>.
  3. Currently, the loader also processes these placeholder lines and stores them in the index.

Expected behavior

The loader should skip these placeholder messages so that they are not embedded in the index.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Source: langchain/document_loaders/whatsapp_chat.py

To ignore deleted messages and omitted media, you can add a condition that checks whether the text of the message contains "This message was deleted" or "<Media omitted>". If it does, skip adding it to text_content. Here's how you can modify the load method:

def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.file_path)
    text_content = ""

    with open(p, encoding="utf8") as f:
        lines = f.readlines()

    message_line_regex = r"""
        \[?
        (
            \d{1,4}
            [\/.]
            \d{1,2}
            [\/.]
            \d{1,4}
            ,\s
            \d{1,2}
            :\d{2}
            (?:
                :\d{2}
            )?
            (?:[\s_](?:AM|PM))?
        )
        \]?
        [\s-]*
        ([~\w\s]+)
        [:]+
        \s
        (.+)
    """
    for line in lines:
        result = re.match(
            message_line_regex, line.strip(), flags=re.VERBOSE | re.IGNORECASE
        )
        if result:
            date, sender, text = result.groups()
            # Skip WhatsApp placeholders for deleted messages and omitted media.
            if "This message was deleted" not in text and "<Media omitted>" not in text:
                text_content += concatenate_rows(date, sender, text)

    metadata = {"source": str(p)}

    return [Document(page_content=text_content, metadata=metadata)]

This modification ensures that deleted messages and omitted media are left out of text_content, so they are never stored in the index.
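
The filtering step can also be tried in isolation, without the loader. The following is a minimal sketch under stated assumptions, not the library's implementation: it reuses the regex from the snippet above, while the names MESSAGE_LINE_REGEX, PLACEHOLDERS, and keep_messages are hypothetical helpers introduced here for illustration. The sample lines are modeled on the two examples in the issue, plus one made-up ordinary message.

```python
import re

# Same pattern as in the load method above, compiled once.
MESSAGE_LINE_REGEX = re.compile(
    r"""
    \[?
    (
        \d{1,4}[\/.]\d{1,2}[\/.]\d{1,4}   # date
        ,\s
        \d{1,2}:\d{2}(?::\d{2})?          # time, seconds optional
        (?:[\s_](?:AM|PM))?
    )
    \]?
    [\s-]*
    ([~\w\s]+)                            # sender
    [:]+
    \s
    (.+)                                  # message text
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Placeholder strings named in the issue (hypothetical constant name).
PLACEHOLDERS = ("This message was deleted", "<Media omitted>")


def keep_messages(lines):
    """Return (date, sender, text) tuples for lines that are real messages."""
    kept = []
    for line in lines:
        match = MESSAGE_LINE_REGEX.match(line.strip())
        if not match:
            continue  # not a message line
        date, sender, text = match.groups()
        if any(placeholder in text for placeholder in PLACEHOLDERS):
            continue  # deleted message or omitted media
        kept.append((date, sender, text))
    return kept


sample = [
    "6/29/23, 12:16 am - User 4: This message was deleted",
    "4/20/23, 9:42 am - User 3: <Media omitted>",
    "4/20/23, 9:43 am - User 3: See you at noon",  # made-up ordinary message
]
kept = keep_messages(sample)
```

The substring test mirrors the snippet above, so an ordinary message that happens to quote one of the placeholder phrases would also be dropped. Comparing the stripped text for equality against each placeholder would be a stricter alternative.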

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
