WhatsappChatLoader doesn't ignore deleted messages and omitted media

System Info

langchain git+https://github.com/hwchase17/langchain@8392ca602c03d3ae660d05981154f17ee0ad438e Archcraft x86_64 Python 3.11.3

Who can help?

@eyurtsev @dev2049

Information

[X] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Export a WhatsApp chat from a conversation that contains media and deleted messages.
The exported file keeps placeholder lines for deleted messages and for media omitted during the export. For example: "6/29/23, 12:16 am - User 4: This message was deleted" and "4/20/23, 9:42 am - User 3: <Media omitted>".
Currently these placeholder messages are also processed and stored in the index.

Expected behavior

The loader should skip these placeholder messages so that they are not embedded in the index.

Answer generated by a 🤖

Answer

Source: langchain/document_loaders/whatsapp_chat.py

To ignore deleted messages and omitted media, you can add a condition that checks whether the text of the message contains "This message was deleted" or "<Media omitted>". If it does, skip adding it to text_content. Here's how you can modify the load function:

def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.file_path)
    text_content = ""

    with open(p, encoding="utf8") as f:
        lines = f.readlines()

    message_line_regex = r"""
        \[?
        (
            \d{1,4}
            [\/.]
            \d{1,2}
            [\/.]
            \d{1,4}
            ,\s
            \d{1,2}
            :\d{2}
            (?:
                :\d{2}
            )?
            (?:[\s_](?:AM|PM))?
        )
        \]?
        [\s-]*
        ([~\w\s]+)
        [:]+
        \s
        (.+)
    """
    for line in lines:
        result = re.match(
            message_line_regex, line.strip(), flags=re.VERBOSE | re.IGNORECASE
        )
        if result:
            date, sender, text = result.groups()
            if "This message was deleted" not in text and "<Media omitted>" not in text:
                text_content += concatenate_rows(date, sender, text)

    metadata = {"source": str(p)}

    return [Document(page_content=text_content, metadata=metadata)]

With this modification, deleted messages and omitted media are skipped when page_content is built, so they are not passed on to the index.
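
The same filtering idea can be tried outside the loader. The following is a minimal standalone sketch, not the library's implementation: filter_chat_lines and IGNORED_TEXTS are hypothetical names, the row format is only a stand-in for whatever concatenate_rows produces, and the placeholder strings are the ones from the export in this issue (other locales or WhatsApp versions may word them differently). It compares the whole message text against the placeholders instead of using a substring check, so a normal message that merely quotes one of these phrases is kept.

```python
import re
from typing import Iterable, List

# Same pattern as message_line_regex in the answer above, written compactly.
MESSAGE_LINE = re.compile(
    r"""
    \[?
    (
        \d{1,4}[\/.]\d{1,2}[\/.]\d{1,4},\s
        \d{1,2}:\d{2}(?::\d{2})?
        (?:[\s_](?:AM|PM))?
    )
    \]?
    [\s-]*
    ([~\w\s]+)
    [:]+
    \s
    (.+)
    """,
    flags=re.VERBOSE | re.IGNORECASE,
)

# Assumption: these are the exact placeholder texts WhatsApp wrote into the
# export described in this issue; adjust the set for other languages.
IGNORED_TEXTS = frozenset({"This message was deleted", "<Media omitted>"})


def filter_chat_lines(lines: Iterable[str]) -> List[str]:
    """Return one row per real message, skipping placeholder messages."""
    rows = []
    for line in lines:
        match = MESSAGE_LINE.match(line.strip())
        if not match:
            continue  # not a message line (e.g. a continuation line)
        date, sender, text = match.groups()
        if text.strip() in IGNORED_TEXTS:
            continue  # deleted message or omitted media
        # Hypothetical row format, standing in for concatenate_rows.
        rows.append(f"{sender} on {date}: {text}")
    return rows


if __name__ == "__main__":
    sample = [
        "6/29/23, 12:16 am - User 4: This message was deleted",
        "4/20/23, 9:42 am - User 3: <Media omitted>",
        "4/20/23, 9:45 am - User 3: See you at 10:30",
    ]
    for row in filter_chat_lines(sample):
        print(row)
```

Keeping the placeholder texts in a set makes it easy to extend the filter for exports in other languages without touching the parsing logic.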
