Closed augtheo closed 1 year ago
Answer generated by a 🤖
Source: langchain/document_loaders/whatsapp_chat.py
To ignore deleted messages and omitted media, you can add a condition that checks whether the text of the message contains "This message was deleted" or "<Media omitted>" and skips such messages, for example in the loader's load method:
def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.file_path)
    text_content = ""

    with open(p, encoding="utf8") as f:
        lines = f.readlines()

    message_line_regex = r"""
        \[?
        (
            \d{1,4}
            [\/.]
            \d{1,2}
            [\/.]
            \d{1,4}
            ,\s
            \d{1,2}
            :\d{2}
            (?:
                :\d{2}
            )?
            (?:[\s_](?:AM|PM))?
        )
        \]?
        [\s-]*
        ([~\w\s]+)
        [:]+
        \s
        (.+)
    """
    for line in lines:
        result = re.match(
            message_line_regex, line.strip(), flags=re.VERBOSE | re.IGNORECASE
        )
        if result:
            date, sender, text = result.groups()
            # Skip placeholders for deleted messages and omitted media.
            if (
                "This message was deleted" not in text
                and "<Media omitted>" not in text
            ):
                text_content += concatenate_rows(date, sender, text)

    metadata = {"source": str(p)}

    return [Document(page_content=text_content, metadata=metadata)]
With this modification, messages whose text contains either placeholder are skipped while loading, so deleted messages and omitted media are not stored in the index.
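To try the filter without installing langchain, here is a minimal standalone sketch of the same idea. It reuses the regex from the snippet above; the names `MESSAGE_LINE_REGEX`, `SKIPPED_TEXTS`, and `filter_chat_lines` are hypothetical helpers and not part of the loader, and the third sample line is a made-up message added only to show that normal messages are kept.

```python
import re

# Same pattern as in the load method above, compiled once for reuse.
MESSAGE_LINE_REGEX = re.compile(
    r"""
    \[?
    (
        \d{1,4}
        [\/.]
        \d{1,2}
        [\/.]
        \d{1,4}
        ,\s
        \d{1,2}
        :\d{2}
        (?:
            :\d{2}
        )?
        (?:[\s_](?:AM|PM))?
    )
    \]?
    [\s-]*
    ([~\w\s]+)
    [:]+
    \s
    (.+)
    """,
    flags=re.VERBOSE | re.IGNORECASE,
)

# Placeholder texts taken from the issue. Assumption: your export uses
# exactly these strings; extend the tuple if it uses other wording.
SKIPPED_TEXTS = ("This message was deleted", "<Media omitted>")


def filter_chat_lines(lines):
    """Return (date, sender, text) tuples, dropping placeholder messages."""
    kept = []
    for line in lines:
        result = MESSAGE_LINE_REGEX.match(line.strip())
        if not result:
            continue  # not recognised as a message line
        date, sender, text = result.groups()
        if any(marker in text for marker in SKIPPED_TEXTS):
            continue  # deleted message or omitted media
        kept.append((date, sender, text))
    return kept


sample = [
    "6/29/23, 12:16 am - User 4: This message was deleted",
    "4/20/23, 9:42 am - User 3: <Media omitted>",
    "4/20/23, 9:43 am - User 3: See you tomorrow",  # made-up example
]
print(filter_chat_lines(sample))
```

Note that this is a substring check, so a genuine message that happens to quote one of the placeholder texts would be dropped as well. If that matters for your data, comparing `text.strip()` for equality with each placeholder may be a safer choice.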
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
System Info
langchain git+https://github.com/hwchase17/langchain@8392ca602c03d3ae660d05981154f17ee0ad438e
Archcraft x86_64
Python 3.11.3
Who can help?
@eyurtsev @dev2049
Information
Related Components
Reproduction
6/29/23, 12:16 am - User 4: This message was deleted
and
4/20/23, 9:42 am - User 3: <Media omitted>
Expected behavior
We can avoid embedding these messages in the index.