Closed: dennyabrain closed this 3 years ago
@tarunima review this please
I believe we can close this, unless we want to check each uploaded message for duplicates in the system?
@surajsharma just to clarify, can you link me to files where this deduplication is being handled?
not relevant with new scraper in place.
Description
We intend to join WhatsApp groups, back them up periodically, and upload the backups to Google Drive. Each time we run the scraper, we want only the unique messages from a group to be stored in MongoDB.
Assumptions
We can assume that a group's name is unique. Contributors who submit their WhatsApp group backups, together with Tattle team members, can ensure this. The group name can therefore serve as the identifier when storing messages in the database.
Proposed Solution - Timestamp-based approach
For the WhatsApp group that you intend to add scraped messages to, fetch the timestamp of the last stored message (say X) and store only those newly scraped messages whose timestamp is later than X.
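A minimal sketch of the timestamp-based approach, for illustration only. It uses an in-memory dictionary as a stand-in for the MongoDB collection, keyed by group name as assumed above. The names `store`, `last_stored_timestamp`, `store_new_messages`, and the `timestamp`/`text` message fields are hypothetical and do not come from the scraper's code.

```python
from datetime import datetime
from typing import Optional

# Hypothetical in-memory stand-in for the MongoDB collection:
# maps a (unique) group name to its stored messages, oldest first.
store: dict[str, list[dict]] = {}


def last_stored_timestamp(group_name: str) -> Optional[datetime]:
    """Return X, the timestamp of the newest stored message, or None for a new group."""
    messages = store.get(group_name, [])
    return max((m["timestamp"] for m in messages), default=None)


def store_new_messages(group_name: str, scraped: list[dict]) -> int:
    """Store only scraped messages strictly later than X; return how many were added."""
    x = last_stored_timestamp(group_name)
    new = [m for m in scraped if x is None or m["timestamp"] > x]
    new.sort(key=lambda m: m["timestamp"])
    store.setdefault(group_name, []).extend(new)
    return len(new)


# Two overlapping backups of the same (made-up) group.
first_backup = [
    {"timestamp": datetime(2020, 1, 1, 10, 0), "text": "hello"},
    {"timestamp": datetime(2020, 1, 1, 10, 5), "text": "hi"},
]
second_backup = first_backup + [
    {"timestamp": datetime(2020, 1, 1, 10, 9), "text": "new message"},
]
store_new_messages("Example Group", first_backup)   # adds both messages
store_new_messages("Example Group", second_backup)  # adds only the 10:09 message
```

One caveat with the strict `>` comparison: if timestamps are coarse, a message that shares timestamp X with the last stored message but was not yet in the earlier backup would be skipped. Comparing message content for messages at exactly X is one possible way to cover that case.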