tattle-made / whatsapp-scraper


Ensure that whatsapp messages from a group are stored in the database without duplication #24

Closed dennyabrain closed 3 years ago

dennyabrain commented 4 years ago

Description

We intend to join WhatsApp groups, back them up periodically, and upload the backups to Google Drive. Consecutive backups of the same group overlap, so each time we run the scraper, only messages that are not already stored for that group should be written to MongoDB.

Assumptions

We can assume that a group's name is unique. Contributors who submit their WhatsApp group backups, and Tattle team members, can ensure this. The group name can therefore be used as the identifier when storing messages in the database.

Proposed Solution - Timestamp based approach

For the WhatsApp group whose scraped messages are being added, fetch the timestamp of the last stored message (say X) and store only those newly scraped messages whose timestamp is later than X.
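
The timestamp-based approach could be sketched roughly as below. This is a minimal illustration, not the scraper's actual code: the `Message` fields, the in-memory `Store` dict standing in for the MongoDB collection, and both function names are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List, Optional, TypedDict


class Message(TypedDict):
    # Hypothetical message shape; the real schema may differ.
    group: str
    sender: str
    text: str
    timestamp: datetime


# Hypothetical in-memory stand-in for the MongoDB collection,
# keyed by the (assumed unique) group name.
Store = Dict[str, List[Message]]


def last_stored_timestamp(store: Store, group: str) -> Optional[datetime]:
    """Return the timestamp X of the latest stored message, or None for a new group."""
    stored = store.get(group, [])
    return max((m["timestamp"] for m in stored), default=None)


def store_new_messages(store: Store, group: str, scraped: List[Message]) -> int:
    """Store only scraped messages strictly later than X; return how many were added."""
    x = last_stored_timestamp(store, group)
    new = [m for m in scraped if x is None or m["timestamp"] > x]
    # Keep stored messages in chronological order.
    store.setdefault(group, []).extend(sorted(new, key=lambda m: m["timestamp"]))
    return len(new)
```

One caveat with the strict `>` comparison: if the backup's timestamps are coarse, a message that shares timestamp X but was not in the previous backup would be skipped. Comparing with `>=` and then checking sender and text against the messages already stored at X would be one way around that.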

dennyabrain commented 4 years ago

@tarunima review this please

surajsharma commented 4 years ago

i believe we can close this? unless we want to check each uploaded message for duplicates in the system

dennyabrain commented 4 years ago

@surajsharma just to clarify, can you link me to files where this deduplication is being handled?

tarunima commented 3 years ago

not relevant with new scraper in place.