tattle-made / whatsapp-scraper


Ingest whatsapp data from google drive #2

Closed dennyabrain closed 3 years ago

dennyabrain commented 4 years ago

WhatsApp lets you export a group's chat history via its 'Export chat' feature.

[Image: Screen Shot 2020-06-18 at 11 49 35 AM]

This feature lets you back up 40,000 messages in total (10,000 if you choose to include media such as images and videos). One of the options it provides is to export this data dump to your Google Drive.

The scope of this task is to create a scraper that can fetch this data from a Google Drive and parse it into messages. @tarunima built a proof of concept of this in Python, which you can find here: https://github.com/tattle-made/whatsapp-scraper/blob/master/examples/googleDrive_load.py
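To make the "parse it into messages" step concrete, here is a minimal sketch of a parser for an exported chat text file. It is not the code from the linked proof of concept. The line format it assumes (`DD/MM/YYYY, HH:MM - Sender: text`) is an assumption, since the export format varies by platform and locale, and the names `Message` and `parse_chat` are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List

# Assumed prefix of every new entry: "18/06/2020, 11:49 - <rest>".
# Real exports differ by platform and locale, so adjust this pattern.
PREFIX_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - (.*)$")


@dataclass
class Message:
    timestamp: datetime
    sender: str
    text: str


def parse_chat(lines: Iterable[str]) -> List[Message]:
    """Turn the lines of an exported chat into Message objects."""
    messages: List[Message] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = PREFIX_RE.match(line)
        if match is None:
            # No timestamp prefix: treat as a continuation of the last message.
            if messages:
                messages[-1].text += "\n" + line
            continue
        date_part, time_part, rest = match.groups()
        sender, sep, text = rest.partition(": ")
        if not sep:
            # Timestamped line without "Sender: " is a system notice; skip it.
            continue
        timestamp = datetime.strptime(f"{date_part} {time_part}", "%d/%m/%Y %H:%M")
        messages.append(Message(timestamp, sender, text))
    return messages
```

Usage would look like `parse_chat(open(path, encoding="utf-8"))`, where `path` points at the exported text file once it has been downloaded from the drive.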

dennyabrain commented 4 years ago

@tarunima has detailed the complete requirements here. Two problems are slightly open-ended and will require some creative (and maybe not perfect) solutions:

  1. Assigning a unique id to a group. The folder created by WhatsApp is named "Whatsapp Chat with ", so it's possible that, as the number of groups and the number of people using this feature increase, there will be conflicting group names. We need to find an acceptable solution to this (preferably one not involving human effort).
  2. Avoiding storing duplicate messages.
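
One possible direction for the first problem, sketched under stated assumptions: derive the group id from the export folder name combined with some stable identifier of the account that uploaded it. The function `group_id` and its `owner_key` parameter are hypothetical, and whether such an identifier is available to the scraper is an assumption. The sketch only avoids collisions between different uploaders; two people exporting the same group would still get different ids.

```python
import hashlib
import unicodedata


def group_id(owner_key: str, folder_name: str) -> str:
    """Derive a short, deterministic id from (uploader, export folder name)."""
    # Normalize Unicode and surrounding whitespace so trivially different
    # spellings of the same folder name map to the same id.
    normalized = unicodedata.normalize("NFKC", folder_name).strip()
    # Join with a separator character that should not occur in either part.
    payload = f"{owner_key}\x1f{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Because the id is a pure function of its inputs, re-running the scraper over the same folder yields the same id without any human involvement.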

tarunima commented 4 years ago

Re. 2: "storing" is the operative word. Ideally we should remove redundancy at the data collection stage itself, as opposed to deleting redundant posts after collecting.

surajsharma commented 4 years ago

@tarunima Yes, of course. I don't see any need to process redundant messages any more than necessary; they will get weeded out as early in the pipeline as possible.
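
A minimal sketch of what weeding out duplicates at collection time could look like, assuming each parsed message is reduced to a `(group_id, timestamp, sender, text)` tuple of strings. The names `Record`, `fingerprint`, and `drop_duplicates` are hypothetical, and the in-memory `seen` set stands in for whatever persistent store the pipeline would use.

```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple

# Assumed record shape: (group_id, ISO timestamp, sender, text).
Record = Tuple[str, str, str, str]


def fingerprint(record: Record) -> str:
    """Hash the fields that identify a message."""
    joined = "\x1f".join(record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def drop_duplicates(records: Iterable[Record], seen: Set[str]) -> Iterator[Record]:
    """Yield only records whose fingerprint has not been seen before."""
    for record in records:
        key = fingerprint(record)
        if key in seen:
            continue  # already collected, never reaches storage
        seen.add(key)
        yield record
```

One caveat with this choice of key: if export timestamps only have minute resolution, two identical messages sent by the same person within the same minute would collapse into one.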