Closed by dennyabrain 3 years ago
@tarunima has detailed the complete requirements here. Two problems that are slightly open-ended and will require some creative (and maybe not perfect) solutions are the following:
Re. 2: "storing" is the operative word. Ideally we should remove redundancy at the data collection stage itself, as opposed to deleting redundant posts after collecting them.
@tarunima yes, of course. I don't see any need to process redundant messages any more than necessary; they will get weeded out as early in the pipeline as possible.
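To make the "weed out early" idea concrete, here is a minimal sketch of dropping repeated messages at collection time. It assumes "redundant" means the same text after whitespace and case normalization, which the thread does not actually define; near-duplicates and media files would need a different approach. The names `fingerprint` and `drop_duplicates` are hypothetical and not part of the existing scraper.

```python
import hashlib
from typing import Iterable, Iterator, Optional, Set


def fingerprint(text: str) -> str:
    """Hash a message after collapsing whitespace and lowercasing it."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(
    texts: Iterable[str], seen: Optional[Set[str]] = None
) -> Iterator[str]:
    """Yield only messages whose fingerprint has not been seen before.

    `seen` can be passed in by the caller so it persists across several
    scraping runs (assumption: it is kept in memory or loaded from storage).
    """
    if seen is None:
        seen = set()
    for text in texts:
        key = fingerprint(text)
        if key in seen:
            continue  # redundant message: skip before any further processing
        seen.add(key)
        yield text


unique = list(drop_duplicates(["Hello  world", "hello world", "Other"]))
```

Because only fingerprints are kept in `seen`, the check can run before a message is stored or passed further down the pipeline.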
WhatsApp lets you export the chat from a group via an 'export chat' feature.
This feature lets you back up 40,000 messages in total (10,000 if you choose to include media such as images and videos). One of the options it provides is to export this data dump to your Google Drive.
The scope of this task is to create a scraper that can fetch this data from Google Drive and parse it into messages. @tarunima built a proof of concept of this in Python, which you can find here: https://github.com/tattle-made/whatsapp-scraper/blob/master/examples/googleDrive_load.py
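For the "parse it into messages" half of the task, a rough sketch of what the parsing step might look like is below. It is not taken from the linked proof of concept. It assumes the exported text file uses lines of the form `12/31/19, 10:15 PM - Alice: Hello`; the actual format varies with platform and locale, so the pattern would need adjusting. `Message` and `parse_export` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed header format: "<date>, <time> - <rest of line>"
HEADER_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}(?:\s?[AP]M)?) - (.*)$"
)


@dataclass
class Message:
    date: str
    time: str
    sender: str
    text: str


def parse_export(lines: Iterable[str]) -> List[Message]:
    """Turn the lines of an exported chat file into Message records."""
    messages: List[Message] = []
    current: Optional[Message] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        header = HEADER_RE.match(line)
        if header:
            date, time, body = header.groups()
            sender, sep, text = body.partition(": ")
            if not sep:
                # No "sender: " part, so treat it as a system notice and skip it
                current = None
                continue
            current = Message(date, time, sender, text)
            messages.append(current)
        elif current is not None:
            # A line without a timestamp continues the previous message
            current.text += "\n" + line
    return messages


sample = [
    "12/31/19, 10:15 PM - Messages to this group are secured.",
    "12/31/19, 10:16 PM - Alice: Hello",
    "second line",
    "1/1/20, 9:00 AM - Bob: Hi: there",
]
parsed = parse_export(sample)
```

Fetching the file from Google Drive is left out here, since it depends on how authentication is set up.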