twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License
15.75k stars 2.72k forks

[QUESTION] What is the recommended approach for many concurrent scrapers storing in a single database? #663

Closed zoink closed 4 years ago

zoink commented 4 years ago

I have many concurrent nodes/processes (e.g. Docker Swarm/K8s/SLURM) running twint -- how do I store my results in a single database?

Is sqlite good enough for this? Thanks!

pielco11 commented 4 years ago

I don't know if sqlite is the right choice; I use it because it's light and simple to set up

You might want to save your output to CSV or JSON, and after a scraping session, iterate over those files and save them into a larger DB
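The suggestion above could be sketched like this: each node writes its own JSON-lines file (twint supports `-o results.json --json`), and a single post-processing step folds them into one SQLite database. The file pattern, table schema, and field names here are illustrative assumptions, not part of twint itself:

```python
import glob
import json
import sqlite3

def merge_json_into_sqlite(pattern, db_path):
    """Merge per-node JSON-lines scrape files into a single SQLite DB.

    Assumes each line is one tweet object with at least `id`,
    `username`, and `tweet` keys (a subset of twint's JSON output).
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               id INTEGER PRIMARY KEY,  -- tweet ID; primary key dedupes overlap
               username TEXT,
               tweet TEXT
           )"""
    )
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                t = json.loads(line)
                # INSERT OR IGNORE skips tweets scraped by more than one node
                conn.execute(
                    "INSERT OR IGNORE INTO tweets (id, username, tweet) "
                    "VALUES (?, ?, ?)",
                    (t["id"], t["username"], t["tweet"]),
                )
    conn.commit()
    conn.close()

# e.g. merge_json_into_sqlite("scrapes/node-*.json", "tweets.db")
```

Since only one process touches the database, this sidesteps SQLite's weak concurrent-write support while still ending with a single queryable file.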

zoink commented 4 years ago

Thanks for the response -- I think it would be useful to have a tutorial/framework for running many concurrent scrapers at low cost on one of the cloud platforms with K8s, then storing the results somewhere suited to easy querying/analysis (e.g. BigQuery).

Will look into this later.