notslang / instagram-screen-scrape

scrape public instagram data w/out API access
https://npmjs.com/package/instagram-screen-scrape
GNU General Public License v3.0

Automatic bulk scrape from list of values (eg MediaID's) #21

Open yarnball opened 7 years ago

yarnball commented 7 years ago

Hi,

Great work on this. I've got a .txt file with a list of media IDs whose comments I'd like to scrape.

However, I can only do them one at a time with your script.

I don't know CoffeeScript. Is there a way to do this in bulk with your repo?

I tried with a Bash script; however, the comment scrapes often fail with 404 errors. It often works if I re-attempt the scrape. Is there a "proper" way to do this?

Here's a copy of my Bash file:

while read filename; do
    instagram-screen-scrape comments --post "$filename" > "$filename.json"
done < list.txt

notslang commented 7 years ago

How many IDs are you looking to scrape? If it's just a few hundred, then I'd use the exit code of the instagram-screen-scrape comments command to retry, up to a maximum number of attempts.
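
A minimal sketch of that retry approach, assuming the CLI exits with a non-zero status when a scrape fails (as the suggestion above implies). The function names retry, scrape_post and scrape_all, the limit of 5 attempts and the 2-second pause are illustrative choices, not part of the tool:

```shell
#!/bin/sh

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits with status 0 or MAX_ATTEMPTS runs have failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Scrape the comments of one post into "<id>.json".
# The redirect lives inside the function so each retry overwrites any
# partial output; stdin is detached so the CLI cannot consume the ID list.
scrape_post() {
    instagram-screen-scrape comments --post "$1" > "$1.json" < /dev/null
}

# scrape_all LIST_FILE
# Reads one media ID per line and reports IDs that still fail after retrying.
scrape_all() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        retry 5 2 scrape_post "$id" || echo "giving up on $id" >&2
    done < "$1"
}
```

Calling scrape_all list.txt after these definitions would process the whole file; IDs that keep failing are printed to stderr so they can be collected and re-run later.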

If it's millions, then I'd put together a set of workers in JS and use RabbitMQ for task distribution and retries. The command line isn't that efficient for scraping: it has to start a new Node process for each scrape and can't reuse the HTTP connection between them. The CLI is just there because it's quick to set up for little tasks.