lucyparsons / OpenOversight

Police oversight and accountability through public data 👮
https://openoversight.com
GNU General Public License v3.0
237 stars 79 forks

Implement Flickr scraping #17

Open b-meson opened 8 years ago

b-meson commented 8 years ago

There are a lot of high-quality photos with very visible names and badge numbers on Flickr. Some groups are worth an initial scrape.

b-meson commented 8 years ago

Possible link worth exploring: https://gist.github.com/ralphbean/9966896. There is a Python pip module as well: https://github.com/alexis-mignon/python-flickr-api/wiki/Tutorial

b-meson commented 8 years ago

Currently there seems to be one way to do this in bulk (without a dedicated application or API). Open all the pages for a group or individual, scroll all the way down so the JS renders everything, then in the web inspector expand all of the HTML and use a combination of awk / grep / sort / cut / uniq to grab the relative path of each picture: something like /photos/photoid. You can then open flickr.com/photos/photoid, and from that HTML you can find the src-id for the full-resolution photo. I have been trying a combination of this plus curl and haven't had much success. It's likely we need a programmatic way of doing this (like an API), or to use Python's RoboBrowser to get around some of these limitations.
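For what it's worth, the awk / grep / sort / uniq step could also be done in Python once the rendered HTML is saved. This is only a rough sketch: it assumes photo links look like `/photos/<owner>/<numeric id>` (that pattern is a guess and may not match the live markup), and `extract_photo_paths` and `to_absolute` are made-up helper names.

```python
import re

# Assumed link shape: href="/photos/<owner>/<numeric photo id>...".
# This pattern is a guess, not taken from Flickr's actual markup.
PHOTO_PATH = re.compile(r'href="(/photos/[^/"\s]+/\d+)')


def extract_photo_paths(html):
    """Return unique relative photo paths, in first-seen order."""
    seen = set()
    paths = []
    for path in PHOTO_PATH.findall(html):
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


def to_absolute(path, base="https://www.flickr.com"):
    """Turn a relative photo path into a full page URL."""
    return base + path
```

Each resulting URL would still need to be opened to find the full-resolution image source, which is the part that was failing with curl.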

r4v5 commented 8 years ago

Selenium might be the answer, as much as I hate it, if we can't actually get the API working.

b-meson commented 8 years ago

I will try a bit this week. Do you want to take a crack at it as well, @r4v5 and @JoshuaOpolko?

b-meson commented 8 years ago

I noticed that the credentials I posted in the Slack channel are authentication and secret keys, but we might actually be missing the API key (I believe that is separate), which might be why I was failing hard. I'm also more hopeful about Selenium WebDriver than I was a few days ago.

JoshuaOpolko commented 8 years ago

I've created a script to obtain photos and details via the Flickr API and used it to mine 2 CPD Flickr groups so far. It retrieves the highest-resolution image for each entry as well as the title, description, and similar metadata available through the API. The run time for retrieving an entire group's photos/details (approx. 2000-5000) is generally a couple of hours due to rate limits. Next I'll be looking at analyzing the titles/descriptions to see if it's possible to obtain names or probable names automatically, in at least some cases. The output files have been added to GitHub under cpd_pictures. More groups will be added as needed.
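For anyone following along, here is a rough sketch of what a group-pool request against the Flickr REST API can look like. It is not the actual script. It assumes the `flickr.groups.pools.getPhotos` method with `url_*` extras to get image URLs in one call, and the helper names (`build_pool_request`, `best_url`, `fetch_pool_page`) and the size preference order are illustrative choices.

```python
import json
import urllib.parse
import urllib.request

REST_ENDPOINT = "https://api.flickr.com/services/rest/"

# Size extras to request, ordered from largest to smallest.
# Not every photo exposes every size (url_o depends on owner settings).
SIZE_PREFERENCE = ["url_o", "url_l", "url_c", "url_z", "url_m"]


def build_pool_request(api_key, group_id, page=1, per_page=500):
    """Build the request URL for one page of a group's photo pool."""
    params = {
        "method": "flickr.groups.pools.getPhotos",
        "api_key": api_key,
        "group_id": group_id,
        "extras": "description," + ",".join(SIZE_PREFERENCE),
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return REST_ENDPOINT + "?" + urllib.parse.urlencode(params)


def best_url(photo):
    """Return the largest available image URL for a photo record, or None."""
    for key in SIZE_PREFERENCE:
        if photo.get(key):
            return photo[key]
    return None


def fetch_pool_page(api_key, group_id, page=1):
    """Fetch one page of results (performs a network call)."""
    url = build_pool_request(api_key, group_id, page)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["photos"]
```

A caller would loop over pages until the `pages` count in the response is reached, saving `best_url(photo)` together with the title and description for each entry.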

JoshuaOpolko commented 8 years ago

I just want to add that the timing hasn't been tuned that much yet so it may be possible to run the collection somewhat more quickly.
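One way to make the timing easy to tune is to keep the delay in a single place. Below is a minimal sketch of a fixed-interval throttle. It is not what the current script does, the `Throttle` class is a hypothetical name, and the right interval would have to be found by experiment against the API's limits.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the timing can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each API request would let the interval be lowered step by step until rate-limit errors appear.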