snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal

switch instagram to scraping due to new permission policy :( #603

Closed snarfed closed 8 years ago

snarfed commented 8 years ago

Instagram is locking down their API and requiring all apps to go through a review process similar to facebook's. details in snarfed/granary#65.

they're mainly locking down /users/self/feed and /media/popular and sending photos outside of instagram, none of which bridgy does, so i think we'll be ok, but no guarantees.

TODO for switching to scraping:

snarfed commented 8 years ago

the new set of oauth scopes aka permissions is on https://www.instagram.com/developer/authorization/ :

  • basic - to read a user’s profile info and media (granted by default)
  • public_content - to read any public profile info and media on a user’s behalf
  • follower_list - to read the list of followers and followed-by users
  • comments - to post and delete comments on a user’s behalf
  • relationships - to follow and unfollow accounts on a user’s behalf
  • likes - to like and unlike media on a user’s behalf
snarfed commented 8 years ago

i started on the review process, but stopped when i saw it requires a screencast. ugh.

i'll do that eventually. here's the rest of what i have written up so far:

https://www.instagram.com/developer/clients/580be8883446443d8216ebdf0462f3b8/review/

1. Description

Got a blog? Do you post your public Instagram photos on your blog? Bridgy notifies your blog posts when people like or comment on your photos on Instagram.

2. How does your app use the Instagram API?

Bridgy helps individual users share their own content with their own web sites. Specifically, when a user posts a photo on their own web site (by any means) as well as Instagram, Bridgy notifies their web site when people like that photo or comment on it inside Instagram. This requires the basic permission.

Bridgy only operates on public accounts. It does not support private accounts.

Bridgy also has a publish feature that integrates with users' web sites in the other direction. Users can post on their web site that they like an Instagram photo, or have a comment on it, and they can then use Bridgy to post that comment or like that photo inside Instagram. These require the likes permission, which Bridgy currently has, and the comments permission, which it doesn't.

3. Do you need additional permissions?

Permission: likes

Users can post on their web site that they like an Instagram photo. They can then use Bridgy to like that photo inside Instagram.

Permission: comments

Users can post on their web site a comment on an Instagram photo. They can then use Bridgy to post that comment on that photo inside Instagram.

snarfed commented 8 years ago

made the screencast: https://youtu.be/eGMNItivBdY

snarfed commented 8 years ago

...and submitted to instagram for approval. fingers crossed! https://www.instagram.com/developer/clients/580be8883446443d8216ebdf0462f3b8/edit/#permissions

snarfed commented 8 years ago

they denied us. :(

Invalid Use Case: The use case described in your submission notes, screencast and website is not a valid use case that we allow on our Platform. Please see our Permissions Review and valid use cases description (https://www.instagram.com/developer/review/) for more information.

well. that's a problem.

they also denied commenting and liking, which is a bit less surprising, and due to a technicality: we didn't describe our use case well. meh.

This permission (comments) does not support the use case you described in your submission notes, screencast and website. Please review Login Permissions (http://instagram.com/developer/authorization) for a comprehensive list of permissions and valid use cases.

likes:

This permission (likes) does not support the use case you described in your submission notes, screencast and website. Please review Login Permissions (http://instagram.com/developer/authorization/) for a comprehensive list of permissions and valid use cases.

snarfed commented 8 years ago

next step: apply for oauth-dropins and see if i can get it approved. not holding my breath, but i'd like to find at least one app i can get approved, just to see how the process works all the way through.

snarfed commented 8 years ago

done. fingers crossed!

snarfed commented 8 years ago

oauth-dropins got rejected too. :/

Still in Development: Your app is still in development. Please resubmit only when your app is ready to go live and no longer in development. Invalid Use Case: The use case described in your submission notes, screencast and website is not a valid use case that we allow on our Platform.

snarfed commented 8 years ago

i'm running out of ideas. i may have to start scraping. :/

snarfed commented 8 years ago

i took a brief look at what it would take to switch to scraping. the good news is, it's doable. instagram profile and photo pages happily serve without being logged in, and the data is easily available in JSON that we already have code to extract and parse.

the bad news is, profile pages only include counts of comments and likes for each photo, not the actual data about them. we'd have to fetch the individual photo pages to get the data. annoying, but not too bad. we already do this for twitter and google+.

the more worrisome part is that comments and likes are paged, so fetching the photo only gets us the first 10 of each. hrmph. if it's the most recent 10, we'll be able to backfeed at least 10 comments and likes per photo per poll period (20m right now)...but i expect some people peak above that sometimes. hrmph.
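
A minimal sketch of the extraction step described above, pulling JSON that is embedded in a page's HTML. The `window._sharedData` marker, the regex, and the function name are assumptions for illustration; the thread doesn't say how the data is embedded or how bridgy's existing code finds it.

```python
import json
import re

# Assumed marker: the page embeds its data as
# `window._sharedData = {...};` inside a <script> tag.
SHARED_DATA_RE = re.compile(
    r'window\._sharedData\s*=\s*(\{.*?\});</script>', re.DOTALL)


def extract_embedded_json(html):
    """Return the embedded JSON object as a dict, or None if it isn't found."""
    match = SHARED_DATA_RE.search(html)
    if not match:
        return None
    return json.loads(match.group(1))
```

Since profile pages only carry counts, the same function would have to be run again on each individual photo page to get the comments and likes themselves.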

kylewm commented 8 years ago

Iiiiiii'd give some serious thought to whether it's worth the effort. Because of https://github.com/aaronpk/OwnYourGram/issues/16 PESOS doesn't work for many people any more anyway, and it's very likely OYG will be cut off altogether (even if he rebrands it).

I'm curious what the situation with IFTTT/Zapier/etc. integration is... whether their channels will be shut off too.

snarfed commented 8 years ago

hrmph, true. point taken.

i still posse to IG manually, so i may still do it if only for myself. we'll see.

kylewm commented 8 years ago

well, if you do do it, I'll certainly continue to use it :P

snarfed commented 8 years ago

ok, this is implemented, naively. it has to do an HTTP fetch per picture, in serial, to get comments and likes. ideally, those would be parallelized, and also cache and check the counts like G+ now (and i think twitter) so it only does the fetches when there are new comments or likes.
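
The two improvements mentioned above could look roughly like this: skip photos whose counts haven't changed, and fetch the rest in parallel. This is a sketch, not bridgy's code. The photo dict shape, `cached_counts`, and `fetch_photo` are hypothetical, and it assumes threads are available in the runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def photos_needing_fetch(photos, cached_counts):
    """Return photos whose (comments, likes) counts differ from the cache."""
    changed = []
    for photo in photos:
        counts = (photo["comments"], photo["likes"])
        if cached_counts.get(photo["id"]) != counts:
            changed.append(photo)
    return changed


def fetch_changed(photos, cached_counts, fetch_photo, max_workers=4):
    """Fetch only the changed photos, in parallel, then update the cache."""
    changed = photos_needing_fetch(photos, cached_counts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `changed`
        results = list(pool.map(fetch_photo, changed))
    for photo in changed:
        cached_counts[photo["id"]] = (photo["comments"], photo["likes"])
    return results
```

Comparing counts misses the case where one like is removed and another added between polls, so the count stays the same while the data changes.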

petermolnar commented 8 years ago

How will this affect the scraping? http://digiday.com/platforms/instagram-feed-changing-algorithm/

snarfed commented 8 years ago

heh. good question! fortunately we'd scrape your profile page, not your feed, and profile pages probably won't be algorithmic.

snarfed commented 8 years ago

open question: how to do auth for Instagram users, ie prove that they own an account before signing up or deleting, without the API?

the only answers I've come up with so far are 1) no auth and 2) indieauth, and check that the same domain is in the Instagram profile...

...in which case we'd need snarfed/oauth-dropins#10 (indieauth support).
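
Option 2 boils down to a domain comparison: the domain the user proved with indieauth has to match a domain linked from their Instagram profile. A rough sketch, with hypothetical function names; ignoring a leading `www.` is a choice made here, not something the thread specifies.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def owns_profile(indieauth_url, profile_urls):
    """True if the indieauth'ed domain matches any URL in the profile."""
    me = domain_of(indieauth_url)
    return bool(me) and any(domain_of(u) == me for u in profile_urls)
```

For example, `owns_profile("https://example.com/", ["http://www.example.com/about"])` would pass, while a profile with no links would always fail.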

snarfed commented 8 years ago

we'll also need to port the cron job that updates profile pictures.

snarfed commented 8 years ago

starting a todo list in the description.

snarfed commented 8 years ago

looks like the mf2 handlers were ok after all. the ID_USERID is expected, happens now, and works ok. i got a 200 from https://api.instagram.com/v1/media/1209758400153852506_1103525 just now, which 404ed earlier. so maybe a transient instagram problem? seems unlikely, but possible.

i still have to port the mf2 handlers themselves from the api to scraping, but that's separate.

snarfed commented 8 years ago

wow. evidently the real problem is that the API returns incomplete data. eg https://api.instagram.com/v1/media/1209758400153852506_1103525 says there are 10 likes but only includes 4 of them. the embedded JSON data in the HTML, https://www.instagram.com/p/BDJ7Nr5Nxpa/ , includes all 10.

pretty clear. on to porting the handlers!
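
The mismatch is easy to check mechanically: compare the reported count against the number of items actually included. The `{"count": ..., "data": [...]}` shape below is an assumption about how the likes and comments objects are laid out, and the function name is made up.

```python
def is_complete(collection):
    """True if the listed items cover the reported count."""
    return len(collection.get("data", [])) >= collection.get("count", 0)


# The case described above: 10 likes reported, only 4 included.
api_likes = {"count": 10, "data": [{"username": "user%d" % i} for i in range(4)]}
html_likes = {"count": 10, "data": [{"username": "user%d" % i} for i in range(10)]}
```

Here `is_complete(api_likes)` is false and `is_complete(html_likes)` is true.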

snarfed commented 8 years ago

delete is blocked on aaronpk/IndieAuth.com#113.

snarfed commented 8 years ago

current plan for deleting legacy API accounts is that we'll indieauth into their first web site in domain_urls, which means delete won't work for accounts without any web sites. they'll need to re-login (with indieauth) first. here are those accounts:

/instagram/adamdohm
/instagram/amohd2
/instagram/andresin87
/instagram/chellebb
/instagram/debbite
/instagram/dougmckown
/instagram/eddy.arnold
/instagram/espylaub
/instagram/fck_yeah_
/instagram/fermentationfan
/instagram/hendryque
/instagram/isapien
/instagram/jamieontiveros
/instagram/johnbenson
/instagram/mathewi
/instagram/mistermaumau
/instagram/nikolnieto
/instagram/njashanmal
/instagram/photofox
/instagram/realkoyuchan
/instagram/silveradepy
/instagram/srevo
/instagram/the_timweston
/instagram/tylergillies
/instagram/zlojkashtan
snarfed commented 8 years ago

ran this in remote_api_shell to remove publish from all instagram accounts:

# drop the 'publish' feature from every Instagram account that has it
for i in Instagram.query(Instagram.features == 'publish'):
  i.features.remove('publish')
  i.put()
snarfed commented 8 years ago

flipped the switch! all instagram accounts are now on scraping and using indieauth for login/delete. fingers crossed!

snarfed commented 8 years ago

looking good so far. tentatively closing. woo!

Johnathangalliano commented 8 years ago

@snarfed May I ask how you passed the "Still in Development: Your app is still in development. Please resubmit only when your app is ready to go live and no longer in development." part? I am trying to submit my app now and I get this error back. And I can't for the life of me understand what it means. Sorry to hijack your thread but you seem to be the only one that has faced this issue.

snarfed commented 8 years ago

@Johnathangalliano sounds like your app is still in sandbox mode? https://www.instagram.com/developer/sandbox/

i didn't actually get approved, so i don't have more specific advice, sorry. i switched to scraping their html instead. :/

rummykhan commented 7 years ago

i was also doing scraping, and it was all going very well, but suddenly all my accounts started getting limit exceeded, even when i sign in. do u have a fix for this? and did you monitor the rate limit on different endpoints? thanks

snarfed commented 7 years ago

@rummykhan if you're getting 429s, then yeah, instagram rate limits HTTP requests by IP address or subnet. i hit that at one point too. lots of details in #665 and https://groups.google.com/d/msg/google-appengine/rpendSIxJMo/_u4G6uXiBQAJ .
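
One generic way to cope with 429s is to back off and retry. This is a common technique, not something this thread says bridgy does. In the sketch below, `fetch` is a hypothetical callable returning `(status, body)`, and the retry count and delays are arbitrary.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); retry on 429, doubling the delay."""
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_tries - 1:
            # wait before the next try, then double the wait
            sleep(delay)
            delay *= 2
    # still rate limited after max_tries; hand the 429 back to the caller
    return status, body
```

Because the limit is per IP address or subnet, backing off only helps if the total request volume from that address drops; it doesn't raise the limit.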

rummykhan commented 7 years ago

thanks @snarfed and yea i was getting response code 429, today i did some testing and what i found is here.. maybe it'll help somebody.

Instagram Scraping WORKAROUND

Tests:

2.  Get Posts of a user

    Test # 1 (Instagram Form Auth - Account 1)
    ------------------------------
        Login Status = Success

        Minutes     = 2:42
        Seconds     = 162
        Requests    = 354

        After this got response code 429 (Limit Exceeded)

    Test # 2 (Instagram Form Auth - Account 2)
    ------------------------------
        Login Status = Success

        Minutes     = 3:11
        Seconds     = 191
        Requests    = 400

        After this got response code 429 (Limit Exceeded)

    Test # 3 (Instagram Form Auth - Account 3)
    ------------------------------
        Login Status = Fail (Asked for email/phone verification)

        Minutes     = 3:13
        Seconds     = 182
        Requests    = 393

        After this got response code 429 (Limit Exceeded)

        Observation
        -----------
        1. We can get the user posts without being logged in.

    Test # 4 (No Auth - Time Delay 1 Second)
    ------------------------------
        Minutes     = 173
        Seconds     = 10438
        Requests    = 7051

        State: Stopped intentionally

Key Observation

  1. Request counts are IP based (previously i thought they were user based).

Solution


  1. Use Proxies to avoid rate limiting. (Change the proxy as you receive 429)
  2. To Enhance speed Use Python multiprocessing with proxy chaining.
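
For comparison, the request rates implied by the four tests can be worked out from the Requests and Seconds figures. The labels are shorthand added here. Test 3 lists both 3:13 and 182 seconds, which don't agree; this uses the Seconds value as given.

```python
# (requests, seconds) taken from the test results above
tests = {
    "Test 1 (logged in)": (354, 162),
    "Test 2 (logged in)": (400, 191),
    "Test 3 (login failed)": (393, 182),
    "Test 4 (no auth, 1s delay)": (7051, 10438),
}

rates = {name: requests / seconds for name, (requests, seconds) in tests.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} requests/second")
```

That comes to about 2.19, 2.09, and 2.16 requests per second for the three runs that hit a 429, and about 0.68 for the delayed run that was stopped intentionally without one.
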
jgozal commented 7 years ago

If I may ask, about how many requests were you guys making per hour before hitting that limit?

snarfed commented 7 years ago

sure! details on request volume above and in https://groups.google.com/d/msg/google-appengine/rpendSIxJMo/_u4G6uXiBQAJ :

jgozal commented 7 years ago

Thanks @snarfed. I'm making ~.7qps non-stop for the whole day and haven't gotten any 429s (it's only been 3-4 days). Do you think I should be concerned about getting them in the future at that rate?

snarfed commented 7 years ago

based on this data, maybe yes, within weeks. good luck though!

harshdamaniahd commented 6 years ago

Do I have to go through app approval process even for fetching photos from my Instagram account?

snarfed commented 6 years ago

@harshdamaniahd to write your own Instagram app? yes, and good luck even then.

harshdamaniahd commented 6 years ago

but here i see this, which means for a non-business account we are redirected to the old developer site (image)