postaddictme / instagram-php-scraper

Get account information, photos, videos, stories and comments.
https://packagist.org/packages/raiym/instagram-php-scraper
MIT License
3.09k stars 799 forks

Can't retrieve user medias #325

Closed gthedev closed 6 years ago

gthedev commented 6 years ago

Using getMediasByUserId returns an error; the returned body is: {"message": "forbidden", "status": "fail"}

Is there a way to get around this?

ryantbrown commented 6 years ago

I am getting the same thing, just started happening an hour or so ago. Other methods are working.

rhcarlosweb commented 6 years ago

Same here; it works only if I log in with credentials.

fattony80 commented 6 years ago

Same here.

gsound commented 6 years ago

+1 Same here

eastygh commented 6 years ago

Hey, according to the latest news about the private data leak, Instagram is closing some anonymous APIs.

eastygh commented 6 years ago

Oh, sorry. Does this news https://www.instagram.com/developer/changelog/ clarify the situation?

Epimetheus84 commented 6 years ago

+1. But if I log in with credentials, I sometimes get this exception: 'InstagramScraper\Exception\InstagramAuthException' with message 'Something went wrong. Please report issue.'

zaivst commented 6 years ago

Fatal error: Uncaught exception 'InstagramScraper\Exception\InstagramException' with message 'Response code is 403. Body: message => forbidden; status => fail; Something went wrong. Please report issue.' in /InstagramScraper/Instagram.php:315 Stack trace: #0 /InstagramScraper/Instagram.php(272): InstagramScraper\Instagram->getMediasByUserId(4919194635, 3, '') #1

zaivst commented 6 years ago

https://www.instagram.com/vasiliizaikovskii/?__a=1 this works!

rhcarlosweb commented 6 years ago

@zaivst I tried to get the next page of medias but without success 😕

I think query_id or query_hash is one of the things needed to make this work again, but I'm not a developer, just a layman 😔

myrs commented 6 years ago

I made a solution for this one, but in Python, using an automated browser to retrieve cookies and a new URL. I really don't know what a PHP implementation would look like, but these are the steps:

  1. Get cookies with automated browser
  2. Make a request with these cookies and the new URL: 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"<user_id>","first":<items_to_retrieve>,"after":"<end_cursor>"}' where <end_cursor> is either blank or the end_cursor from the previous request, and <items_to_retrieve> is the page size: Instagram web uses 12, and I tested successfully with 20.

Disclaimer 1: no authorization needed! Disclaimer 2: actually I reused the same cookies several times and it worked. The expiry seems to be set to one year. But I don't know whether Instagram will catch the use of the same cookies from many different clients if they are hardcoded into this scraper!

Python implementation:

# ! Error handling is omitted for clarity
import json

import requests
from selenium import webdriver

graphql_url = 'https://www.instagram.com/graphql/query/'
query_hash = '42323d64886122307be10013ad2dcc44'

# first load https://instagram.com in a real browser to obtain cookies
browser = webdriver.Chrome()
browser.get('https://instagram.com')
browser_cookies = browser.get_cookies()
browser.quit()

# copy the browser cookies into a requests session
session = requests.Session()
for cookie in browser_cookies:
    session.cookies.set(cookie['name'], cookie['value'])

# get response as JSON
# > using id '25025320' - profile of Instagram for this example
# > 'after' is blank for the first page, then end_cursor from the previous response
variables = json.dumps({'id': '25025320', 'first': 20, 'after': ''}, separators=(',', ':'))
response = session.get(graphql_url, params={'query_hash': query_hash, 'variables': variables}).json()

myrs commented 6 years ago

https://www.instagram.com/vasiliizaikovskii/?__a=1 this works!

@zaivst this one works and has always worked but, unfortunately, only for the first 12 records of the profile. It used to be possible to retrieve the next chunk of media by adding the max_id parameter, but now that parameter is just ignored.
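
Since max_id is ignored, paging has to follow the end_cursor returned by the graphql endpoint described earlier in this thread. A minimal sketch of that loop, assuming the response nests the media under data.user.edge_owner_to_timeline_media with a page_info object (this shape is an assumption, not something confirmed in the thread); fetch_page is a hypothetical callable that performs the actual HTTP request for a given cursor:

```python
from typing import Callable, Dict, Iterator, List


def iter_media_pages(fetch_page: Callable[[str], Dict], max_pages: int = 10) -> Iterator[List[Dict]]:
    """Yield one list of media nodes per page, following end_cursor."""
    after = ""  # a blank cursor requests the first page
    for _ in range(max_pages):
        payload = fetch_page(after)
        # Assumed response shape; adjust the keys to what the endpoint really returns.
        timeline = payload["data"]["user"]["edge_owner_to_timeline_media"]
        yield [edge["node"] for edge in timeline["edges"]]
        page_info = timeline["page_info"]
        if not page_info.get("has_next_page"):
            break
        after = page_info["end_cursor"]
```

The max_pages cap is only there so that a misbehaving cursor cannot loop forever.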

zaivst commented 6 years ago

this one works and always worked

All the other features worked too, until 4 April :)

carvalholuan commented 6 years ago

Same problem here.

myrs commented 6 years ago

@raiym @rhcarlosweb @gthedev hi! I really don't know PHP to help with this one, but maybe the quick hotfix would be:

  1. Change the URL in https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Endpoints.php from: ACCOUNT_MEDIAS = 'https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first={count}&after={max_id}'; to: ACCOUNT_MEDIAS = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"{user_id}","first":{count},"after":"{max_id}"}';
  2. Send the following cookies with the request. I just checked: the cookies I retrieved yesterday still work (one day later) and from different clients, without needing to get new ones before each request. So the automated browser part might be omitted for now:
    [
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "rur",
        "path": "/",
        "secure": false,
        "value": "PRN"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vw",
        "path": "/",
        "secure": false,
        "value": "1038"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 1554672942.248612,
        "httpOnly": false,
        "name": "csrftoken",
        "path": "/",
        "secure": true,
        "value": "ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 2153943342.248646,
        "httpOnly": false,
        "name": "mid",
        "path": "/",
        "secure": false,
        "value": "WsqLMgALAAFkkaMz9rbL568BCU5N"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vh",
        "path": "/",
        "secure": false,
        "value": "532"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_pr",
        "path": "/",
        "secure": false,
        "value": "2.5"
    }
    ]

Maybe this is not the final solution, but at least media queries will work (for some time 😅)
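
The two steps above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names account_medias_url and cookie_header are hypothetical, the query_hash is the one quoted in this thread, and the cookie list is expected in the same exported format as the JSON above.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
QUERY_HASH = "42323d64886122307be10013ad2dcc44"  # value quoted in this thread


def account_medias_url(user_id: str, count: int, max_id: str = "") -> str:
    """Build the query_hash URL with URL-encoded JSON variables."""
    variables = json.dumps(
        {"id": str(user_id), "first": count, "after": max_id},
        separators=(",", ":"),  # compact form, as in the URL template above
    )
    return GRAPHQL_URL + "?" + urlencode({"query_hash": QUERY_HASH, "variables": variables})


def cookie_header(cookies: List[Dict]) -> str:
    """Turn an exported cookie list (name/value entries) into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

Encoding the variables avoids sending raw braces and quotes in the query string, which some HTTP clients reject.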

carvalholuan commented 6 years ago

@myrs thanks bro! I tried the same here, and it works! :D

zaivst commented 6 years ago

@myrs when I try this in the browser it works, but when I make these changes in the scraper it returns a 403 status

dionvogliqi commented 6 years ago

@myrs how do we edit the cookies we send? I didn't understand step 2. Thanks for your time

myrs commented 6 years ago

@dionii1 As i said, unfortunately, I'm not really familiar with PHP =S As far as I understand, you should send cookies in header for this request. Maybe this piece of code from https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Instagram.php could be relevant to make necessary changes:

$mid = $cookies['mid'];
$csrfToken = $cookies['csrftoken'];
$headers = ['cookie' => "csrftoken=$csrfToken; mid=$mid;",
    'referer' => Endpoints::BASE_URL . '/',
    'x-csrftoken' => $csrfToken,
];
$response = Request::post(Endpoints::LOGIN_URL, $headers,
    ['username' => $this->sessionUsername, 'password' => $this->sessionPassword]);

myrs commented 6 years ago

@zaivst what changes have you made?

zaivst commented 6 years ago

@myrs I changed the ACCOUNT_MEDIAS constant in Endpoints.php, and the Request::get() call in getMediasByUserId() returns a 403 status. But if I open the string returned by Endpoints::getAccountMediasJsonLink($id, $maxId) in the browser, it returns a correct response.

rhcarlosweb commented 6 years ago

@myrs this

{
    "domain": "www.instagram.com",
    "httpOnly": false,
    "name": "ig_pr",
    "path": "/",
    "secure": false,
    "value": "2.5"
}

Do you know what it is? With just this cookie I got recent medias working without credentials

myrs commented 6 years ago

@zaivst as I understand it, this is because the cookies are not set. @zaivst @dionii1 could you check this one? Once more, I'm not familiar with PHP or the structure of this project, but I imagine that https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Instagram.php#L209 (line 209) is where the cookies are set, and that this part should be changed to the following logic: if there is no session, use default cookies (this is actually what I'm doing in Python):

private function generateHeaders($session)
{
    $headers = [];
    if ($session) {
        // Logged-in session: send every session cookie.
        $cookies = '';
        foreach ($session as $key => $value) {
            $cookies .= "$key=$value; ";
        }
        $headers = [
            'cookie' => $cookies,
            'referer' => Endpoints::BASE_URL . '/',
            'x-csrftoken' => $session['csrftoken'],
        ];
    } else {
        // No session: fall back to the default cookies listed above.
        $rur = "PRN";
        $ig_vw = "1038";
        $csrftoken = "ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ";
        $mid = "WsqLMgALAAFkkaMz9rbL568BCU5N";
        $ig_vh = "532";
        $ig_pr = "2.5";

        $headers = [
            'cookie' => "rur=$rur; ig_vw=$ig_vw; csrftoken=$csrftoken; mid=$mid; ig_vh=$ig_vh; ig_pr=$ig_pr;",
            'referer' => Endpoints::BASE_URL . '/',
            'x-csrftoken' => $csrftoken,
        ];
    }
    if ($this->getUserAgent()) {
        $headers['user-agent'] = $this->getUserAgent();
    }
    return $headers;
}

Also, this line https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Instagram.php#L313 should be changed to: $response = Request::get(Endpoints::getAccountMediasJsonLink($id, $count, $maxId), $this->generateHeaders($this->userSession)); so that it includes the count parameter.

Disclaimer: none of this code has been run; it is only supposed to work fine 😅
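
For comparison, the same fallback logic can be sketched in Python. This is a hedged illustration, not the Python code referred to above: generate_headers and DEFAULT_COOKIES are hypothetical names, BASE_URL is assumed to match Endpoints::BASE_URL, and the default values are simply the cookies posted earlier in this thread.

```python
from typing import Dict, Optional

BASE_URL = "https://www.instagram.com"  # assumed value of Endpoints::BASE_URL

# Fallback cookies copied from the list posted earlier in this thread.
DEFAULT_COOKIES = {
    "rur": "PRN",
    "ig_vw": "1038",
    "csrftoken": "ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ",
    "mid": "WsqLMgALAAFkkaMz9rbL568BCU5N",
    "ig_vh": "532",
    "ig_pr": "2.5",
}


def generate_headers(session: Optional[Dict[str, str]], user_agent: Optional[str] = None) -> Dict[str, str]:
    """Use the session cookies when present, otherwise the defaults."""
    cookies = session if session else DEFAULT_COOKIES
    headers = {
        "cookie": "; ".join(f"{name}={value}" for name, value in cookies.items()),
        "referer": BASE_URL + "/",
        "x-csrftoken": cookies.get("csrftoken", ""),
    }
    if user_agent:
        headers["user-agent"] = user_agent
    return headers
```

Calling generate_headers(None) yields the default cookie set, while passing a session dictionary sends that session's cookies instead.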

myrs commented 6 years ago

@rhcarlosweb wow! Is this cookie the only one that's really needed? Honestly, I don't know what these are. I was just sending all the cookies I got. So, do you have a working version?

rhcarlosweb commented 6 years ago

#328 Needs testing... But this solution works for me 😌

myrs commented 6 years ago

@rhcarlosweb wow!! This actually worked for me!! It seems to be some kind of magic, but just using this cookie resolved the whole thing!! So there is no need to switch to the URL I provided before, although using only this cookie with the new URL works fine too.

rhcarlosweb commented 6 years ago

Nice @myrs and thanks for the cookie value, because i don't know how to get this value haha

rhcarlosweb commented 6 years ago

@myrs Strange, because I tested with a blank value, $this->userSession['ig_pr'] = "";, and it works too.. 😕 confused haha

myrs commented 6 years ago

@ryantbrown 👌 So, just waiting for this pull request to be approved

myrs commented 6 years ago

Strange, because I tested with a blank value, $this->userSession['ig_pr'] = "";, and it works too..

Hm.. maybe Instagram just expects this cookie name, no matter the value, because setting it to some random value, e.g. 42, works fine too!

But yes, when ig_pr is not present, it returns a 403 code.

Nice user private data protection system, anyway 😅

ryantbrown commented 6 years ago

@myrs ya, hopefully @rhcarlosweb's PR gets approved quickly. The only thing that is unclear is whether the csrftoken should be set if it is missing. It seems to work with an empty string, but the code you provided suggests it could be set to ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ as well.

Either way I think it's good to go.

ghost commented 6 years ago

@myrs thanks, it worked for me.

carvalholuan commented 6 years ago

You guys can get the "csrf_token" from the login page HTML with a regex.

https://instagram.com/accounts/login/

I'm getting the csrf_token from this page and using it in the cookies.

It appears in this part of the page: {"activity_counts":null,"config":{"csrf_token":"SDYkHgQQsFkO1bCPKDWh35HEaoSOV7rM","viewer":null},

I can't share the code because I'm using C#
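
A minimal Python sketch of the regex approach described above, assuming the login page embeds the token as a "csrf_token":"..." JSON pair like the fragment quoted here. The function name extract_csrf_token is hypothetical, and downloading the page itself is left out.

```python
import re
from typing import Optional

# Matches "csrf_token":"<value>" inside the JSON embedded in the page.
CSRF_RE = re.compile(r'"csrf_token"\s*:\s*"([^"]+)"')


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the first csrf_token found in the page source, or None."""
    match = CSRF_RE.search(html)
    return match.group(1) if match else None
```

The function returns None when the pattern is missing, so a caller can tell a changed page layout apart from an empty token.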

rhcarlosweb commented 6 years ago

@luanrox I think that if we add

$cookies = static::parseCookies($response->headers['Set-Cookie']);
$this->userSession['csrftoken'] = $cookies['csrftoken'];

as the other requests (e.g. getMediasByTag) do, it's going to work
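
The same idea, reading csrftoken from the Set-Cookie response headers, can be sketched in Python with the standard library. This is an illustration only: parse_cookies here is a hypothetical stand-in and not the library's parseCookies, and it expects one string per Set-Cookie header.

```python
from http.cookies import SimpleCookie
from typing import Dict, Iterable


def parse_cookies(set_cookie_headers: Iterable[str]) -> Dict[str, str]:
    """Collect cookie names and values from Set-Cookie header strings."""
    cookies: Dict[str, str] = {}
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # attributes such as Path or Secure are not returned as cookies
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies
```

The resulting dictionary can then supply the csrftoken entry of the session, as in the two PHP lines above.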

rhcarlosweb commented 6 years ago

Today the error is back again =\ Does anyone else have the same issue?

rhcarlosweb commented 6 years ago

I have tested again with a new userSession cookie, and $this->userSession['ig_pr'] doesn't work anymore. Now it needs a $this->userSession['sessionid'], and I think that is only generated with auth =p

carvalholuan commented 6 years ago

Shit :( It stopped working here too...

andrewyoo commented 6 years ago

Haven't cracked it yet, but here's what I know so far.

When logged in, cookie 'sessionid' is required. When not logged in, a new header is required: 'x-instagram-gis: 4b698621d4a2ef5913f90aec25475d04'

I don't know how x-instagram-gis is calculated, but it appears to be an encryption of the parameters. The x-instagram-gis is recalculated for each pagination request, but is the same for the same request. It looks like some crypto hashing function that definitely involves the variables parameter and who knows what else.

I've tried to look at the obfuscated js to see what kind of encryption they are doing, but I haven't found it. Maybe someone can help take a look as well.

Perhaps they are using the same encryption technique that changed query_id to query_hash? Does anyone know how that's encrypted? It is a 32 char output, so i tried to play with md5, but no go.

350d commented 6 years ago

@andrewyoo x-instagram-gis is calculated from csrf_token, rhx_gis, window.navigator.userAgent and the variables from the API call. Here is my refactored hashing function:

function gishash(n,r,t){function e(n,r){var t=(65535&n)+(65535&r);return(n>>16)+(r>>16)+(t>>16)<<16|65535&t}function o(n,r,t,o,u,c){return e((f=e(e(r,n),e(o,c)))<<(a=u)|f>>>32-a,t);var f,a}function u(n,r,t,e,u,c,f){return o(r&t|~r&e,n,r,u,c,f)}function c(n,r,t,e,u,c,f){return o(r&e|t&~e,n,r,u,c,f)}function f(n,r,t,e,u,c,f){return o(r^t^e,n,r,u,c,f)}function a(n,r,t,e,u,c,f){return o(t^(r|~e),n,r,u,c,f)}function i(n,r){var t,o,i,h,g;n[r>>5]|=128<<r%32,n[14+(r+64>>>9<<4)]=r;var v=1732584193,d=-271733879,l=-1732584194,A=271733878;for(t=0;t<n.length;t+=16)o=v,i=d,h=l,g=A,d=a(d=a(d=a(d=a(d=f(d=f(d=f(d=f(d=c(d=c(d=c(d=c(d=u(d=u(d=u(d=u(d,l=u(l,A=u(A,v=u(v,d,l,A,n[t],7,-680876936),d,l,n[t+1],12,-389564586),v,d,n[t+2],17,606105819),A,v,n[t+3],22,-1044525330),l=u(l,A=u(A,v=u(v,d,l,A,n[t+4],7,-176418897),d,l,n[t+5],12,1200080426),v,d,n[t+6],17,-1473231341),A,v,n[t+7],22,-45705983),l=u(l,A=u(A,v=u(v,d,l,A,n[t+8],7,1770035416),d,l,n[t+9],12,-1958414417),v,d,n[t+10],17,-42063),A,v,n[t+11],22,-1990404162),l=u(l,A=u(A,v=u(v,d,l,A,n[t+12],7,1804603682),d,l,n[t+13],12,-40341101),v,d,n[t+14],17,-1502002290),A,v,n[t+15],22,1236535329),l=c(l,A=c(A,v=c(v,d,l,A,n[t+1],5,-165796510),d,l,n[t+6],9,-1069501632),v,d,n[t+11],14,643717713),A,v,n[t],20,-373897302),l=c(l,A=c(A,v=c(v,d,l,A,n[t+5],5,-701558691),d,l,n[t+10],9,38016083),v,d,n[t+15],14,-660478335),A,v,n[t+4],20,-405537848),l=c(l,A=c(A,v=c(v,d,l,A,n[t+9],5,568446438),d,l,n[t+14],9,-1019803690),v,d,n[t+3],14,-187363961),A,v,n[t+8],20,1163531501),l=c(l,A=c(A,v=c(v,d,l,A,n[t+13],5,-1444681467),d,l,n[t+2],9,-51403784),v,d,n[t+7],14,1735328473),A,v,n[t+12],20,-1926607734),l=f(l,A=f(A,v=f(v,d,l,A,n[t+5],4,-378558),d,l,n[t+8],11,-2022574463),v,d,n[t+11],16,1839030562),A,v,n[t+14],23,-35309556),l=f(l,A=f(A,v=f(v,d,l,A,n[t+1],4,-1530992060),d,l,n[t+4],11,1272893353),v,d,n[t+7],16,-155497632),A,v,n[t+10],23,-1094730640),l=f(l,A=f(A,v=f(v,d,l,A,n[t+13],4,681279174),d,l,n[t],11,-358537222),v,d,n[t+3],16,-722521979),A,v,n[t+6],23,76029189),l=f(l,A=
f(A,v=f(v,d,l,A,n[t+9],4,-640364487),d,l,n[t+12],11,-421815835),v,d,n[t+15],16,530742520),A,v,n[t+2],23,-995338651),l=a(l,A=a(A,v=a(v,d,l,A,n[t],6,-198630844),d,l,n[t+7],10,1126891415),v,d,n[t+14],15,-1416354905),A,v,n[t+5],21,-57434055),l=a(l,A=a(A,v=a(v,d,l,A,n[t+12],6,1700485571),d,l,n[t+3],10,-1894986606),v,d,n[t+10],15,-1051523),A,v,n[t+1],21,-2054922799),l=a(l,A=a(A,v=a(v,d,l,A,n[t+8],6,1873313359),d,l,n[t+15],10,-30611744),v,d,n[t+6],15,-1560198380),A,v,n[t+13],21,1309151649),l=a(l,A=a(A,v=a(v,d,l,A,n[t+4],6,-145523070),d,l,n[t+11],10,-1120210379),v,d,n[t+2],15,718787259),A,v,n[t+9],21,-343485551),v=e(v,o),d=e(d,i),l=e(l,h),A=e(A,g);return[v,d,l,A]}function h(n){var r,t="",e=32*n.length;for(r=0;r<e;r+=8)t+=String.fromCharCode(n[r>>5]>>>r%32&255);return t}function g(n){var r,t=[];for(t[(n.length>>2)-1]=void 0,r=0;r<t.length;r+=1)t[r]=0;var e=8*n.length;for(r=0;r<e;r+=8)t[r>>5]|=(255&n.charCodeAt(r/8))<<r%32;return t}function v(n){var r,t,e="";for(t=0;t<n.length;t+=1)r=n.charCodeAt(t),e+="0123456789abcdef".charAt(r>>>4&15)+"0123456789abcdef".charAt(15&r);return e}function d(n){return unescape(encodeURIComponent(n))}function l(n){return h(i(g(r=d(n)),8*r.length));var r}return v(l(r+":"+t+":"+window.navigator.userAgent+":"+n))}

Call this function like this: gishash("{\"id\":\"5821462185\",\"first\":40,\"after\":\"\"}", rhx_gis, csrf_token). rhx_gis and csrf_token can be parsed from the source of any embed page (CORS is available on these links).

I've tried to achieve this via JavaScript, but here is the problem: I can't set these custom headers because of the allow-origin limitation for custom headers on Instagram's side. That shouldn't be a problem in PHP, I guess.
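
Reading the last statement of the minified function, it hashes the string rhx_gis:csrf_token:userAgent:variables, and the constants look like a stock MD5 implementation. Under that assumption, a short Python equivalent could look like the sketch below. The name gis_hash is hypothetical, and user_agent must be the same string that is sent in the request's User-Agent header.

```python
import hashlib


def gis_hash(variables: str, rhx_gis: str, csrf_token: str, user_agent: str) -> str:
    """MD5 hex digest of 'rhx_gis:csrf_token:user_agent:variables'."""
    payload = ":".join([rhx_gis, csrf_token, user_agent, variables])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Because the variables string is part of the hashed input, it has to be byte-for-byte identical to the JSON sent in the variables query parameter.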

andrewyoo commented 6 years ago

@350d, wow, that actually worked! How did you go about extracting (or building) that hashing function? Just curious for future endeavors. Thanks!

350d commented 6 years ago

@andrewyoo just simple debug in browser

footniko commented 6 years ago

Seems like mentioned solutions here are no longer working :(
Need to find out the way the cookie param is generated for non-authorized users.

350d commented 6 years ago

I've just realized that x-instagram-gis is just an md5 hash 😀

footniko commented 6 years ago

@350d have you tried making some requests? x-instagram-gis alone is no longer enough for non-authorized users. As I understand it, valid cookie data is needed, and it is generated on every request.

350d commented 6 years ago

@footniko you should generate this header for every request

footniko commented 6 years ago

Yes, but only this header is not enough...

var data = null;

var xhr = new XMLHttpRequest();
xhr.withCredentials = true;

xhr.addEventListener("readystatechange", function () {
  if (this.readyState === 4) {
    console.log(this.responseText);
  }
});

xhr.open("GET", "https://www.instagram.com/graphql/query/?query_hash=bfe6fc64e0775b47b311fc0398df88a9&variables=%7B%22user_id%22%3A%224502807%22%2C%22include_chaining%22%3Afalse%2C%22include_reel%22%3Afalse%2C%22include_suggested_users%22%3Afalse%2C%22include_logged_out_extras%22%3Atrue%7D");
xhr.setRequestHeader("x-instagram-gis", "2b92887dc6325064cf6294b95aa04586");
xhr.setRequestHeader("cache-control", "no-cache");

xhr.send(data);

Returns:

{
    "message": "forbidden",
    "status": "fail"
}

350d commented 6 years ago

add X-CSRFToken and X-Instagram-AJAX: 1 headers
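
A minimal Python sketch of the header set suggested here. The function name build_graphql_headers is hypothetical, and both values are assumed to have been obtained beforehand: the gis hash computed for this exact request and the current csrf_token.

```python
from typing import Dict


def build_graphql_headers(gis: str, csrf_token: str) -> Dict[str, str]:
    """Headers named in this thread for an unauthenticated graphql request."""
    return {
        "x-instagram-gis": gis,       # must be recalculated for every request
        "X-CSRFToken": csrf_token,
        "X-Instagram-AJAX": "1",
    }
```

Whether these three headers are sufficient on their own is not confirmed in the thread.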

knissophiliac commented 6 years ago

@350d so you mean the following headers are enough? x-instagram-gis: a randomly generated md5 hash; X-CSRFToken: csrftoken; X-Instagram-AJAX: 1

350d commented 6 years ago

@knissophiliac these headers are not randomized, they are calculated from current csrf_token and rhx_gis and other vars.

knissophiliac commented 6 years ago

@350d ok. I read your comment above, but I couldn't find rhx_gis in my responses.