Get older data - Githubissues

Thicool commented 8 years ago

How can I get older posts for a given hashtag? The provided function only mines the 21 posts on the starting page.

panda0881 commented 8 years ago

8. get_media_from_tag(self, tag_name):

The input of this method is the tag name you are interested in. The output is two lists: the first contains all the media codes that belong to top_post under this tag, and the second is the full list of all the media codes under this tag.

Maybe you can look into this function to see if it solves your problem? The longer list should consist of all the media under the specific hashtag.

Best Regards, Hongming

panda0881 commented 8 years ago

Dear Jan,

It's nice hearing from you. If I understand correctly, you are trying to collect information about all the media under a specific hashtag, right? If so, the 21 posts you mentioned are the content that comes with the HTML document; the others are loaded by JavaScript. Typically, you have to use dynamic crawling to solve this problem. As you can see in my code, you can try the 8th function:

8. get_media_from_tag(self, tag_name):

The input of this method is the tag name you are interested in. The output is two lists: the first contains all the media codes that belong to top_post under this tag, and the second is the full list of all the media codes under this tag.

If anything doesn't work or you still have questions about that, feel free to contact me😁. I'm very happy to solve the problem with you.

Best Regards,

Hongming

On Tue, Oct 11, 2016 at 12:14 AM, Thicool notifications@github.com wrote:

Hey there, is there a way to get information about old posts under a tag? I am new to Python and would like to get the captions of posts that are posted under a certain hashtag. The script works fine but gives me back 21 posts, which is exactly the number of posts that one finds on the explore page. Is there a way to tell the script to find the older posts and extract their information as well?

thanks for your help, Jan

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/panda0881/Instagram_data/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AM7XSLE0oVmWF4oY3rxwmj2M4s_GFlm0ks5qymRJgaJpZM4KSvMW .

Thicool commented 8 years ago

Thanks for the help! That works fine now. I modified the function that gets all the media codes, and it now gets me the media codes, captions and dates 👍

Thicool commented 8 years ago

Hey Hongming, I tried to run my code to get all data for a tag, but it seems that it somehow only works until around 4500 posts are scraped. Same issue with the original get_media_from_tag script.

error.pdf

Do you have a guess what is going wrong there?

Thanks so much for your help!

panda0881 commented 8 years ago

Hi Jan,

I looked into your problem. The error here is a KeyError, which indicates that there is something wrong with the data you get from the URL. There are several possible reasons: 1. the server may have found out that you are a bot; 2. the server may have deleted several pages while the previous page still says there are more. In my experience, the second reason is the most likely one. Anyway, I added a check to the program so that if a similar problem occurs, it will let you know and keep running.

Thank you for letting me know.

Best Regards, Hongming

Thicool commented 8 years ago

Hey Hongming. Thanks for your fast response. Your changes to the code helped me a lot, in that when the problem occurs, the script stops without an error and the collected data can be used. However, for what I am trying to do, I probably need more data. Is there a way to skip the missing pages and continue afterwards instead of just stopping the whole program?

Thanks for your help!!

Regards, Jan

Edit: It is not a tag-specific problem; other tags also hit the error after around 4500 posts...

panda0881 commented 8 years ago

You can try to build a loop that keeps connecting to the server until you get the right response. Here is an example I used before in another program.

import time
import requests

# Assumes s is a requests session, as used by the spider.
s = requests.Session()

def request_until_succeed(url):
    # Re-send the request until the server answers with HTTP 200.
    while True:
        try:
            response = s.get(url)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network error: wait and try again
        time.sleep(5)
        print("Retrying...")

Maybe you can change the response.status_code == 200 check into some code that checks whether there is a 'media' key or not.
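A rough sketch of that idea, not the code in the repository: retry until the decoded JSON actually contains a 'media' key, and give up after a few attempts. The helper name fetch_until_media and the fetch callable are hypothetical; fetch stands for anything that returns the decoded response as a dict, for example a small wrapper around s.get(url).json().

```python
import time

def fetch_until_media(fetch, max_retries=5, delay=5.0, sleep=time.sleep):
    """Call fetch() until the decoded JSON contains a 'media' key.

    fetch is a hypothetical zero-argument callable returning a dict.
    Returns the first usable result, or None if every attempt failed.
    """
    for _ in range(max_retries):
        result = fetch()
        if isinstance(result, dict) and 'media' in result:
            return result
        print("No 'media' in response, retrying...")
        sleep(delay)  # give the server some time before asking again
    return None
```

Capping the number of retries avoids an endless loop when the server keeps refusing, which is what a {'status': 'fail'} answer would otherwise cause.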

Best Regards, Hongming

Thicool commented 8 years ago

I looked at the last response that causes the error, using: except KeyError: print result

and it returned:

{u'status': u'fail', u'message': u'\u5f88\u62b1\u6b49\uff0c\u8bf7\u6c42\u6b21\u6570\u8fc7\u591a\u3002\u8bf7\u7a0d\u540e\u91cd\u8bd5\u3002'}

There is no information in this response to get. My problem is that there is also no 'end_cursor', so I cannot skip it and just go to the next page with

self.collect_media_list(tag_name, result['media']['page_info']['end_cursor'])

As I need this data for my studies, I would be very thankful if you could look at that problem and maybe provide a solution....

Thanks so much Jan

panda0881 commented 8 years ago

Hi Jan,

Sorry for the delayed response, I just got up. I understand your problem, but the thing is, if there is no appropriate response from the server, you can't skip it. There is a possible way to solve the problem, though. You may consider debugging the program to inspect the response from the server. The response status can tell you why you can't get the appropriate response, and you can then fix the actual cause. For instance, if you got blocked by the server, it may mean that you need to add a delay to the program.

Btw, do you mind telling me which hashtag you are interested in? Maybe I can help find out where the problem is.

Best Regards, Hongming

Thicool commented 8 years ago

Hey Hongming, I am studying brand image perceptions, so any global brand can be interesting for this study. I tried #mcdonalds to get the data. But the problem occurs on other hashtags at around the same point, so I think it is unlikely that the hashtag data itself is the problem. Thanks for your help, really appreciate it!!!! :)

panda0881 commented 8 years ago

Hi Jan,

I think I found a way to solve your problem. When I tested the program, the response from the server was 429, which means that the bot has sent too many requests to the server. So I added a 0.5 second delay for each request. So far, the program runs well and I can get more than 100000 media.

But there is another problem: when you get too many loops (about 1000 recursive calls), you may hit the error: maximum recursion depth exceeded. The exact limit depends on your Python settings (the default is 1000). To solve that problem, you may consider changing the recursion into a loop structure.
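To illustrate the loop structure, here is a minimal sketch rather than the repository's implementation. It assumes a hypothetical fetch_page callable that takes a cursor (None for the first page) and returns the 'media' part of the response as a dict with 'nodes' and 'page_info'; the 'code' key on each node is also an assumption.

```python
import time

def collect_media_codes(fetch_page, delay=0.5, sleep=time.sleep):
    """Walk all pages with a while loop instead of recursive calls.

    fetch_page(cursor) is hypothetical and is assumed to return
    {'nodes': [{'code': ...}, ...],
     'page_info': {'has_next_page': bool, 'end_cursor': str}}.
    """
    codes = []
    cursor = None  # None means "start at the first page"
    while True:
        media = fetch_page(cursor)
        codes.extend(node['code'] for node in media['nodes'])
        page_info = media['page_info']
        if not page_info['has_next_page']:
            break
        cursor = page_info['end_cursor']
        sleep(delay)  # pause between requests to stay under the rate limit
    return codes
```

Because each page is handled in the same stack frame, the number of pages no longer counts against the recursion limit.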

Btw, I used to analyze brand image, and a data size over 10k may cause problems with computation power.

Best Regards, Hongming

Thicool commented 8 years ago

Thanks for your patience and help. However, the current code does not seem to work for me. After a small amount of data (sometimes a few hundred, sometimes around a thousand) the script stops.

unbenannt

I have no idea what is going wrong now.

Best Regards, Jan

panda0881 commented 8 years ago

Hi Jan,

In my humble view, your problem here may be that your bot is detected by the Instagram server. There are two ways to solve this problem: 1. change the time delay according to your own network situation and actual purpose; 2. change your IP address from time to time.

PS: the 0.5 second delay works fine for my situation, but you may need to change that according to your own situation.
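One way to avoid hand-tuning the delay is to increase it only when the server actually answers 429. The following is a sketch under assumptions, not what the spider does: get_with_backoff is a hypothetical helper, and get is any callable that behaves like requests.Session.get (it returns an object with a status_code attribute).

```python
import time

def get_with_backoff(get, url, base_delay=0.5, max_delay=60.0,
                     max_tries=8, sleep=time.sleep):
    """Retry a GET while the server answers 429, doubling the wait each time."""
    delay = base_delay
    response = None
    for _ in range(max_tries):
        response = get(url)
        if response.status_code != 429:
            return response
        sleep(delay)                       # back off before the next attempt
        delay = min(delay * 2, max_delay)  # 0.5, 1, 2, ... capped at max_delay
    return response  # still 429 after max_tries; the caller decides what to do
```

With a real session this could be called as get_with_backoff(s.get, url).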

For the second solution, you may want to check the following website: How to avoid HTTP error 429 (Too Many Requests) python

Best Regards, Hongming

Thicool commented 8 years ago

Hi Hongming,

I created a looping version with sleep and it worked out more or less fine. At a high number (ran this two times now) I got a type error at around 30k. I have no idea how this can happen: error. If you have no idea either, can you maybe send me the 100k #mcdonalds file you created? I don't know how to get this information any other way, as there is no adequate script to be found on this page. If it is bot detection again, I will now try to increase the sleeping time to 3 seconds.

Edit: I tried to save the end_cursor to continue when there is a problem at that point. However, if I start with the last end_cursor, the program immediately stops again. Edit: It seems that these end_cursor strings change over time, so this makes no sense.

Thanks so much Jan

panda0881 commented 8 years ago

Hi Jan,

I was busy with my midterm exams, sorry about that. I tried to collect the 100k file for you, but the program finished collecting data at 53k. I have attached the data. I will try again later; if I have any success, I will let you know.

Btw, it is a list stored in JSON format

Best Regards, Hongming

panda0881 commented 8 years ago

Hi Jan,

I tried one more time and the program stops at the same position, which is 53156. So my guess is that the total number of media under this hashtag is 53156, and all the others may be deleted or stored on some other ancient storage machine and can't be accessed easily.

Btw, when the program stops, there is no error. it just shows that there is no next page.

Best Regards, Hongming

Thicool commented 8 years ago

Hey Hongming,

My biggest success was 50k as well. I am trying around with VPN and IP reset stuff, but no success. But if I keep repeating this over the next month, I think the data will be okay for what I am doing. Thanks for all your help, and if you have a breakthrough, let me know :)

Good luck on your exams and Cheers from Germany

Thicool commented 7 years ago

Hi Hongming,

I have been doing pretty well crawling Instagram, and my master's thesis about "brands on Instagram" is nearly done. My professor even wants me to do my PhD on this stream of research. Again, big thanks for your help in achieving this!!!!

However, for some days now, accessing the Instagram feed has not worked for me anymore. I have tried different HTTP libraries for Python like requests, httplib2 and urllib2, but I always get a 404 error when I try to get data from "https://www.instagram.com/explore/tags/porsche/" or any other hashtag. Other URLs from Instagram work fine, so I guess they changed something in the URL of their feed. I am confused because I can access the URL in my browser but Python cannot find it. I don't want to bother you, but do you have any idea what they changed so that I cannot access their data anymore?

Many thanks for your advice and all the best,

Jan

panda0881 commented 7 years ago

Hi Jan,

I just tried to reproduce your problem. I can't get information from that page before I log in, but once I am logged in, I can successfully get data from that page. Maybe you can try the login function first. If you still have any problem, let me know~

Hongming

Thicool commented 7 years ago

Hi, thanks for your answer. That's what I figured out as well now: they changed it so that you now need to log in to see "recent uploads". So now I am stuck with the login function. I created a dummy account for crawling but somehow get this error:

M = InstagramSpider()
M.login('userbrand10001', 'passwortbrand10001')

--> InvalidHeader: Header value 1 must be of type str or bytes, not <type 'int'>

Some months ago, I had no problem using the login function. Any ideas? See attachment for full error message.

Thank you so much! Definitely need to mention you if the paper gets published :)

cheers jan

InvalidHeader Traceback (most recent call last)

in ()
      1 M = InstagramSpider()
----> 2 M.login('userbrand10001', 'passwortbrand10001')

in login(self, username, password)
     53 }
     54 data = {'username': username, 'password': password}
---> 55 self.s.post('https://www.instagram.com/accounts/login/ajax/', data=data, headers=headers)
     56
     57 def get_user_data(self, name):

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in post(self, url, data, json, **kwargs)
    533 """
    534
--> 535 return self.request('POST', url, data=data, json=json, **kwargs)
    536
    537 def put(self, url, data=None, **kwargs):

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    472 hooks = hooks,
    473 )
--> 474 prep = self.prepare_request(req)
    475
    476 proxies = proxies or {}

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in prepare_request(self, request)
    405 auth=merge_setting(auth, self.auth),
    406 cookies=merged_cookies,
--> 407 hooks=merge_hooks(request.hooks, self.hooks),
    408 )
    409 return p

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.pyc in prepare(self, method, url, headers, files, data, params, auth, cookies, hooks, json)
    301 self.prepare_method(method)
    302 self.prepare_url(url, params)
--> 303 self.prepare_headers(headers)
    304 self.prepare_cookies(cookies)
    305 self.prepare_body(data, files, json)

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.pyc in prepare_headers(self, headers)
    441 for header in headers.items():
    442 # Raise exception on invalid header value.
--> 443 check_header_validity(header)
    444 name, value = header
    445 self.headers[to_native_string(name)] = value

C:\Users\Jankl\Anaconda2\lib\site-packages\requests\utils.pyc in check_header_validity(header)
    794 except TypeError:
    795 raise InvalidHeader("Header value %s must be of type str or bytes, "
--> 796 "not %s" % (value, type(value)))
    797
    798

InvalidHeader: Header value 1 must be of type str or bytes, not <type 'int'>

panda0881 commented 7 years ago

Hi Jan,

I think I just fixed the problem. In the headers, the value of x-instagram-ajax should be the string '1' rather than the integer 1. Thanks for the bug report haha.
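For anyone hitting the same InvalidHeader error with other headers, a small sketch of a general workaround (this helper is hypothetical and not part of the repository) is to convert every header value to a string before passing the dict to requests:

```python
def stringify_header_values(headers):
    """Return a copy of headers in which every value is str or bytes."""
    return {
        name: value if isinstance(value, (str, bytes)) else str(value)
        for name, value in headers.items()
    }

# The integer 1 becomes the string '1', which requests accepts.
headers = stringify_header_values({'x-instagram-ajax': 1})
print(headers)
```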

You can pull it again and try it, if you have any other questions, let me know~

Good luck on your project, I'm glad that this little project helps you.

Hongming

Thicool commented 7 years ago

Login works fine now! I will implement it in the rest of my bot later and see if I can get my crawler back to work :D Thank you so much!

Thicool commented 7 years ago

Hey Hongming,

I have another problem occurring with the functions that use cookies. Here is an example error message:

Traceback (most recent call last):
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 450, in <module>
    data = M.get_media_from_tag('droetker')
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 335, in get_media_from_tag
    self.collect_media_list(tag_name, data['media']['page_info']['end_cursor'])
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 313, in collect_media_list
    result = tmp_result.json()
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.py", line 866, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\__init__.py", line 501, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I already applied the "x-instagram-ajax should be '1' rather than 1" change, because the error that I described above appeared here too. If you think this is a simple problem to solve, I would be very happy if you could help me with this one. If it is difficult, I need to look for other solutions.

Thanks for your help!

Btw: we have submitted our first paper to a marketing journal. If it gets published someday, I'll let you know :)

Big Thanks and kind regards,

Jan

Thicool commented 7 years ago

I found out that I get a <Response [405]> from the request in line 119, so no data was received:

if not end_cursor:
    data = 'q=ig_user(' + user_id + \
           ')+%7B%0A++followed_by.first(10)+%7B%0A++++count%2C%0A++++page_info+%7B%0A++++++end_cursor%2C%0A+' \
           '+++++has_next_page%0A++++%7D%2C%0A++++nodes+%7B%0A++++++id%2C%0A++++++is_verified%2C%0A++++++fol' \
           'lowed_by_viewer%2C%0A++++++requested_by_viewer%2C%0A++++++full_name%2C%0A++++++profile_pic_url%2' \
           'C%0A++++++username%0A++++%7D%0A++%7D%0A%7D%0A&ref=relationships%3A%3Afollow_list'
    result = self.s.post('https://www.instagram.com/query/', data=data, headers=headers)
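The JSONDecodeError above is what you get when .json() is called on a body that is not JSON, which is likely for an error response such as a 405. A minimal sketch of a guard follows; parse_json_response is a hypothetical helper, and response is assumed to behave like a requests.Response.

```python
def parse_json_response(response):
    """Return the decoded JSON body, or None if the response is unusable.

    response is assumed to expose status_code and a json() method that
    raises ValueError (the base class of the JSON decode errors) when the
    body is not valid JSON.
    """
    if response.status_code != 200:
        print("Unexpected status code: %s" % response.status_code)
        return None
    try:
        return response.json()
    except ValueError:
        print("Response body is not valid JSON")
        return None
```

Checking the status code first makes the real cause (here, the 405) visible instead of a decoding error further down the stack.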

panda0881 commented 7 years ago

Dear Jan,

Sorry for the late reply. I'm currently out of my office. I will look into the code tomorrow and see how I can help. As you know, Instagram may change their backend from time to time, so we may need to change our system from time to time too haha.

Btw, congratulations on the paper^_^

Best regards, Hongming
