kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Help with Error: <class 'TypeError'>. 'NoneType' object is not subscriptable for some posts #172

Closed joebah-joe closed 3 years ago

joebah-joe commented 3 years ago

Hi - been running the facebook-scraper for a while and noticed that for about 20% of the posts, I would catch this error:

Error: <class 'TypeError'>. 'NoneType' object is not subscriptable

I understand that it's because the scraper is trying to index an object of type None (i.e. the object has no value). However, I couldn't tell the difference between the posts that generate this error and the posts that don't, to the point that it almost seems random. It also seems like some pages have a lot more occurrences of these errors than others.

I have tried changing the number of pages and the word count from 1 to 8,000. The ones that work continue to work. The ones that don't, don't.

Does anyone have the same issue? Does it have something to do with the Python version or Python environment that I'm using? (I am on Python 3.6.) Or am I just making too many requests? (I pull the data from the page every 30 mins.) If so, how were you able to fix it? Thanks!

kevinzg commented 3 years ago

I need more information, can you share the traceback?

joebah-joe commented 3 years ago

I need more information, can you share the traceback?

If I run from the command line (see below), the 'text' and 'post_text' fields are just left blank in the generated .csv file.

facebook-scraper --filename nintendo_page_posts.csv --pages 2 longtunman

However, if I call the scraper from my own code and catch the error with traceback, it looks like this:

Error: <class 'TypeError'>. 'NoneType' object is not subscriptable, line:29

longtunman https://facebook.com/longtunman/posts/1020584865140788
Error: <class 'TypeError'>. 'NoneType' object is not subscriptable, line:29
Full Traceback:
Traceback (most recent call last):
  File "/home/myscraper/mysite/facebookparser_app.py", line 29, in extract_facebook_data
    output = output + "" + (post['text'][:word_count]) + "\n"
TypeError: 'NoneType' object is not subscriptable

Unfortunately, this doesn't trace back to the error generated from your code, only from my code when I call the post function.

As I mentioned, this happens on some posts. Other posts work fine.

EDIT - TLDR: basically, the post text function is giving me back a NoneType object when it is trying to scrape a problematic post, which causes the not subscriptable error when I try to use it. I can skip these posts for now so my code can continue to run and not bug out, but wanted to make sure that returning a NoneType object from the post text command should be expected for certain FB posts (though I have no idea what would trigger it - all the posts look the same to me...), and ideally whether there's some way to still extract the text from these problematic posts...

kevinzg commented 3 years ago

I looked at those posts; the one that doesn't have text is 1022240714975203. Here's the HTML for that post: 1022240714975203.txt. It's a video but it does have some text (behind a "see more" link), so we can use this case to improve the text extraction.

By the way, I noticed that that page (longtunman) doesn't work with the latest version (where the starting URL doesn't end with /posts/, #170), so you might not want to upgrade yet.

joebah-joe commented 3 years ago

Oh I see... so the scraper is unable to get text if there are things like videos, and when that happens it returns a NoneType object?

Actually, I was unable to get text for the following recent posts too:

https://facebook.com/longtunman/posts/1020584865140788

https://facebook.com/longtunman/videos/129094235798751

https://facebook.com/longtunman/posts/1022861631579778

https://facebook.com/longtunman/posts/1022822748250333

https://facebook.com/longtunman/posts/1022793828253225

You were able to extract text for those? I was only able to get text for this post:

https://facebook.com/longtunman/posts/1022746714924603

By the way, I noticed that that page (longtunman) doesn't work with the latest version (where the starting URL doesn't end with /posts/, #170), so you might not want to upgrade yet.

EDIT:

Ah, good to know. I installed the scraper on a pythonanywhere server less than a week ago and am running my scripts off the server. I guess it's version 0.2.19.

I just ran the scraper off my local desktop and it seems to extract these posts just fine. Could I have installed the scraper wrong on pythonanywhere, and that's why I am having all these issues? I remember there were a lot of compatibility issues and I sort of fumbled around until I got it to install...

kevinzg commented 3 years ago

so the scraper is unable to get text if there are things like videos, and when that happens it returns a NoneType object?

Yeah, it's either because of the video or the see more link.

You were able to extract text for those? I was only able to get text for this post:

Yes, the only one without text was the one I mentioned in my previous comment.

I installed the scraper on pythonanywhere server less than a week ago

The version I'm talking about was released yesterday (v0.2.20). To check your version you can run pip freeze.

I ran the older version that I installed locally, and it seems to extract these pages just fine.

If you can find out what version starts to fail that would be really useful.
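For anyone who wants to check the installed version from inside a script rather than reading pip freeze, here is a minimal sketch. It assumes Python 3.8 or newer (where importlib.metadata is in the standard library); installed_version is a hypothetical helper name, not part of facebook-scraper.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # The package is not installed in the environment running this script.
        return None


# Prints the version seen by *this* interpreter, or None if it is missing here.
print(installed_version("facebook-scraper"))
```

A result of None is a hint that the package was installed into a different environment than the one running the script.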

joebah-joe commented 3 years ago

If you can find out what version starts to fail that would be really useful.

Okay, here's my pip freeze output from the pythonanywhere bash console. As mentioned earlier I had a lot of problems installing initially. Perhaps I got some installation stuff wrong...

pip freeze output.txt

kevinzg commented 3 years ago

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

joebah-joe commented 3 years ago

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

Ah, so it turned out I installed facebook-scraper in the general environment and not the virtual environment. Noob mistake. This makes it difficult to find facebook-scraper.

Should I start over and install the scraper in the virtual environment? I remember having the most horrible time dealing with urllib3 version and force install stuff to make it work...

Okay, I created a virtual environment and installed version 0.2.19, and this was the CSV generated when I ran the command line for longtunman. It seems like I am still missing a lot of the text. I attached the CSV and the pip freeze for your reference. longtunman_page_posts.zip pip freeze output.txt
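A quick way to tell whether a script is running inside a virtual environment (and therefore which set of installed packages it sees) is to compare the interpreter prefixes. This is only a sketch; in_virtualenv is a hypothetical helper name, and the sys.real_prefix check is there for older virtualenv releases.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv (and newer virtualenv)
    # make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```

Running this both from the bash console and from the web app shows whether they use the same interpreter.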

joebah-joe commented 3 years ago

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

Sorry - I kept editing my post. Here's the new pip freeze output and the CSV file generated from the command line in the pythonanywhere virtual environment I just set up and installed version 0.2.19 in. Still having issues even with 0.2.19 with the missing text field for most of the posts...

pip freeze output.txt longtunman_page_posts.zip

neon-ninja commented 3 years ago

I just ran the scraper off my local desktop and it seems to extract these posts just fine

It sounds to me like this problem is localised to pythonanywhere. Perhaps as a cloud platform, facebook is serving it different HTML than you would otherwise get running locally. I created a free account on pythonanywhere, and I get the same problem as you when running it there, even though it runs fine for me locally too. When enabling debug logging, I get these errors in the console on pythonanywhere:

Exception while running extract_text: AttributeError("'NoneType' object has no attribute 'find'")

Using this code:

from facebook_scraper import get_posts, enable_logging
import logging
enable_logging(logging.DEBUG)
posts = list(get_posts("longtunman"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")

So, it looks like the scraper hits the "more" URL, then isn't able to find story_body_container in the resulting HTML. Here's a snippet of the resulting HTML:

<div class="hidden_elem"><code id="u_0_y_xn"><!-- <div id="m_story_permalink_view" data-sigil="m-story-view"><div class="_3f50"><div class="_5rgr async_like" data-store="&#123;&quot;linkdata&quot;:&quot;mf_story_key.1022822748250333:top_level_post_id.1022822748250333:tl_objid.1022822748250333:content_owner_id_new.113397052526245:throwback_story_fbid.1022822748250333:page_id.113397052526245:photo_id.1022822441583697:story_location.9:story_attachment_style.photo:tds_flgs.3:ott.AX9zWDJ7Y6F6r0Rn&quot;,&quot;share_id&quot;:1022822748250333,&quot;feedback_target&quot;:1022822748250333,&quot;feedback_source&quot;:8,&quot;action_source&quot;:2,&quot;actor_id&quot;:100022709408081&#125;" data-xt="2.mf_story_key.1022822748250333:top_level_post_id.1022822748250333:tl_objid.1022822748250333:content_owner_id_new.113397052526245:throwback_story_fbid.1022822748250333:page_id.113397052526245:photo_id.1022822441583697:story_location.9:story_attachment_style.photo:tds_flgs.3:ott.AX9zWDJ7Y6F6r0Rn" data-xt-vimp="&#123;&quot;pixel_in_percentage&quot;:0,&quot;duration_in_ms&quot;:1,&quot;subsequent_gap_in_ms&quot;:60000,&quot;log_initial_nonviewable&quot;:false,&quot;should_batch&quot;:true,&quot;require_horizontally_onscreen&quot;:false&#125;" data-ft="&#123;&quot;mf_story_key&quot;:&quot;1022822748250333&quot;,&quot;top_level_post_id&quot;:&quot;1022822748250333&quot;,&quot;tl_objid&quot;:&quot;1022822748250333&quot;,&quot;content_owner_id_new&quot;:&quot;113397052526245&quot;,&quot;throwback_story_fbid&quot;:&quot;1022822748250333&quot;,&quot;page_id&quot;:&quot;113397052526245&quot;,&quot;photo_id&quot;:&quot;1022822441583697&quot;,&quot;story_location&quot;:9,&quot;story_attachment_style&quot;:&quot;photo&quot;,&quot;tds_flgs&quot;:3,&quot;ott&quot;:&quot;AX9zWDJ7Y6F6r0Rn&quot;,&quot;tn&quot;:&quot;-R&quot;&#125;" id="u_0_s_CW" data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"><div class="story_body_container"><header class="_7om2 _1o88 _77kd 
_5qc1"><div class="_5s61 _2pii _5i2i _52wc"><div class="_5xu4"><div class="_67lm _77kc" data-gt="&#123;&quot;tn&quot;:&quot;~&quot;&#125;" 

story_body_container is in there, it's just commented out. Here's a simple reprex showing this behaviour of requests_html:

from requests_html import HTML
h = HTML(html='<body><!-- <div class="story_body_container"></div> --></body>')
element = h.find('.story_body_container', first=True)
print(element) # Returns None

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR https://github.com/kevinzg/facebook-scraper/pull/177
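The same effect can be reproduced with only the standard library, which may help anyone who wants to see it without installing requests_html. The sketch below uses html.parser in place of requests_html (so it is an analogy, not the scraper's code), and ClassFinder, has_class and strip_comment_markers are hypothetical names.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Markup inside <!-- ... --> never reaches this method; the parser
        # hands it to handle_comment as one opaque string instead.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.found = True


def has_class(markup, wanted_class):
    finder = ClassFinder(wanted_class)
    finder.feed(markup)
    finder.close()
    return finder.found


def strip_comment_markers(markup):
    # Same idea as the one-liner above: drop the comment delimiters so the
    # wrapped markup is parsed as ordinary elements.
    return markup.replace("<!--", "").replace("-->", "")


page = '<body><!-- <div class="story_body_container"></div> --></body>'
print(has_class(page, "story_body_container"))                         # False
print(has_class(strip_comment_markers(page), "story_body_container"))  # True
```

Blindly removing the delimiters would also expose any genuine comments as text, which seems acceptable here because only the wrapped markup is of interest.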

kevinzg commented 3 years ago

Thanks for debugging it, @neon-ninja! Hopefully your fix will work.

@joebah-joe forget what I said about the latest version not working, it only happens when the page limit is very low.

joebah-joe commented 3 years ago

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR #177

Great stuff - so it's a pythonanywhere-specific issue!

Dumb question - where do I put this h.html.replace line to fix this issue?

neon-ninja commented 3 years ago

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR #177

Great stuff - so it's a pythonanywhere-specific issue!

Dumb question - where do I put this h.html.replace line to fix this issue?

That PR is merged now, so just update to version 0.2.21

joebah-joe commented 3 years ago

Oh I just saw the latest version. Will update as soon as I get to my terminal in a few hours and will report back on whether it fixed for me or not.

Thanks @neon-ninja and @kevinzg - really appreciate all your help on this!

joebah-joe commented 3 years ago

@neon-ninja and @kevinzg - No luck for me with 0.2.21. This is the error I get from CLI on pythonanywhere:

(myvirtualenv) 04:27 ~/mysite $ facebook-scraper --filename nintendo_page_posts.csv --pages 5 longtunman
Couldn't get any posts.

Actually, this happens to all the FB pages now, not just Longtunman. And this is my pip freeze from the virtual environment where I'm running the script/CLI:

appdirs==1.4.4
beautifulsoup4==4.9.3
bs4==0.0.1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cssselect==1.1.0
dateparser==1.0.0
facebook-scraper==0.2.21
fake-useragent==0.1.11
Flask==1.1.1
idna==2.10
itsdangerous==1.1.0
Jinja2==2.11.3
lxml==4.6.2
MarkupSafe==1.1.1
parse==1.19.0
pyee==8.1.0
pyppeteer==0.2.5
pyquery==1.4.3
python-dateutil==2.8.1
pytz==2021.1
regex==2020.11.13
requests==2.25.1
requests-html==0.10.0
six==1.15.0
soupsieve==2.2
tqdm==4.59.0
tzlocal==2.1
urllib3==1.26.3
w3lib==1.22.0
websockets==8.1
Werkzeug==1.0.1

But I presume that you tested it out in pythonanywhere? I am wondering what I did wrong. I just pip uninstalled facebook-scraper==0.2.19 and then did a pip install facebook-scraper==0.2.21 in the virtual and ran the CLI.

neon-ninja commented 3 years ago

@joebah-joe sooner or later, facebook starts insisting that you log in. Try feeding your cookies in as per https://github.com/kevinzg/facebook-scraper/issues/28#issuecomment-793066983

joebah-joe commented 3 years ago

@joebah-joe sooner or later, facebook starts insisting that you log in. Try feeding your cookies in as per #28 (comment)

Okay, I'll try that. So the problem I'm having with 0.2.21 is because I'm not logged in or feeding the cookies.txt? Is that why I'm getting nothing for any of the pages?

balazssandor commented 3 years ago

@neon-ninja @kevinzg The issue persists on a page like dezsoandraskonyvei, where the URL https://www.facebook.com/dezsoandraskonyvei/posts doesn't exist.

We are getting back 0 results from there with version 0.2.21

>>> list(facebook_scraper.get_posts('dezsoandraskonyvei', pages=3))
[]

joebah-joe commented 3 years ago

@neon-ninja Well, I tried 0.2.21 again and passed in the cookies.txt (Netscape format) from my facebook page as instructed. No luck, just not getting any posts back at all from any page, be it longtunman or nintendo or whatever.

Just for fun, I went back to 0.2.20 with the cookies, and it works (though obviously longtunman still has issues with the NoneType). So there doesn't seem to be anything wrong with my facebook cookies at the very least.

Could you test 0.2.21 in your pythonanywhere environment again? I'm sure we're getting really close to fixing this but maybe I'm just missing something... really appreciate your help with this so far.

kevinzg commented 3 years ago

@balazssandor the problem with that page is that the mobile version https://m.facebook.com/dezsoandraskonyvei doesn't list any posts (with or without the /posts suffix). The mbasic and touch subdomains don't have any posts either, so it seems it can only be scraped from the www one, which the scraper doesn't support.

@joebah-joe I tried with 0.2.21 and it did scrape some posts. I didn't try with a cookie file as I don't have an account so maybe it's that. Consider that Facebook can serve different content based on your IP, your usage, the date, A/B testing, etc. so the best might be to debug it yourself.

Also, if you are logging-in from your local computer, and using those cookies on your server (with a different IP), Facebook might flag it as suspicious activity?

joebah-joe commented 3 years ago

@joebah-joe I tried with 0.2.21 and it did scrape some posts. I didn't try with a cookie file as I don't have an account so maybe it's that. Consider that Facebook can serve different content based on your IP, your usage, the date, A/B testing, etc. so the best might be to debug it yourself.

Also, if you are logging-in from your local computer, and using those cookies on your server (with a different IP), Facebook might flag it as suspicious activity?

I see. I guess at the end of the day it's the pythonanywhere environment that's creating all sorts of problems for me as others don't seem to be getting it, and my local desktop version is running fine. This is a shame, I quite liked using their interface.

One last question then - I can use 0.2.21 without cookies? so .get_posts('longtunman', pages=10) works too?

kevinzg commented 3 years ago

I see. I guess at the end of the day it's the pythonanywhere environment that's creating all sorts of problems for me as others don't seem to be getting it, and my local desktop version is running fine. This is a shame, I quite liked using their interface.

Well, if it works on your desktop and not on the server, and everything else is the same, that's most likely the reason. Of course it's not really pythonanywhere's fault, it could happen with any cloud provider. There are workarounds that you might want to consider like rotating proxies, accounts, user agents, throttling the requests, etc.
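Of the workarounds listed above, throttling is the simplest to try. A minimal sketch, assuming posts are consumed one at a time from an iterator; throttled is a hypothetical helper (not part of facebook-scraper) and the delay values are arbitrary.

```python
import random
import time


def throttled(items, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the very first item
            sleep(random.uniform(min_delay, max_delay))
        yield item


# Hypothetical usage with the scraper (not executed here):
#   for post in throttled(get_posts("longtunman", pages=10)):
#       ...
for number in throttled([1, 2, 3], min_delay=0.0, max_delay=0.1):
    print(number)
```

The sleep parameter is injectable only so the helper can be exercised without real waiting. Whether pausing between posts actually spaces out the underlying page requests depends on how lazily the scraper fetches pages, which is an assumption here.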

One last question then - I can use 0.2.21 without cookies? so .get_posts('longtunman', pages=10) works too?

Yes, cookies are optional.

neon-ninja commented 3 years ago

@joebah-joe on PythonAnywhere, 0.2.21 works fine for me with cookies. Without cookies, I get nothing, and facebook says "You must log in first". Perhaps your cookies aren't working for some reason. Check that your cookies.txt looks something like:

.facebook.com   false   /   true    1678314119  datr    REDACTED
.facebook.com   false   /   true    0   m_pixel_ratio   1
.facebook.com   false   /   true    1623018122  fr  REDACTED
.facebook.com   false   /   true    1678314126  sb  REDACTED
.facebook.com   false   /   true    1646778124  c_user  REDACTED
.facebook.com   false   /   true    1646778124  xs  REDACTED
.facebook.com   false   /   true    1615846927  wd  1284x895

@kevinzg @balazssandor dezsoandraskonyvei only shows posts if logged in.
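The cookies.txt sample above follows the Netscape cookie file layout: seven tab-separated fields per line (domain, subdomain flag, path, secure flag, expiry timestamp, name, value). A small sanity check along these lines can catch a malformed file before it is handed to the scraper. The helper names are hypothetical, and treating c_user and xs as the cookies that carry the login session is an assumption.

```python
REQUIRED_NAMES = {"c_user", "xs"}  # assumption: cookies that carry the login session
HTTP_ONLY_PREFIX = "#HttpOnly_"    # curl-style marker in front of some cookie lines


def parse_netscape_cookies(text):
    """Return {cookie name: expiry timestamp} parsed from cookies.txt content."""
    cookies = {}
    for line in text.splitlines():
        if line.startswith(HTTP_ONLY_PREFIX):
            line = line[len(HTTP_ONLY_PREFIX):]
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment/header line
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
        _domain, _subdomains, _path, _secure, expires, name, _value = fields
        cookies[name] = int(expires)  # 0 conventionally marks a session cookie
    return cookies


def missing_login_cookies(cookies):
    """Names from REQUIRED_NAMES that are absent from the parsed cookies."""
    return REQUIRED_NAMES - set(cookies)


sample = "\n".join([
    "# Netscape HTTP Cookie File",
    "\t".join([".facebook.com", "false", "/", "true", "1646778124", "c_user", "REDACTED"]),
    "\t".join([".facebook.com", "false", "/", "true", "1646778124", "xs", "REDACTED"]),
])
print(missing_login_cookies(parse_netscape_cookies(sample)))  # set()
```

Note that a file whose fields are separated by spaces instead of tabs (easy to end up with after copy-pasting) fails this check.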

joebah-joe commented 3 years ago

on PythonAnywhere, 0.2.21 works fine for me with cookies. Without cookies, I get nothing, and facebook says "You must log in first". Perhaps your cookies aren't working for some reason. Check that your cookies.txt looks something like:

@neon-ninja Okay, so my cookies don't look exactly like that. So I created a new one and oddly enough it managed to pull posts once or twice, but then it stopped doing it.

However, the few times that it successfully pulled data, Longtunman still gave me NoneType for some of the posts.

Do you mind sharing with me the script you used to do your test on pythonanywhere? Maybe if I mess around with that and get to the level where you are at, I might be able to figure out what I've done wrong. Thanks!

neon-ninja commented 3 years ago

@joebah-joe sure:

#!/usr/bin/env python3
from facebook_scraper import get_posts, enable_logging
import logging
enable_logging(logging.DEBUG)
posts = list(get_posts("longtunman", cookies="cookies.txt"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")

joebah-joe commented 3 years ago

Thank you very much, I will play around with this to see if I can get the same result as you.

joebah-joe commented 3 years ago

@neon-ninja @kevinzg Okay, so I was able to see all the posts in Longtunman using the debug code after passing in the cookies. Finally! Now I know the latest problem is with my portion of the Python code that calls the scraper.

I built this function in my code:

def extract_facebook_data(page_name, page_num):

    text_post = ""

    output = ("<PAGE>\n\n<LABEL>" + page_name + "</LABEL>\n\n")
    from facebook_scraper import get_posts
    for post in get_posts(page_name, pages=page_num, cookies="mysite/cookies.txt"):
        output = output + "<POST>\n"
        output = output + "<TEXT>" + (post['text']) + "</TEXT>\n"
        output = output + "<IMG>" + str(post['image']) + "</IMG>\n"
        output = output + "</POST>\n\n"

    output = output + str("</PAGE>\n\n")
    return output

And that's where the trouble starts: I get NoneType from post['text'] and everything blows up.

Have I called the scraper wrong? I am so close to fixing this problem I can smell it!

neon-ninja commented 3 years ago

@joebah-joe It's possible for post['text'] to be None as sometimes people post images or videos with no text attached. So you need to handle that case. You could handle it the same way you handled post['image'] being None - coerce it to a string with the str function - so str(post['text']). Alternatively, if you want to put something else instead of the string "None", you could do something like post['text'] or 'no text found'
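Here is one way the "or" fallback suggested above could look when building the output. It is a sketch that works on plain dictionaries standing in for scraped posts; format_post and format_page are hypothetical helper names, and word_count slices characters, as post['text'][:word_count] did in the traceback earlier in the thread.

```python
def format_post(post, word_count=None, placeholder="no text found"):
    """Render one post dict, tolerating a missing or None 'text' value."""
    text = post.get("text") or placeholder  # None or "" falls back to the placeholder
    if word_count is not None:
        text = text[:word_count]  # slices characters, not words
    image = str(post.get("image"))  # "None" when there is no image
    return "<POST>\n<TEXT>" + text + "</TEXT>\n<IMG>" + image + "</IMG>\n</POST>\n\n"


def format_page(page_name, posts, word_count=None):
    """Render a whole page; posts can be any iterable of post dicts."""
    parts = ["<PAGE>\n\n<LABEL>" + page_name + "</LABEL>\n\n"]
    parts.extend(format_post(post, word_count) for post in posts)
    parts.append("</PAGE>\n\n")
    return "".join(parts)


# Stand-in data; in real use the posts would come from get_posts(...).
demo_posts = [
    {"text": "hello world", "image": None},
    {"text": None, "image": "https://example.com/a.jpg"},
]
print(format_page("longtunman", demo_posts, word_count=5))
```

Collecting the pieces in a list and joining once also avoids the repeated string concatenation of the original loop.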

joebah-joe commented 3 years ago

@neon-ninja Hi - so after some debugging work, I discovered that the get_posts function basically returned nothing.

I then went back to your code with the list function, and checked the variable that was returned. It's a list with a length of zero.

code:

posts = list(get_posts("longtunman", pages=10, cookies="cookies.txt"))
print(posts)
print(type(posts))
print(len(posts))

output:

[]
<class 'list'>
0

However, turning the debugger on (see attached debuggeroutput.txt file), I was able to see that all the posts exist and could have been extracted.

It seems to be a similar issue to 'dezsoandraskonyvei' page but this happens to all the pages I tried and not just 'longtunman'

debuggeroutput.txt

However, when I removed the cookies.txt from the get_posts function, the print(len(posts)) shows 32 items. But when I looked at that debugger log, it was getting 'raw posts' on literally every page - which is to be expected as there were no cookies.

Has the get_posts function somehow bugged out when I pass the cookies as an argument? With cookies it can extract all the data but can't return it. Without cookies, it can return it but gets raw posts. I'm not very good at looking through the debug log so not sure what to look for. Thanks!

neon-ninja commented 3 years ago

Hi @joebah-joe - there must be something specific to your cookies file that is inducing facebook to serve you different HTML, as this works fine for me on PythonAnywhere with cookies. Unfortunately there's no HTML in your debug output. Try adding logger.debug(self.response.html) in the page_iterators.py file around line 80 (https://github.com/kevinzg/facebook-scraper/blob/e2ed95d7a7bc5957fbf6c5af4f21fcc0b86d5a94/facebook_scraper/page_iterators.py#L80). You'll need to git clone the repository in your PythonAnywhere bash terminal to do this.

joebah-joe commented 3 years ago

@neon-ninja Ah ha - you may be right, and that would explain why get_posts is not returning anything for me even though the scraper is working perfectly fine.

I'm a bit new at this, but can't I just add that logger line in /home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/page_iterators.py?

Also, I did notice that my cookies are in a different format from yours. Yours look like this according to your post:

.facebook.com   false   /   true    1678314119  datr    REDACTED
.facebook.com   false   /   true    0   m_pixel_ratio   1
.facebook.com   false   /   true    1623018122  fr  REDACTED
.facebook.com   false   /   true    1678314126  sb  REDACTED
.facebook.com   false   /   true    1646778124  c_user  REDACTED
.facebook.com   false   /   true    1646778124  xs  REDACTED
.facebook.com   false   /   true    1615846927  wd  1284x895

Mine look like this:

.facebook.com   false   /   true    0   datr    REDACTED
.facebook.com   false   /   true    0   wd  REDACTED
.facebook.com   false   /   true    0   sb  REDACTED
.facebook.com   false   /   true    0   c_user  REDACTED
.facebook.com   false   /   true    0   xs  REDACTED
.facebook.com   false   /   true    0   fr  REDACTED
.facebook.com   false   /   true    0   spin    REDACTED

So I have spin, and you have m_pixel_ratio.... would that have been a cause?

joebah-joe commented 3 years ago

@neon-ninja Okay, so I think adding the log line to that file worked. Here's an example of the output (I uploaded the rest in the log.txt)

The page url is: https://m.facebook.com/page_content_list_view/more/?page_id=113397052526245&start_cursor=%7B%22timeline_cursor%22%3A%22AQHR8NaG7VG06f3etA7LQrGJzIYOUqnjr-cDl27yAPeGdEzuPxm_ULbj6peC-U1ldjjDsH0mcmfnOuyt2gHT8efhp6kJvJ4hbEJc_xqisBAXw5P3vEJP8ZXEoIYMUP2TywSC%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=timeline

and the HTML code is:

HTML url='https://m.facebook.com/page_content_list_view/more/?page_id=113397052526245&start_cursor=%7B%22timeline_cursor%22%3A%22AQHR8NaG7VG06f3etA7LQrGJzIYOUqnjr-cDl27yAPeGdEzuPxm_ULbj6peC-U1ldjjDsH0mcmfnOuyt2gHT8efhp6kJvJ4hbEJc_xqisBAXw5P3vEJP8ZXEoIYMUP2TywSC%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=timeline'

I posted a bunch of wrong stuff earlier so forget that iteration (edited it out of this comment).

log.txt

neon-ninja commented 3 years ago

@joebah-joe

can't just add that logger line in /home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/page_iterators.py?

you could, but I don't think that's quite the right path, it would probably have facebook_scraper in it

So I have spin, and you have m_pixel_ratio

If I swap out m_pixel_ratio for spin it still works fine for me, so likely unrelated

There's still no HTML in your last comment, those are URLs. Try logger.debug(self.response.html.html) instead

joebah-joe commented 3 years ago

@neon-ninja

sorry my bad I wrote the path wrong - this was the full path:

/home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/facebook_scraper/page_iterators.py

just to make sure, I put this:

            logger.debug("start of the self response html portion")
            logger.debug(self.response.html)
            logger.debug("end of the self response html portion")

And ran again. This was the result (see log.txt)

If you search for "start of the self response" in the log file, it looks like each time it's only pulling a very small amount of HTML, which is basically just the URL again. I presume yours didn't look like this? It should have been the full HTML page, and maybe that's what's messing up the get_posts function?

log.txt

neon-ninja commented 3 years ago

@joebah-joe self.response.html is a different object (parent class) to self.response.html.html (the actual HTML) - try the latter. Alternatively try raw_page.html
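The difference is easy to see with a stand-in class: logging an object prints its repr, whereas logging its .html attribute prints the markup. FakePage below is a hypothetical stand-in that only mimics this behaviour; it is not the requests_html class.

```python
import logging


class FakePage:
    """Stand-in for a parsed-page object: short repr, markup kept in .html."""

    def __init__(self, url, html):
        self.url = url
        self.html = html

    def __repr__(self):
        return f"<HTML url={self.url!r}>"


logging.basicConfig(level=logging.DEBUG, format="%(message)s")
logger = logging.getLogger("demo")

page = FakePage("https://m.facebook.com/longtunman/",
                '<div class="story_body_container">hi</div>')
logger.debug(page)       # logs the repr, i.e. little more than the URL
logger.debug(page.html)  # logs the markup itself
```

That matches the earlier log, where each entry contained only the URL.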

joebah-joe commented 3 years ago

@neon-ninja okay that works better - you can search for "start of the raw response html portion" in the attached log raw html.txt and you can see that we now have a big block of HTML.

I have no idea if this makes sense or not, but it does seem to contain all the content needed. Perhaps the HTML formatting is off?

log raw html.txt

neon-ninja commented 3 years ago

The scraper is expecting elements like <article class="_55wo _5rgr _5gh8 _3drq async_like", but instead you're getting elements like <div class="_55wo _5rgr _5gh8 _3drq async_like" - it's like your cookies are telling facebook you don't support HTML5.

@kevinzg any ideas?

@joebah-joe in theory changing the selector at https://github.com/kevinzg/facebook-scraper/blob/e2ed95d7a7bc5957fbf6c5af4f21fcc0b86d5a94/facebook_scraper/page_iterators.py#L69 to something like raw_posts = raw_page.find('article[data-ft],div.async_like[data-ft]') should work for you, but I don't have any way of testing that
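To illustrate what the broader selector is meant to match, here is a standard-library sketch that applies the same rule, 'article[data-ft],div.async_like[data-ft]', by hand. It uses html.parser rather than the scraper's own requests_html call, the class names are shortened, and PostTagCollector and find_post_tags are hypothetical names.

```python
from html.parser import HTMLParser


class PostTagCollector(HTMLParser):
    """Collects the names of start tags that look like post containers."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "data-ft" not in attributes:
            return  # both alternatives in the selector require data-ft
        classes = (attributes.get("class") or "").split()
        # Mirrors 'article[data-ft],div.async_like[data-ft]'.
        if tag == "article" or (tag == "div" and "async_like" in classes):
            self.matches.append(tag)


def find_post_tags(markup):
    collector = PostTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.matches


html5_page = '<article class="_55wo async_like" data-ft="{}">post A</article>'
div_page = ('<div class="_55wo async_like" data-ft="{}">post B</div>'
            '<div class="other">not a post</div>')
print(find_post_tags(html5_page))  # ['article']
print(find_post_tags(div_page))    # ['div']
```

Requiring data-ft on the div alternative keeps unrelated div elements from being treated as posts.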

joebah-joe commented 3 years ago

Okay, looks like we narrowed down the problem to page format. I'll also try the code you mention in the iterator...

kevinzg commented 3 years ago

No idea why it is sending divs instead of articles.

Here are some suggestions to get more consistent results:

joebah-joe commented 3 years ago

@kevinzg

Hi - thanks for the feedback. I will try doing those.

Out of curiosity - the debug line we put in page_iterators.py seems to show that your scraper was already able to extract all the content on each page with no problems. However, it still needs the page to be served as HTML5 before it can split the data into individual posts for the get_posts function? Just wondering how the process works. Thanks!

joebah-joe commented 3 years ago

@neon-ninja @kevinzg

Well imagine that, it finally worked!

Replacing raw_posts = raw_page.find('article') with raw_posts = raw_page.find('article[data-ft],div.async_like[data-ft]'), I was finally able to extract the pages from longtunman consistently. Other pages work too!

So it's a bummer that I have to alter the scraper code to fix this, as I'd imagine it won't be helpful to anyone who's not having the same HTML5 problem as me (I seem to be a rare occurrence here!), so this line won't be incorporated. But at least I hope I contributed something to this awesome project. Maybe some unlucky soul will run into the same issue as me and you can just direct them to this post to fix the problem.

Thank you both for helping. There was no way I could have fixed this by myself. Learnt a lot (if you can't tell already, I'm a beginner at the whole Python development thing though I do have some experience coding in other languages). I guess this 'bug' can be closed now.

Thanks again to you both, I know you guys put in a lot of time to help me out so I really appreciate it!

kevinzg commented 3 years ago

@joebah-joe Good to hear that!

Actually, that change would do very little harm, so I think it's worth adding.

About your previous question: the scraper content you see in the log is extracted using a very simple method (print every text node), but at that point it doesn't know what's part of a post and what isn't, or where every post starts and ends; there's no structure, just plain text. To structure the text into post objects/dictionaries the scraper needs to look more carefully into what each node means. For example, one rule was that an article node contains a post; now we know that's not always the case.
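A small standard-library illustration of that difference: dumping every text node gives one flat list, while grouping by container element gives one entry per post. This is a simplified sketch using html.parser, not the scraper's actual extraction code; TextDump and PostGrouper are hypothetical names.

```python
from html.parser import HTMLParser


class TextDump(HTMLParser):
    """Flat dump: every non-empty text node, with no idea of post boundaries."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


class PostGrouper(HTMLParser):
    """Groups text nodes by the top-level <article> element that contains them."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.depth = 0  # greater than 0 while inside an <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                self.posts.append([])  # a new post starts here

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.posts[-1].append(data.strip())


markup = ("<header>Page title</header>"
          "<article><p>First post</p></article>"
          "<article><p>Second</p><p>post</p></article>")

dump = TextDump()
dump.feed(markup)
grouper = PostGrouper()
grouper.feed(markup)
print(dump.chunks)    # ['Page title', 'First post', 'Second', 'post']
print(grouper.posts)  # [['First post'], ['Second', 'post']]
```

If the page wraps posts in div elements instead of article elements, the flat dump still shows all the text while the grouped result comes back empty, which is the symptom seen in this thread.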

joebah-joe commented 3 years ago

Ah, so that's how it works. Thank you sir!

kevinzg commented 3 years ago

The fix has been released in v0.2.23.