Closed kevinzg closed 3 years ago
@neon-ninja do you have any idea about how to handle this?
I tested with v0.2.19, v0.2.20 and master using this code:
from facebook_scraper import get_posts

posts = list(get_posts("longtunman"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")
And these were the results:
v0.2.19: 34 posts, 2 missing text
v0.2.20: 32 posts, 2 missing text
master: 32 posts, 2 missing text
So everything seems to be working fine for me
I ran your test and got similar results. It turns out that when I first tried I used a low page limit, 1 and 2, and it returned 0 posts, that's why I thought it wasn't working.
Anyway, it looks like the new URL isn't skipping any posts, they are just kinda being delayed 2 pages later, so I'll just add a warning to not use low page limit number for now.
Thanks for your input.
Hi Kevin,
I have been testing versions v0.2.19, v0.2.20 and v0.2.21. v0.2.19 has been the best: it returns more results, except for some cases where post_text and shared_text are None. Will future versions behave like v0.2.20 and v0.2.21?
Versions 20+ need to get more pages to get the same number of posts as version 19, that's true. So incrementing the page limit should "fix" it.
Have you found cases where versions 20 and 21 can't reach posts that can be extracted with 19?
Yes. I use a BAT file that runs searches for different accounts in parallel every 5 minutes. The app returns data the first two or three times and then stops. Version 19 always returns data.
In that case, perhaps we should try hitting /posts and fall back to / if it fails
Edit: looks like kevin had the same idea https://github.com/kevinzg/facebook-scraper/commit/2447e703bb7bfb189b78be0cf6790f7b5d902b71
I've merged that commit.
@lgjluis Try with 0.2.23
@kevinzg doesn't look like that's working:
Starting to iterate pages
Requesting page from: https://m.facebook.com/dezsoandraskonyvei/posts/
Exception while requesting URL: https://m.facebook.com/dezsoandraskonyvei/posts/
Exception: HTTPError('404 Client Error: Not Found for url: https://m.facebook.com/dezsoandraskonyvei/posts/')
Traceback (most recent call last):
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 56, in get
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://m.facebook.com/dezsoandraskonyvei/posts/
Traceback (most recent call last):
  File "./test_login.py", line 8, in <module>
    posts = list(get_posts("dezsoandraskonyvei", cookies="cookies.txt"))
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 107, in _generic_get_posts
    for i, page in zip(counter, iter_pages_fn()):
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/page_iterators.py", line 40, in generic_iter_pages
    response = request_fn(next_url)
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 56, in get
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
Looks like if ex.response / bool(Message: <Response [404]>) evaluates to False. Also, the exception doesn't get caught because it's only raised once the generator starts being consumed.
Perhaps something like
start_url = utils.urljoin(FB_MOBILE_BASE_URL, f'/{account}/posts/')
try:
    request_fn(start_url)
except HTTPError as ex:
    start_url = utils.urljoin(FB_MOBILE_BASE_URL, f'/{account}/')
would work better
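To make the two problems described above concrete, here is a minimal, network-free sketch. FakeResponse, FakeHTTPError, fake_request, iter_pages and pick_start_url are hypothetical stand-ins written for this illustration and are not part of facebook-scraper or requests. The only behaviour borrowed from requests is that a Response is falsy when its status is an error (bool(response) is the same as response.ok). The "example" account name is also made up.

```python
class FakeResponse:
    """Stand-in that mimics requests.Response truthiness: bool(r) == r.ok."""

    def __init__(self, status_code):
        self.status_code = status_code

    @property
    def ok(self):
        return self.status_code < 400

    def __bool__(self):
        return self.ok


class FakeHTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError; carries the response."""

    def __init__(self, message, response):
        super().__init__(message)
        self.response = response


def fake_request(url):
    """Hypothetical request function: any URL ending in /posts/ returns 404."""
    if url.endswith("/posts/"):
        raise FakeHTTPError(f"404 Not Found: {url}", FakeResponse(404))
    return f"<html for {url}>"


def iter_pages(start_url, request_fn):
    """Generator: none of this body runs until the first next() call."""
    yield request_fn(start_url)


# Pitfall 1: a 404 response is falsy, so `if ex.response:` never fires.
try:
    fake_request("https://m.facebook.com/example/posts/")
except FakeHTTPError as ex:
    truthy_check = bool(ex.response)
    explicit_check = ex.response is not None and ex.response.status_code == 404

# Pitfall 2: creating the generator raises nothing, so a try/except around
# the call alone cannot catch the error.
caught_at_creation = False
try:
    pages = iter_pages("https://m.facebook.com/example/posts/", fake_request)
except FakeHTTPError:
    caught_at_creation = True

# The error only surfaces once the generator is consumed.
caught_at_iteration = False
try:
    list(pages)
except FakeHTTPError:
    caught_at_iteration = True


def pick_start_url(account, request_fn):
    """Probe /posts/ eagerly and fall back to the bare account URL on error."""
    base = "https://m.facebook.com"
    url = f"{base}/{account}/posts/"
    try:
        request_fn(url)
    except FakeHTTPError:
        url = f"{base}/{account}/"
    return url


print(truthy_check, explicit_check, caught_at_creation, caught_at_iteration)
print(pick_start_url("example", fake_request))
```

Probing eagerly costs one extra request per account, but it moves the failure to a point where a plain try/except can see it.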
My bad, well, let's keep it simple.
@kevinzg @neon-ninja 0.2.23 still returns nothing for the dezsoandraskonyvei page. The exception is handled now, but we still don't get anything back.
>>> list(facebook_scraper.get_posts('dezsoandraskonyvei',pages=6))
[]
@balazssandor What about 0.2.24?
@kevinzg Sorry, I meant 0.2.24, I was trying with that one.
@balazssandor like I said in #172, dezsoandraskonyvei only shows posts if logged in (ie, you need to pass in cookies)
Hi Kevin,
I have tested version 24 and it works fine. Some accounts return None as the text. Here is a link in case you want to check:
https://facebook.com/Diario-Los-Andes-Oficial-1774848076073391
@lgjluis there's lots of posts that have None set for the text - if a user shares an image with no associated text, or shares a link to a group with no associated text, or changes their profile picture or description, or shares a video, or if the post is deleted (perhaps we should have a deleted flag?)
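A rough way to see which of these cases applies is to group the text-less posts by what else they contain. The sketch below is only an illustration: classify_missing_text and the sample dictionaries are made up, the labels are heuristics, and it assumes each post dict exposes "text", "image", "video" and "link" keys.

```python
from collections import Counter


def classify_missing_text(post):
    """Return a heuristic label for a post without text, or None if it has text."""
    if post.get("text"):
        return None
    if post.get("video"):
        return "video only"
    if post.get("image"):
        return "image only"
    if post.get("link"):
        return "link only"
    return "unknown"


# Hypothetical sample data, not real scraper output.
sample_posts = [
    {"text": "Hello", "image": None, "video": None, "link": None},
    {"text": None, "image": "https://example.com/a.jpg", "video": None, "link": None},
    {"text": None, "image": None, "video": "https://example.com/v.mp4", "link": None},
    {"text": None, "image": None, "video": None, "link": None},
]

labels = [classify_missing_text(post) for post in sample_posts]
summary = Counter(label for label in labels if label is not None)
print(dict(summary))
```

Posts that end up under "unknown" are the ones worth inspecting by hand, for example deleted posts or posts whose page required a login.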
Hello @neon-ninja,
I reviewed the cases that returned None for the text, and the cause is "Read more". When a post is very long and has a "Read more" option, the HTML returned is the login page (is this by design?), so the text is None. With the cookies option everything works great.
Hello @neon-ninja and @kevinzg,
Is there a risk of account blocking when using cookies? I got a Facebook message saying that my account has been temporarily blocked.
@lgjluis yes, sooner or later Facebook wants you to log in. I have seen that too when scraping larger quantities of posts, but like it says, it is only temporary. Everything seems to reset within an hour or so.
We removed the posts/ suffix of the initial URL (#170) because some pages were having problems with it (#148, #162). But there are some pages that don't work without it, e.g. longtunman. So we need a way to handle both kinds of pages.