Some pages need `/posts/` to be part of the URL

kevinzg commented 3 years ago

We removed the posts/ suffix from the initial URL (#170) because some pages were having problems with it (#148, #162).

But there are some pages that don't work without it, e.g. longtunman.

So we need a way to handle both kinds of pages.

kevinzg commented 3 years ago

@neon-ninja do you have any idea about how to handle this?

neon-ninja commented 3 years ago

I tested with v0.2.19, v0.2.20 and master using this code:

posts = list(get_posts("longtunman"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")

And these were the results:

v0.2.19: 34 posts, 2 missing text
v0.2.20: 32 posts, 2 missing text
master: 32 posts, 2 missing text

So everything seems to be working fine for me

kevinzg commented 3 years ago

I ran your test and got similar results. It turns out that when I first tried, I used a low page limit (1 and 2) and it returned 0 posts; that's why I thought it wasn't working.

Anyway, it looks like the new URL isn't skipping any posts; they are just delayed by about 2 pages, so I'll just add a warning against using a low page limit for now.
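
To illustrate why a low page limit can return nothing, here is a toy model, not the library's actual code. It assumes the page limit is applied by zipping a counter with the page iterator (as the `zip(counter, iter_pages_fn())` line in the library suggests) and that the first two pages carry no posts. `collect_posts` and `simulated_pages` are made-up names and data.

```python
from itertools import count


def collect_posts(pages, page_limit):
    """Gather posts from at most `page_limit` pages (toy model, not library code)."""
    # None means "no limit": count() never runs out, so zip stops with the pages.
    counter = range(page_limit) if page_limit is not None else count()
    posts = []
    for _, page in zip(counter, pages):
        posts.extend(page)
    return posts


# Hypothetical page contents: the first two pages are empty.
simulated_pages = [[], [], ["post A", "post B"], ["post C"]]

few = collect_posts(simulated_pages, page_limit=2)
more = collect_posts(simulated_pages, page_limit=4)
print(len(few), len(more))
```

With a limit of 2 the loop stops before it reaches any page that has posts, while a limit of 4 picks up everything.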

Thanks for your input.

lgjluis commented 3 years ago

Hi Kevin,

I have been testing versions v0.2.19, v0.2.20 and v0.2.21. The best version has been v0.2.19, which brings more results, except for some cases where post_text and shared_text return None. Will future versions behave like v0.2.20 or v0.2.21?

kevinzg commented 3 years ago

Versions 20+ need to get more pages to get the same number of posts as version 19, that's true. So incrementing the page limit should "fix" it.

Have you found cases where versions 20 and 21 can't reach posts that can be extracted with 19?

lgjluis commented 3 years ago

Yes. I use a BAT file with different accounts to run searches every 5 minutes in parallel. The app returns information two or three times, but then it stops. Version 19 always returns results.

neon-ninja commented 3 years ago

In that case, perhaps we should try hitting /posts/ first and fall back to / if it fails.

Edit: looks like kevin had the same idea https://github.com/kevinzg/facebook-scraper/commit/2447e703bb7bfb189b78be0cf6790f7b5d902b71

kevinzg commented 3 years ago

I've merged that commit.

@lgjluis Try with 0.2.23

neon-ninja commented 3 years ago

@kevinzg doesn't look like that's working:

Starting to iterate pages
Requesting page from: https://m.facebook.com/dezsoandraskonyvei/posts/
Exception while requesting URL: https://m.facebook.com/dezsoandraskonyvei/posts/
Exception: HTTPError('404 Client Error: Not Found for url: https://m.facebook.com/dezsoandraskonyvei/posts/')
Traceback (most recent call last):
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 56, in get
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://m.facebook.com/dezsoandraskonyvei/posts/
Traceback (most recent call last):
  File "./test_login.py", line 8, in <module>
    posts = list(get_posts("dezsoandraskonyvei", cookies="cookies.txt"))
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 107, in _generic_get_posts
    for i, page in zip(counter, iter_pages_fn()):
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/page_iterators.py", line 40, in generic_iter_pages
    response = request_fn(next_url)
  File "/home/nyou045/git/facebook-scraper/facebook_scraper/facebook_scraper.py", line 56, in get
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

Looks like `if ex.response` evaluates to False, since `bool(<Response [404]>)` is False for an error response. Also, the exception doesn't get caught, as it's only raised when the generator starts being consumed, not when it's created. Perhaps something like

start_url = utils.urljoin(FB_MOBILE_BASE_URL, f'/{account}/posts/')
try:
    request_fn(start_url)
except HTTPError as ex:
    start_url = utils.urljoin(FB_MOBILE_BASE_URL, f'/{account}/')

would work better
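
Here is a self-contained sketch of both points, using stand-ins rather than the real library: `FakeHTTPError`, `fake_request`, `iter_pages` and `pick_start_url` are hypothetical names, and the fake request function simply assumes that only the URL without /posts/ succeeds. It shows that wrapping the creation of a generator in try/except catches nothing, and that probing the URL eagerly lets the fallback take effect.

```python
class FakeHTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError (hypothetical)."""


def fake_request(url):
    """Hypothetical request function: URLs ending in /posts/ return a 404."""
    if url.endswith("/posts/"):
        raise FakeHTTPError(f"404 Client Error: Not Found for url: {url}")
    return f"<html for {url}>"


def iter_pages(start_url, request_fn):
    """Generator: nothing in this body runs until the first next() call."""
    yield request_fn(start_url)


def pick_start_url(account, request_fn, base="https://m.facebook.com"):
    """Probe /posts/ eagerly and fall back to the bare account URL on failure."""
    start_url = f"{base}/{account}/posts/"
    try:
        request_fn(start_url)
    except FakeHTTPError:
        start_url = f"{base}/{account}/"
    return start_url


# 1. Creating the generator does not raise, so this try/except catches nothing.
try:
    pages = iter_pages("https://m.facebook.com/some-account/posts/", fake_request)
    raised_on_creation = False
except FakeHTTPError:
    raised_on_creation = True

# 2. The error only surfaces once the generator is consumed.
try:
    next(pages)
    raised_on_consumption = False
except FakeHTTPError:
    raised_on_consumption = True

# 3. Probing eagerly picks the working URL before iteration starts.
start_url = pick_start_url("some-account", fake_request)
first_page = next(iter_pages(start_url, fake_request))
```

The trade-off is one extra request per account for the probe, unless the probe response is reused as the first page.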

kevinzg commented 3 years ago

My bad, well, let's keep it simple.

balazssandor commented 3 years ago

@kevinzg @neon-ninja 0.2.23 still returns nothing for the dezsoandraskonyvei page. The exception is handled now, but we still don't get anything back.

>>> list(facebook_scraper.get_posts('dezsoandraskonyvei',pages=6))
[]

kevinzg commented 3 years ago

@balazssandor What about 0.2.24?

balazssandor commented 3 years ago

> @balazssandor What about 0.2.24?

@kevinzg Sorry, I meant 0.2.24, I was trying with that one.

neon-ninja commented 3 years ago

@balazssandor like I said in #172, dezsoandraskonyvei only shows posts if you're logged in (i.e., you need to pass in cookies)

lgjluis commented 3 years ago

Hi Kevin,

I have tested version 24 and it works fine. There are some accounts that return None as the text. Here is a link in case you want to check.

https://facebook.com/Diario-Los-Andes-Oficial-1774848076073391

neon-ninja commented 3 years ago

@lgjluis there are lots of posts that have None set for the text: if a user shares an image with no associated text, or shares a link to a group with no associated text, or changes their profile picture or description, or shares a video, or if the post is deleted (perhaps we should have a deleted flag?)
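
If the textless posts get in the way, one option is to separate them after scraping. A minimal sketch, assuming each post is a dict with a 'text' key as in the test snippet earlier in this thread; `split_by_text` is a hypothetical helper and the sample posts are made-up placeholders, not real scraper output.

```python
def split_by_text(posts):
    """Split posts into those with text and those whose text is None or empty."""
    with_text = [post for post in posts if post.get("text")]
    without_text = [post for post in posts if not post.get("text")]
    return with_text, without_text


# Made-up placeholder posts for illustration only.
sample_posts = [
    {"post_id": "1", "text": "Hello"},
    {"post_id": "2", "text": None},
    {"post_id": "3", "text": ""},
]

with_text, without_text = split_by_text(sample_posts)
print(f"{len(with_text)} with text, {len(without_text)} missing text")
```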

lgjluis commented 3 years ago

Hello @neon-ninja,

I reviewed the cases that returned None in the text, and the cause is "Read more". When a post is very long and has the "Read more" option, the request returns the login page HTML (is this by design?), so the text is None. With the cookies option, everything works great.

lgjluis commented 3 years ago

Hello @neon-ninja and @kevinzg,

Is there a risk of the account being blocked when using cookies? I got a Facebook message saying that my account has been temporarily blocked.

neon-ninja commented 3 years ago

@lgjluis yes, sooner or later Facebook wants you to log in. I have seen that too when scraping larger quantities of posts, but like it says, it is only temporary. Everything seems to reset within an hour or so.

kevinzg / facebook-scraper

Some pages need `/posts/` to be part of the URL #176