kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Can I continue from where I left off? #287

Closed: masalha-alaa closed this issue 3 years ago

masalha-alaa commented 3 years ago

Hi,

First of all, this is a really great tool, so thank you very much; your work is appreciated! My question is whether there's a way to continue from where I left off. That is, I'd like to run my script for some time, stop it, and the next day run it again from the post I had reached, instead of going over all the posts I've already scraped. For example, a parameter for a post time interval, so I can ask to pull posts only starting from a certain point in time?

neon-ninja commented 3 years ago

This isn't how Facebook works AFAIK; there's no way to jump to a given year on a page or profile. That said, it is possible to temporarily suspend a process: on Linux you can press Ctrl-Z, then run fg to resume.

masalha-alaa commented 3 years ago

I see, Ok thank you.

lennoxho commented 3 years ago

From what I understand, page_iterators is based on a chain of "href":"/page_content_list_view/..." GET requests. (For photos it's "href":"/photos/pandora/...".)

So, I believe we can do this:

  1. Dump/return the last request URL before exiting (on any Exception/KeyboardInterrupt)
  2. Extend the get_posts() API to take a start_url parameter.

In the example below, the URL used is from the 4th scroll-page, yet the request still succeeds, so it doesn't seem like Facebook keeps track of any session info.

lennoxho@MSI:~$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> user_agent = (
...         "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
...         "AppleWebKit/537.36 (KHTML, like Gecko) "
...         "Chrome/76.0.3809.87 Safari/537.36"
...     )
>>> default_headers = {
...         'User-Agent': user_agent,
...         'Accept-Language': 'en-US,en;q=0.5',
...     }
>>> r = requests.get("https://m.facebook.com/page_content_list_view/more/?page_id=568665409957112&start_cursor=%7B%22timeline_cursor%22%3A%22AQHRJ5ugrsDzGW2mnm1KDiGp2ljavJEyiJrZeOxFxJSptwUNpoEKfPyhMjojewXRy9P63-PTYyqope4WnRsmR4eHf0DYRCNUg0Tgn8jcHC5u8vNT1bnv7Z2kBruaTw1tkRtl%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=timeline")
>>> r
<Response [200]>
>>>
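
To make that concrete, here's a rough sketch of both steps against the raw endpoint, in the spirit of the session above. handle_page and extract_next_url are hypothetical helpers (the real parsing lives in the library), and resume_url.txt is just an arbitrary file name:

import requests

default_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/76.0.3809.87 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.5",
}

def fetch_from(start_url):
    # Step 2: start the pagination chain from an arbitrary cursor URL.
    url = start_url
    try:
        while url:
            response = requests.get(url, headers=default_headers)
            handle_page(response.text)             # hypothetical: parse/store posts
            url = extract_next_url(response.text)  # hypothetical: find the next cursor URL
    except KeyboardInterrupt:
        # Step 1: dump the last request URL before exiting, so the next
        # run can call fetch_from() with it to resume.
        with open("resume_url.txt", "w") as f:
            f.write(url or "")
        raise
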
neon-ninja commented 3 years ago

Thanks for the suggestion - this commit should make it possible: https://github.com/kevinzg/facebook-scraper/commit/717d522e131cb9bae9ce3442216cadb5d9b82e21. Pagination URLs are already logged if you enable debug logging.
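
For reference, a minimal way to surface those debug logs with the standard logging module (nothing library-specific here; the logger name just follows the package's module naming):

import logging

# Route facebook_scraper's debug output, including pagination URLs, to stderr
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
logging.getLogger("facebook_scraper").setLevel(logging.DEBUG)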

Sample usage:

from facebook_scraper import get_posts

cursor = "https://m.facebook.com/page_content_list_view/more/?page_id=119240841493711&start_cursor=%7B%22timeline_cursor%22%3A%22AQHRApGcI2yB96Ifenm0PzUD00GDt8XvxrgMmynUQ7W0t7FmP3NsglzmWoxdJ1lwWPQxuKk9E3DFnkSTCvMSWo4t_jYTxn72f4FVIhWkqSswYQS5HuxB66aBpkWO1nPeRSwS%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=posts_tab"
print(next(get_posts("Nintendo", cookies="cookies.txt", pages=None, timeout=60, start_url=cursor, options={"allow_extra_requests": False}))["time"])

which should output 2020-07-30 15:00:00.

masalha-alaa commented 3 years ago

That's great, thanks! To avoid logging everything, I added a callback argument via **kwargs, and I call it in generic_iter_pages() every time a URL is processed:

def generic_iter_pages(start_url, page_parser_cls, request_fn: RequestFunction, **kwargs) -> Iterator[Page]:
    next_url = start_url

    request_url_callback = kwargs.get('request_url_callback')  # <==
    while next_url:
        if request_url_callback:  # <==
            request_url_callback(next_url)  # <==
        # ...
neon-ninja commented 3 years ago

Sounds like a good addition - would you like to submit a pull request?

masalha-alaa commented 3 years ago

Oh definitely! That'd be my first pull request ever, so I'd really appreciate it! I hope I won't break anything though

bipsen commented 3 years ago

I assume this was implemented in 717d522e131cb9bae9ce3442216cadb5d9b82e21, but I'm still not sure how to use it. I understand that I can pass in start_url, but how do I go about saving the last page URL without logging everything?

masalha-alaa commented 3 years ago

You have to log the last URL. See #291 for logging only the URL without additional info. Usage example:

from facebook_scraper import get_posts

def request_url_callback(url):
    # Here I log every URL request; you could overwrite your log instead of
    # appending, to avoid a large log file.
    global log_file
    print_log(f'Requesting page from:\n{url}', log_file)  # print_log/log_file are my own helpers

for post in get_posts('verge', pages=None, request_url_callback=request_url_callback):
    # ...
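
And to put the callback and start_url together, here's a sketch of a save-and-resume loop (the cursor file name is arbitrary):

from facebook_scraper import get_posts

CURSOR_FILE = "last_url.txt"  # arbitrary file name

def save_url(url):
    # Overwrite on every request, so the file always holds the latest cursor
    with open(CURSOR_FILE, "w") as f:
        f.write(url)

# On a later run, read the saved cursor back and resume from it
try:
    with open(CURSOR_FILE) as f:
        start_url = f.read().strip() or None
except FileNotFoundError:
    start_url = None  # first run: start from the beginning

for post in get_posts('verge', pages=None, start_url=start_url,
                      request_url_callback=save_url):
    ...  # process the post
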
bipsen commented 3 years ago

Great, thanks a lot!