masalha-alaa closed this issue 3 years ago
This isn't how Facebook works, AFAIK; there's no way to jump to a given year on a page or profile. That said, it is possible to temporarily suspend a process: on Linux you can press Ctrl+Z, then run fg to resume.
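The same suspend/resume can also be driven programmatically with signals (a sketch using only the standard library; SIGSTOP/SIGCONT are the signals behind Ctrl+Z and fg, and sleep 60 is just a stand-in for the scraper script):

```python
import os
import signal
import subprocess
import time

# Start a long-running child process (stand-in for the scraper script).
proc = subprocess.Popen(["sleep", "60"])

# Suspend it, roughly what pressing Ctrl+Z does in a terminal.
os.kill(proc.pid, signal.SIGSTOP)
time.sleep(0.1)
print("suspended")

# Resume it, roughly what running fg does.
os.kill(proc.pid, signal.SIGCONT)
time.sleep(0.1)
print("resumed")

proc.terminate()
proc.wait()
```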
I see, Ok thank you.
From what I understand, page_iterators is based on a chain of GET requests whose URLs come from "ref:"(page_content_list_view..." links (for photos it's "href:"(/photos/pandora/..."). So I believe we can do this by passing a start_url. In this example, the URL used is from the 4th scroll-page, so it doesn't seem like Facebook keeps track of any session info.
lennoxho@MSI:~$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> user_agent = (
... "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
... "AppleWebKit/537.36 (KHTML, like Gecko) "
... "Chrome/76.0.3809.87 Safari/537.36"
... )
>>> default_headers = {
... 'User-Agent': user_agent,
... 'Accept-Language': 'en-US,en;q=0.5',
... }
>>> r = requests.get("https://m.facebook.com/page_content_list_view/more/?page_id=568665409957112&start_cursor=%7B%22timeline_cursor%22%3A%22AQHRJ5ugrsDzGW2mnm1KDiGp2ljavJEyiJrZeOxFxJSptwUNpoEKfPyhMjojewXRy9P63-PTYyqope4WnRsmR4eHf0DYRCNUg0Tgn8jcHC5u8vNT1bnv7Z2kBruaTw1tkRtl%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=timeline")
>>> r
<Response [200]>
>>>
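As an aside, the start_cursor query parameter in that URL is just URL-encoded JSON; decoding it with the standard library shows the fields involved (the cursor value below is truncated, not a real one):

```python
import json
from urllib.parse import parse_qs, urlsplit

# Same URL shape as above, with the opaque cursor value truncated.
url = ("https://m.facebook.com/page_content_list_view/more/"
       "?page_id=568665409957112"
       "&start_cursor=%7B%22timeline_cursor%22%3A%22AQHR...%22%2C"
       "%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D"
       "&num_to_fetch=4&surface_type=timeline")

# parse_qs percent-decodes the value; what's left is plain JSON.
cursor = json.loads(parse_qs(urlsplit(url).query)["start_cursor"][0])
print(sorted(cursor))           # ['has_next_page', 'timeline_cursor', 'timeline_section_cursor']
print(cursor["has_next_page"])  # True
```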
Thanks for the suggestion - this commit should make it possible: https://github.com/kevinzg/facebook-scraper/commit/717d522e131cb9bae9ce3442216cadb5d9b82e21. Pagination URLs are already logged if you enable debug logging.
Sample usage:
cursor = "https://m.facebook.com/page_content_list_view/more/?page_id=119240841493711&start_cursor=%7B%22timeline_cursor%22%3A%22AQHRApGcI2yB96Ifenm0PzUD00GDt8XvxrgMmynUQ7W0t7FmP3NsglzmWoxdJ1lwWPQxuKk9E3DFnkSTCvMSWo4t_jYTxn72f4FVIhWkqSswYQS5HuxB66aBpkWO1nPeRSwS%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=posts_tab"
print(next(get_posts("Nintendo", cookies="cookies.txt", pages=None, timeout=60, start_url=cursor, options={"allow_extra_requests": False}))["time"])
should output 2020-07-30 15:00:00
That's great, thanks!
To avoid logging everything, I added a callback argument via **kwargs, and I call it in generic_iter_pages() every time a URL is processed:
def generic_iter_pages(start_url, page_parser_cls, request_fn: RequestFunction, **kwargs) -> Iterator[Page]:
    next_url = start_url
    request_url_callback = kwargs.get('request_url_callback')  # <==
    while next_url:
        if request_url_callback:  # <==
            request_url_callback(next_url)  # <==
        # ...
Sounds like a good addition - would you like to submit a pull request?
Oh definitely! That'd be my first pull request ever, so I'd really appreciate it! I hope I won't break anything, though.
I assume this was implemented in 717d522e131cb9bae9ce3442216cadb5d9b82e21, but I'm still not sure how to use it. I understand that I can pass in start_url, but how do I go about saving the last page URL without logging everything?
You have to log the last URL. See #291 for logging only the URL without additional info. Usage example:
def request_url_callback(url):
    # Here I log every URL request, but you can overwrite your log
    # instead of appending, to avoid a large log file.
    global log_file
    print_log(f'Requesting page from:\n{url}', log_file)

for post in get_posts('verge', pages=None, request_url_callback=request_url_callback):
    # ...
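Putting the two pieces together, one way to make a run resumable is to checkpoint each pagination URL to disk and feed it back as start_url on the next run. This is a sketch: get_posts, start_url, and request_url_callback are the names from this thread, while the checkpoint file handling is my own.

```python
import os

CHECKPOINT = "last_page_url.txt"

def save_url(url):
    # Overwrite the checkpoint on every request so the file stays tiny.
    with open(CHECKPOINT, "w") as fh:
        fh.write(url)

def load_url():
    # Return the saved pagination URL, or None on a fresh start.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as fh:
            return fh.read().strip() or None
    return None

# On the next run, resume from the saved URL (if any):
# for post in get_posts('verge', pages=None,
#                       start_url=load_url(),
#                       request_url_callback=save_url):
#     ...
```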
Great, thanks a lot!
Hi,
First of all, this is a really great tool, so thank you very much; your work is appreciated! My question is: is there a way to continue from where I left off? I.e., I'd like to run my script for some time, stop it, and the next day run it again from the post I had reached, instead of going over all the posts I've already scraped. For example, a parameter for a post time interval, so I can ask to pull posts only starting from a certain point in time?