wgicio opened this issue 3 years ago
Give latest master a try. I added a few things:
-d DAYS_LIMIT, --days-limit DAYS_LIMIT
    Number of days to download
-rf RESUME_FILE, --resume-file RESUME_FILE
    Filename to store the last pagination URL in, for resuming
-k KEYS, --keys KEYS
    Comma separated list of which keys or columns to return. This lets you filter to just your desired outputs.
2>error.log
    Shell redirection to send the verbose/debug output (stderr) to a file called error.log
page_id
    Now available as a key, for identifying the source page or group
-m MATCHING, --matching MATCHING
    Filter to just posts matching string
Putting it all together:
facebook-scraper Nintendo -vvv -f - -fmt json -d 3 -ner -rf resume -k post_id,time,page_id,likes -m Metroid 2>error.log
outputs:
[
{
"post_id": "4549796628438088",
"time": "2021-09-30 09:00:02",
"likes": 341,
"page_id": "119240841493711"
},
]
Also, results now stream to the file as the scraping is running, instead of collecting everything and then writing everything at once.
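For anyone curious what that looks like from the Python side, here is a minimal sketch of the same idea, assuming you drive get_posts() yourself; the file name and the newline-delimited JSON format are illustrative choices, not what the CLI actually writes:

import json

from facebook_scraper import get_posts

# Write each post out as soon as the generator yields it, instead of
# collecting the whole result set in memory and dumping it at the end.
with open("nintendo_posts.ndjson", "w") as f:
    for post in get_posts("Nintendo", pages=2):
        f.write(json.dumps(post, default=str) + "\n")  # default=str handles datetimes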
You are amazing!! thank you so much neon-ninja!!!
Minor bug when running without the -d argument
$ facebook-scraper 739201996218855 --filename 739201996218855-test2.csv --cookies fbcookie.txt --group --verbose
Traceback (most recent call last):
  File "/home/macca/.local/bin/facebook-scraper", line 8, in <module>
    sys.exit(run())
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__main__.py", line 109, in run
    write_posts_to_csv(
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__init__.py", line 318, in write_posts_to_csv
    max_post_time = datetime.now() - timedelta(days=days_limit)
TypeError: unsupported type for timedelta days component: NoneType
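For anyone who wants to work around this locally before a fix lands, a minimal sketch of the missing guard might look like this (the variable names come from the traceback; the helper function itself is hypothetical):

from datetime import datetime, timedelta

def max_post_time_for(days_limit):
    # Only compute a cutoff when -d/--days-limit was actually supplied;
    # timedelta(days=None) is what raises the TypeError above.
    if days_limit is None:
        return None
    return datetime.now() - timedelta(days=days_limit)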
Would it be possible to have the -m MATCHING argument accept a comma separated list?
Perhaps it would be better to take a regex? Are you familiar with regex?
A little, however it would be easier for the average user to use a comma separated list. My use case is that I only want to get posts with contact details, and phone number formats vary: -m whatsapp,wa.me,6282,082,6285,0852 etc.
It would be pipe separated if regex. Like -m 'whatsapp|wa.me|6282|082|6285|0852'
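For illustration, that pipe-separated pattern is just a plain regex alternation; any one alternative matching anywhere in the text counts as a hit (the sample strings below are made up):

import re

pattern = r"whatsapp|wa\.me|6282|082|6285|0852"  # escaping the dot in wa.me is slightly safer

print(bool(re.search(pattern, "Order via WhatsApp 082123456", flags=re.IGNORECASE)))  # True
print(bool(re.search(pattern, "New store opening next week", flags=re.IGNORECASE)))   # False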
perfect!
An option to also exclude keywords would be great
regex supports that sort of thing with negative lookaheads
Looks like it's working :) How do you combine the word list with a negative lookahead? 'wa.me|whatsapp|Whats App' |(?!for sale|leasehold)
Looks like combining a matching expression with a negative lookahead becomes a bit complicated. I've added a new argument in https://github.com/kevinzg/facebook-scraper/commit/323bacc63ee2cf387c237819989b426efa029aea:
-nm NOT_MATCHING, --not-matching NOT_MATCHING
Filter to just posts not matching regex expression
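In case it helps to picture how the two options interact, here is a minimal sketch of the equivalent check using two separate patterns rather than one lookahead-heavy regex (the helper is illustrative, not the code from that commit):

import re

def keep_post(text, matching, not_matching=None):
    # Keep only posts that hit a wanted keyword...
    if not re.search(matching, text, flags=re.IGNORECASE):
        return False
    # ...and drop them again if an excluded keyword also appears.
    if not_matching and re.search(not_matching, text, flags=re.IGNORECASE):
        return False
    return True

print(keep_post("Contact us on WhatsApp", r"wa\.me|whatsapp", r"for sale|leasehold"))          # True
print(keep_post("Leasehold villa, WhatsApp only", r"wa\.me|whatsapp", r"for sale|leasehold"))  # False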
Hi, I'm getting a type error on the keyword search function even though no argument for it gets passed:
$ facebook-scraper 294118640941955 --filename test.csv --cookies fbcookie.txt --group -d 14 --sleep 1 --k post_id,text,time,image,images,link,user_id,username,image_id,image_ids,page_id --verbose
/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py:473: UserWarning: Locale detected as en_GB - for best results, set to en_US
  warnings.warn(f"Locale detected as {locale} - for best results, set to en_US")
[1512860725734401] Extract method extract_video didn't return anything
[1512860725734401] Extract method extract_video_thumbnail didn't return anything
[1512860725734401] Extract method extract_video_id didn't return anything
[1512860725734401] Extract method extract_video_meta didn't return anything
[1512860725734401] Extract method extract_factcheck didn't return anything
[1512860725734401] Extract method extract_share_information didn't return anything
[1512860725734401] Extract method extract_listing didn't return anything
[1650790328608106] Extract method extract_video didn't return anything
[1650790328608106] Extract method extract_video_thumbnail didn't return anything
[1650790328608106] Extract method extract_video_id didn't return anything
[1650790328608106] Extract method extract_video_meta didn't return anything
[1650790328608106] Extract method extract_factcheck didn't return anything
[1650790328608106] Extract method extract_share_information didn't return anything
[1650790328608106] Extract method extract_listing didn't return anything
[1651279148559224] Extract method extract_video didn't return anything
[1651279148559224] Extract method extract_video_thumbnail didn't return anything
[1651279148559224] Extract method extract_video_id didn't return anything
[1651279148559224] Extract method extract_video_meta didn't return anything
[1651279148559224] Extract method extract_factcheck didn't return anything
[1651279148559224] Extract method extract_share_information didn't return anything
[1651279148559224] Extract method extract_listing didn't return anything
[None] Extract method extract_post_url didn't return anything
[1651279148559224] Extract method extract_text didn't return anything
[1651279148559224] Extract method extract_username didn't return anything
[1651279148559224] Extract method extract_video didn't return anything
[1651279148559224] Extract method extract_video_thumbnail didn't return anything
[1651279148559224] Extract method extract_video_id didn't return anything
[1651279148559224] Extract method extract_video_meta didn't return anything
[1651279148559224] Exception while running extract_is_live: IndexError('list index out of range')
[1651279148559224] Extract method extract_factcheck didn't return anything
[1651279148559224] Extract method extract_share_information didn't return anything
[1651279148559224] Extract method extract_listing didn't return anything
[1651279148559224] Extract method extract_with didn't return anything
Traceback (most recent call last):
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__init__.py", line 366, in write_posts_to_csv
    match = re.search(kwargs.get("matching"), post["text"], flags=re.IGNORECASE)
  File "/usr/lib/python3.8/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
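That "expected string or bytes-like object" points at the second argument of re.search: judging by the [None] / extract_text lines above, post["text"] is None for at least one post, so the matching filter receives None where it expects a string. A defensive sketch of the filter step could coerce missing text to an empty string first (the helper name and the default pattern are assumptions, not the library's actual code):

import re

def post_matches(post, matching="."):
    # Treat posts with no extracted text as an empty string so re.search
    # always receives a str instead of None (which raises the TypeError above).
    text = post.get("text") or ""
    return re.search(matching, text, flags=re.IGNORECASE) is not None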
I have a few ideas which should make people's lives easier when scraping
1) It would be great to have a CLI option to stop scraping once the posts reach a certain date. This would be useful if someone wants to scrape the last week's worth of posts
2) The ability to start the scraping from a specific starting point, either a date, or from when a previous scrape stopped due to a temporary block by Facebook
3) A way to specify what needs to be scraped, e.g. --HQimages 1 --LQimages 0 --videos 0 --reactions 0 --factcheck 0
4) Save the debug log alongside the filename for each scrape, e.g. facebook-scraper --filename nintendo_page_posts.csv --verbose would also save a file called nintendo_page_posts.log
5) An extra column in the saved data with the group / page ID would also be nice for identifying the source when combining multiple scrapes
6) Only save the record if a keyword is present in the post text
thanks!