kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

CLI feature enhancements #498

Open wgicio opened 3 years ago

wgicio commented 3 years ago

I have a few ideas that should make people's lives easier when scraping:

1) It would be great to have a CLI option to stop scraping once the posts reach a certain date. This would be useful if someone wants to scrape the last week's worth of posts.

2) The ability to start scraping from a specific starting point: either a date, or the point where a previous scrape stopped due to a temporary block by Facebook.

3) A way to specify what needs to be scraped, e.g. --HQimages 1 --LQimages 0 --videos 0 --reactions 0 --factcheck 0

4) Save the debug log alongside the output file for each scrape. For example, facebook-scraper --filename nintendo_page_posts.csv --verbose would also save a file called nintendo_page_posts.log

5) An extra column in the saved data with the group / page ID would also be nice, for identifying the source when combining multiple scrapes.

6) Only save the record if a keyword is present in the post text.

Thanks!

neon-ninja commented 3 years ago

Give latest master a try. I added a few things:

  1. I added an argument for limiting to the last n days, like so:
    -d DAYS_LIMIT, --days-limit DAYS_LIMIT
                        Number of days to download
  2. I added an argument for storing the last pagination URL to a file, and to resume from that file:
    -rf RESUME_FILE, --resume-file RESUME_FILE
                        Filename to store the last pagination URL in, for resuming
  3. I added an argument for specifying a comma separated list of keys:
    -k KEYS, --keys KEYS  Comma separated list of which keys or columns to return. This lets you filter to just your desired outputs.
  4. You can redirect stderr to a file to achieve this. In bash, you can say 2>error.log
  5. I added a new column called page_id
  6. I added a new argument:
    -m MATCHING, --matching MATCHING
                        Filter to just posts matching string
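
To make the interaction between the days limit, the key list, and the matching filter concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `filter_posts` is a hypothetical helper, the sample posts are made up, and whether the real tool skips old posts or stops at the first one is not stated in this thread.

```python
import re
from datetime import datetime, timedelta


def filter_posts(posts, days_limit=None, keys=None, matching=None, now=None):
    """Hypothetical helper: yield posts that pass the filters, one at a time."""
    now = now or datetime.now()
    # Only compute a cutoff when a limit was actually given.
    cutoff = now - timedelta(days=days_limit) if days_limit is not None else None
    for post in posts:
        if cutoff is not None and post["time"] < cutoff:
            continue  # older than the requested number of days
        text = post.get("text") or ""
        if matching and not re.search(matching, text, flags=re.IGNORECASE):
            continue  # text does not contain the requested string
        # Keep only the requested columns, in the requested order.
        yield {key: post.get(key) for key in keys} if keys else post


# Made-up sample data, only for illustration.
now = datetime(2021, 10, 1, 12, 0, 0)
sample_posts = [
    {"post_id": "1", "time": datetime(2021, 9, 30, 9, 0, 2), "text": "Metroid Dread is out", "likes": 341},
    {"post_id": "2", "time": datetime(2021, 9, 29, 9, 0, 0), "text": "Mario Kart news", "likes": 10},
    {"post_id": "3", "time": datetime(2021, 9, 1, 9, 0, 0), "text": "Old Metroid post", "likes": 5},
]

result = list(
    filter_posts(sample_posts, days_limit=3, keys=["post_id", "likes"], matching="Metroid", now=now)
)
print(result)
```

Only the first sample post is both recent enough and mentions Metroid, so it is the only one yielded, reduced to the two requested keys.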

Putting it all together:

facebook-scraper Nintendo -vvv -f - -fmt json -d 3 -ner -rf resume -k post_id,time,page_id,likes -m Metroid 2>error.log

outputs:

[
{
    "post_id": "4549796628438088",
    "time": "2021-09-30 09:00:02",
    "likes": 341,
    "page_id": "119240841493711"
},
]

Also, results now stream to the file as the scraping is running, instead of collecting everything and then writing everything at once.
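
The difference between streaming and collect-then-write can be sketched as follows. This is an assumption-laden illustration, not the project's code: `stream_posts_to_csv` and `fake_scrape` are hypothetical names, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def stream_posts_to_csv(posts, output, keys):
    """Hypothetical helper: write each post as soon as it is produced."""
    # extrasaction="ignore" drops any post fields that are not in `keys`.
    writer = csv.DictWriter(output, fieldnames=keys, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for post in posts:
        writer.writerow(post)
        output.flush()  # rows already written survive a later interruption
        count += 1
    return count


def fake_scrape():
    """Stand-in for a scraper that produces posts lazily (made-up data)."""
    yield {"post_id": "1", "likes": 341, "text": "first"}
    yield {"post_id": "2", "likes": 10, "text": "second"}


buffer = io.StringIO()
written = stream_posts_to_csv(fake_scrape(), buffer, ["post_id", "likes"])
lines = buffer.getvalue().splitlines()
print(written, lines)
```

Because rows are written inside the loop, a scrape that is cut short by a temporary block still leaves the rows collected so far in the file.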

wgicio commented 3 years ago

You are amazing!! thank you so much neon-ninja!!!

wgicio commented 3 years ago

Minor bug when running without the -d argument

$ facebook-scraper 739201996218855 --filename 739201996218855-test2.csv --cookies fbcookie.txt --group --verbose
Traceback (most recent call last):
  File "/home/macca/.local/bin/facebook-scraper", line 8, in <module>
    sys.exit(run())
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__main__.py", line 109, in run
    write_posts_to_csv(
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__init__.py", line 318, in write_posts_to_csv
    max_post_time = datetime.now() - timedelta(days=days_limit)
TypeError: unsupported type for timedelta days component: NoneType

neon-ninja commented 3 years ago

https://github.com/kevinzg/facebook-scraper/commit/87eb4e4861f1a3451e4ad5b049eda82ddd109dbf should fix that
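
For readers following along, the traceback boils down to calling `timedelta(days=None)`. A guard along these lines avoids it; this is only a sketch of the general idea, `post_time_cutoff` is a hypothetical name, and the linked commit may well fix it differently.

```python
from datetime import datetime, timedelta


def post_time_cutoff(days_limit, now=None):
    """Hypothetical helper: return the oldest allowed post time, or None for no limit."""
    if days_limit is None:
        return None  # no -d argument given, so do not restrict by date
    now = now or datetime.now()
    return now - timedelta(days=days_limit)


# Reproduce the failure from the traceback in isolation.
try:
    timedelta(days=None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

no_limit = post_time_cutoff(None)
three_days = post_time_cutoff(3, now=datetime(2021, 10, 1, 12, 0, 0))
print(no_limit, three_days)
```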

wgicio commented 3 years ago

Would it be possible for the -m MATCHING argument to accept a comma separated list?

neon-ninja commented 3 years ago

Perhaps it would be better to take a regular expression? Are you familiar with regex?

wgicio commented 3 years ago

A little, but it would be easier for the average user to use a comma separated list. My use case is that I only want to get posts with contact details, and phone number formats vary: -m whatsapp,wa.me,6282,082,6285,0852 etc.

neon-ninja commented 3 years ago

It would be pipe separated if regex. Like -m 'whatsapp|wa.me|6282|082|6285|0852'
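
A quick way to check such a pattern before running a scrape is to try it with Python's `re` module. The sketch below assumes case-insensitive `re.search` semantics (the traceback later in this thread shows `flags=re.IGNORECASE` being used); the sample texts are made up. One detail worth knowing: an unescaped `.` matches any character, so `wa\.me` is the stricter spelling.

```python
import re

# Pipe-separated alternatives; the dot in wa.me is escaped to match a literal dot.
pattern = r"whatsapp|wa\.me|6282|082|6285|0852"


def has_contact_details(text):
    """Return True if any alternative occurs anywhere in the text."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(has_contact_details("Contact me on WhatsApp"))    # matched by "whatsapp", ignoring case
print(has_contact_details("Call 0852-0000-0000"))       # matched by "0852" (made-up number)
print(has_contact_details("No contact details here"))

# The unescaped dot also accepts other characters in that position.
loose = re.search(r"wa.me", "wa-me") is not None
strict = re.search(r"wa\.me", "wa-me") is not None
print(loose, strict)
```

Note that alternatives such as `082` match as substrings, so they will also hit any longer number that happens to contain those digits.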

wgicio commented 3 years ago

perfect!

wgicio commented 3 years ago

An option to exclude keywords would also be great

neon-ninja commented 3 years ago

regex supports that sort of thing with negative lookaheads
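
As an illustration of the idea, a negative lookahead anchored at the start of the text can reject any post that contains an unwanted phrase. This is a generic regex sketch, not something taken from the project; the phrases and sample texts are examples only.

```python
import re

# (?! ... ) fails the match if the enclosed pattern can match from this position.
# Anchoring at ^ and prefixing .* makes it scan the whole text for the phrases.
exclude_pattern = r"^(?!.*(?:for sale|leasehold))"


def is_allowed(text):
    """Return True when none of the excluded phrases appear in the text."""
    # DOTALL lets .* cross line breaks so later lines are checked too.
    return re.search(exclude_pattern, text, flags=re.IGNORECASE | re.DOTALL) is not None


print(is_allowed("Room for rent near the beach"))
print(is_allowed("Villa FOR SALE in Bali"))
print(is_allowed("Room for rent\nLeasehold also available"))
```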

neon-ninja commented 3 years ago

try https://github.com/kevinzg/facebook-scraper/commit/fcded53abbf504038331a924db2b1ffe7c1a0677

wgicio commented 3 years ago

Looks like it's working :) How do you combine the word list with a negative lookahead? 'wa.me|whatsapp|Whats App' |(?!for sale|leasehold)

neon-ninja commented 3 years ago

Looks like combining a matching expression with a negative lookahead becomes a bit complicated. I've added a new argument in https://github.com/kevinzg/facebook-scraper/commit/323bacc63ee2cf387c237819989b426efa029aea:

  -nm NOT_MATCHING, --not-matching NOT_MATCHING
                        Filter to just posts not matching regex expression
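
The sketch below shows why two separate patterns are easier to work with than one combined expression. `keep_post` is a hypothetical function that mimics what a matching / not-matching pair of filters might do; it is not the code from the commit. The combined regex is included only for comparison.

```python
import re


def keep_post(text, matching=None, not_matching=None):
    """Hypothetical filter: keep text that hits `matching` and avoids `not_matching`."""
    text = text or ""
    if matching and not re.search(matching, text, flags=re.IGNORECASE):
        return False
    if not_matching and re.search(not_matching, text, flags=re.IGNORECASE):
        return False
    return True


include = r"wa\.me|whatsapp|whats app"
exclude = r"for sale|leasehold"

# Single-expression equivalent: a leading negative lookahead, then the word list.
combined = r"^(?!.*(?:for sale|leasehold)).*(?:wa\.me|whatsapp|whats app)"

samples = [
    "Room for rent, WhatsApp me",      # wanted
    "Villa for sale, Whats App me",    # has an excluded phrase
    "Room for rent, call the office",  # no contact keyword
]

two_pattern_results = [keep_post(text, include, exclude) for text in samples]
combined_results = [
    re.search(combined, text, flags=re.IGNORECASE | re.DOTALL) is not None for text in samples
]
print(two_pattern_results, combined_results)
```

Both approaches agree on these samples, but the two-pattern form lets each list be edited without touching lookahead syntax.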

wgicio commented 3 years ago

Hi, I'm getting a TypeError in the keyword search function even though no argument for it is passed:

$ facebook-scraper 294118640941955 --filename test.csv --cookies fbcookie.txt --group -d 14 --sleep 1 --k post_id,text,time,image,images,link,user_id,username,image_id,image_ids,page_id --verbose
/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py:473: UserWarning: Locale detected as en_GB - for best results, set to en_US
  warnings.warn(f"Locale detected as {locale} - for best results, set to en_US")
[1512860725734401] Extract method extract_video didn't return anything
[1512860725734401] Extract method extract_video_thumbnail didn't return anything
[1512860725734401] Extract method extract_video_id didn't return anything
[1512860725734401] Extract method extract_video_meta didn't return anything
[1512860725734401] Extract method extract_factcheck didn't return anything
[1512860725734401] Extract method extract_share_information didn't return anything
[1512860725734401] Extract method extract_listing didn't return anything
[1650790328608106] Extract method extract_video didn't return anything
[1650790328608106] Extract method extract_video_thumbnail didn't return anything
[1650790328608106] Extract method extract_video_id didn't return anything
[1650790328608106] Extract method extract_video_meta didn't return anything
[1650790328608106] Extract method extract_factcheck didn't return anything
[1650790328608106] Extract method extract_share_information didn't return anything
[1650790328608106] Extract method extract_listing didn't return anything
[1651279148559224] Extract method extract_video didn't return anything
[1651279148559224] Extract method extract_video_thumbnail didn't return anything
[1651279148559224] Extract method extract_video_id didn't return anything
[1651279148559224] Extract method extract_video_meta didn't return anything
[1651279148559224] Extract method extract_factcheck didn't return anything
[1651279148559224] Extract method extract_share_information didn't return anything
[1651279148559224] Extract method extract_listing didn't return anything
[None] Extract method extract_post_url didn't return anything
[1651279148559224] Extract method extract_text didn't return anything
[1651279148559224] Extract method extract_username didn't return anything
[1651279148559224] Extract method extract_video didn't return anything
[1651279148559224] Extract method extract_video_thumbnail didn't return anything
[1651279148559224] Extract method extract_video_id didn't return anything
[1651279148559224] Extract method extract_video_meta didn't return anything
[1651279148559224] Exception while running extract_is_live: IndexError('list index out of range')
[1651279148559224] Extract method extract_factcheck didn't return anything
[1651279148559224] Extract method extract_share_information didn't return anything
[1651279148559224] Extract method extract_listing didn't return anything
[1651279148559224] Extract method extract_with didn't return anything
Traceback (most recent call last):
  File "/home/macca/.local/lib/python3.8/site-packages/facebook_scraper/__init__.py", line 366, in write_posts_to_csv
    match = re.search(kwargs.get("matching"), post["text"], flags=re.IGNORECASE)
  File "/usr/lib/python3.8/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

neon-ninja commented 3 years ago

Try https://github.com/kevinzg/facebook-scraper/commit/f8343b04c7869a13d4767f4ca8e7f15e02f475f8
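
The error message in that traceback is the one `re.search` raises when the string being searched is `None`, which is consistent with the log line saying `extract_text` returned nothing for the last post. A defensive check could look like the sketch below. `text_matches` is a hypothetical helper, and treating a post without text as a non-match is a design choice made here for illustration; the linked commit may handle it differently.

```python
import re


def text_matches(pattern, text):
    """Hypothetical helper: regex filter that tolerates a missing pattern or text."""
    if not pattern:
        return True  # no filter requested, so every post passes
    if text is None:
        return False  # nothing to search in; treat as "does not match"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Reproduce the failure in isolation: searching None instead of a string.
try:
    re.search("whatsapp", None, flags=re.IGNORECASE)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print(text_matches(None, None))
print(text_matches("whatsapp", None))
print(text_matches("whatsapp", "Message me on WhatsApp"))
```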