kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.29k stars 618 forks

The method to use multi cookies #621

Open JJery-web opened 2 years ago

JJery-web commented 2 years ago

To make the scraping more stable, I registered extra accounts with my other email addresses and use their cookies to scrape. But I find these new accounts get banned (30 days). I have tried using different computers and cycling through the cookie files, like `from itertools import cycle; cookies = cycle(["cookies.json", "mycookies4.json", "mycookies.json"])`.

But my accounts still get banned for 30 days very easily. Does anyone know what the problem is?
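For reference, a minimal sketch of the cycling approach described above, assuming each scraping session pulls the next cookie file from the cycle and hands it to get_posts via the cookies parameter (scrape_page and page_name are illustrative names, not part of the library):

```python
from itertools import cycle
from facebook_scraper import get_posts

# The same three cookie files mentioned above
cookie_files = cycle(["cookies.json", "mycookies4.json", "mycookies.json"])

def scrape_page(page_name):
    cookie_file = next(cookie_files)  # rotate to the next account's cookies
    for post in get_posts(page_name, pages=None, cookies=cookie_file):
        yield post
```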

neon-ninja commented 2 years ago

You're scraping too much, too fast. Try adding some time.sleep calls.
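A minimal sketch of that suggestion, assuming the simplest placement for the delay is inside the post loop (the page name, the 5-second value, and the print call are placeholders):

```python
import time
from facebook_scraper import get_posts

for post in get_posts("somepage", pages=None, cookies="cookies.json"):
    print(post["post_id"])  # do whatever processing you need here
    time.sleep(5)           # pause between posts to slow the overall crawl rate
```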

JJery-web commented 2 years ago

You're scraping too much, too fast. Try adding some time.sleep calls.

Thanks. So what would be a reasonable frequency? (For scraping every post from every user.)

neon-ninja commented 2 years ago

Less than 10K posts per day should be safe
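One way to enforce a cap like that, sketched with itertools.islice (the page name, cookie file, and exact limit are placeholders, not library settings):

```python
from itertools import islice
from facebook_scraper import get_posts

DAILY_LIMIT = 10_000  # rough daily cap suggested above

# islice stops pulling from the generator once DAILY_LIMIT posts have been yielded
for post in islice(get_posts("somepage", pages=None, cookies="cookies.json"), DAILY_LIMIT):
    print(post["post_id"])
```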

JJery-web commented 2 years ago

Less than 10K posts per day should be safe

Thanks. Another question: when I use the package, I get a lot of messy information like this:

output: [datetime.datetime(2018, 2, 20, 8, 36, 36), 'https://facebook.com/387254604621318/posts/2032170880129674', 'Wow! Beautiful!\nAmanda Akers\nFebruary 19, 2018 at 6:49 PM ·\nI am happy to have this feather done for the Turkey Federation! They wanted a painting of the Martin\'s Mill Bridge with turkeys, so this is what they ended up with.\nLike\n1\nLike\nShow more reactions\nComment\nShare\nLoading...\nTry Again\nCancel\nLoading...\nLoading...\nMPageLoadClientMetrics.logFirstPaint(true);requireLazy(["HasteSupportData"],function(m){m.handle({"gkxData":{"5241":{"result":true,"hash":"AT7o1bCQPGpf3ShEbEk"},"676920":{"result":false,"hash":"AT497IX4gOFG8gZe4Qg"},"708253":{"result":false,"hash":"AT5n4hBL3YTMnQWtRYY"},"996940":{"result":false,"hash":"AT7opYuEGy3sjG1a3U8"},"1263340":{"result":false,"hash":"AT5bwizWgDaFQudmG1M"}}})});requireLazy(["TimeSliceImpl","ServerJS"],function(TimeSlice,ServerJS){(new ServerJS()).handle({"define":[["cr:696703",[],{"rc":[null,"Aa3vMoNXnT-az5dfGQ7KUWhKrAwbWXEKNvAh6QGKbl-PH1SGzZb2LoRiO9TrQkZ7vZp7SC2QOfL5EcPgGIQfPhq68nQ"]},-1],["cr:717822",["TimeSliceImpl"],{"rc":["TimeSliceImpl","Aa3vMoNXnT-az5dfGQ7KUWhKrAwbWXEKNvAh6QGKbl-PH1SGzZb2LoRiO9TrQkZ7vZp7SC2QOfL5EcPgGIQfPhq68nQ"]},-1],["CometPersistQueryParams",[],{"relative":{},"domain":{}},6231],["BigPipeExperiments",[],{"link_images_to_pagelets":false,"enable_bigpipe_plugins":false},907],["BootloaderConfig",[],{"deferBootloads":false,"jsRetries":[200,500],"jsRetryAbortNum":2,"jsRetryAbortTime":5,"silentDups":false,"hypStep4":false,"phdOn":false,"btCutoffIndex":583},329],["CSSLoaderConfig",[],{"timeout":5000,"modulePrefix":"BLCSS:","loadEventSupported":true},619],["CurrentCommunityInitialData",[],{},490],["CurrentUserInitialData",[],{"ACCOUNT_ID":"100076598870082","USER_ID":"100076598870082","NAME":"Na Zhang","SHORT_NAME":"Na","IS_BUSINESS_PERSON_ACCOUNT":false,"HAS_SECONDARY_BUSINESS_PERSON":false,"IS_FACEBOOK_WORK_ACCOUNT":false,"IS_MESSENGER_ONLY_USER":false,"IS_DEACTIVATED_ALLOWED_ON_MESSENGER":false,"IS_MESSENGER_CALL_GUEST_USER":false,"IS_WORK_MESSENGER_CALL_GUEST_USER":false,"APP_ID":"412378670482","IS_BUSINESS_DOMAIN":false},270],["ErrorDebugHooks",[],{"SnapShotHook":null},185],["ISB",[],{},330],["LSD",[],{"token":"sO6a9Q9gdOqMDOKBJfDV7C"},323],["MRequestConfig",[],{"dtsg":{"token":"AQECxXg0x2gWUBA:5:1641436206","valid_for":86400,"expire":1641525056},

I use the code:

    for post in get_posts(account=link, pages=None, timeout=120, cookies="cookies.json",
                          options={"allow_extra_requests": False, "reactions": False,
                                   "posts_per_page": 300}):
        post_info = []
        post_info.append(post['time'])
        post_info.append(post["post_url"])
        post_info.append(post['text'])
        post_info.append(post['likes'])
        post_info.append(post['comments'])
        post_info.append(post['shares'])
        post_info.append(post["shared_text"])
        post_info.append(post['link'])
        post_info.append(post['images_lowquality_description'])
        post_info.append(post['video'])
        post_info.append(post['video_duration_seconds'])
        post_info.append(post['video_watches'])
        post_info.append(post['image'])
        post_info.append(post['reactions'])
        print(post_info)  # here the messy information is printed


So I end up with very messy information in the CSV. I don't know how to deal with this.
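For what it's worth, once the text field is clean, writing those same fields straight to CSV with csv.DictWriter tends to be less error-prone than assembling lists and printing them. A minimal sketch under the same options as above (the output filename and page name are placeholders):

```python
import csv
from facebook_scraper import get_posts

link = "somepage"  # the page to scrape, as in the snippet above

FIELDS = ["time", "post_url", "text", "likes", "comments", "shares",
          "shared_text", "link", "images_lowquality_description",
          "video", "video_duration_seconds", "video_watches", "image", "reactions"]

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for post in get_posts(account=link, pages=None, timeout=120, cookies="cookies.json",
                          options={"allow_extra_requests": False, "reactions": False,
                                   "posts_per_page": 300}):
        # missing keys become empty cells instead of raising KeyError
        writer.writerow({k: post.get(k, "") for k in FIELDS})
```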

neon-ninja commented 2 years ago

Try updating lxml with `pip install -U lxml`

JJery-web commented 2 years ago

Try updating lxml with `pip install -U lxml`

Yes. It works. Thank you bro.