kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Inexact friends number #454

Open gdn0101 opened 3 years ago

gdn0101 commented 3 years ago

When I compare the number of friends returned for a given profile with the output of other tools, the script seems to return only a fraction of the results.

neon-ninja commented 3 years ago

What other tools?

gdn0101 commented 3 years ago

> What other tools?

For instance, the dumpitblue+ extension from the Chrome Web Store. In some cases I found a difference of -10 out of about 800 total; in others, -55 out of about 83 total.

neon-ninja commented 3 years ago

The difference is probably that dumpitblue+ uses the desktop version of Facebook (facebook.com), whereas this scraper uses the mobile version (m.facebook.com).

gdn0101 commented 3 years ago

Just took a closer look. In your code you use data-sigil='undoable-action' and the h3 class as identifiers for the elements. When inspecting the friends list, I found that the number of friends returned by the mobile variant is correct, but the identifiers only find a subset of the results. The undoable-action identifier omits the elements that don't have the "add friend" button, and the h3 identifier doesn't find all the elements. Maybe it would be a good idea to try the "_5pxa" div class and funnel down with the "_5pxc" div class?
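A minimal sketch of the mismatch described above, using only the standard library. The markup in SAMPLE is hypothetical and heavily simplified (real m.facebook.com pages differ); it only illustrates how a selector tied to the "add friend" wrapper can miss rows that still contain a name link.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup: two friend rows, only one of which
# carries the "add friend" wrapper. Real Facebook markup differs.
SAMPLE = """
<div class="timeline"><div>
  <div data-sigil="undoable-action"><h3><a href="/alice">Alice</a></h3></div>
  <div><h3><a href="/bob">Bob</a></h3></div>
</div></div>
"""


class Counter(HTMLParser):
    """Count rows matched by the sigil attribute vs. name links inside h3."""

    def __init__(self):
        super().__init__()
        self.sigil_rows = 0   # divs with data-sigil="undoable-action"
        self.name_links = 0   # <a> elements nested inside an <h3>
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("data-sigil") == "undoable-action":
            self.sigil_rows += 1
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self.name_links += 1

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False


counter = Counter()
counter.feed(SAMPLE)
print(counter.sigil_rows, counter.name_links)
```

In this sample the sigil-based count is lower than the number of name links, which is the kind of shortfall reported in this issue.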

jocejocejoe commented 2 years ago

FWIW, I have fixed this problem in the following way, but my Python is terrible and this may be a hack, so I'm not proposing it as a merge request. Rather than looking for undoable-action, I look for div[class="timeline"] and descend 2 levels of divs from there, then filter/vet what's found.

diff --git a/facebook_scraper/facebook_scraper.py b/facebook_scraper/facebook_scraper.py
index 6f07834..d1d1573 100644
--- a/facebook_scraper/facebook_scraper.py
+++ b/facebook_scraper/facebook_scraper.py
@@ -154,18 +154,30 @@ class FacebookScraper:
         while friend_url:
             logger.debug(f"Requesting page from: {friend_url}")
             response = self.get(friend_url)
-            elems = response.html.find('div[data-sigil="undoable-action"]')
+            elems = response.html.find('div[class="timeline"] > div > div')
             logger.debug(f"Found {len(elems)} friends")
             for elem in elems:
                 name = elem.find("h3>a", first=True)
-                tagline = elem.find("div.notice.ellipsis", first=True).text
+                if not name:
+                    continue
+                # Tagline
+                tagline = elem.find("span.fcg", first=True)
+                if tagline:
+                    tagline = tagline.text
+                else:
+                    tagline = ""
+                # Profile Picture
                 profile_picture = elem.find("i.profpic", first=True).attrs.get("style")
                 match = re.search(r"url\('(.+)'\)", profile_picture)
                 if match:
                     profile_picture = utils.decode_css_url(match.groups()[0])
-                user_id = json.loads(
-                    elem.find("a.touchable[data-store]", first=True).attrs["data-store"]
-                ).get("id")
+                # User ID if present, not present if no "add friend"
+                user_id= elem.find("a.touchable[data-store]", first=True)
+                if user_id:
+                    user_id = json.loads(user_id.attrs["data-store"]).get("id")
+                else:
+                    user_id = ""
+
                 friend = {
                     "id": user_id,
                     "link": name.attrs.get("href"),
jocejocejoe commented 2 years ago

As a side note, adding a random/long-ish sleep before getting more friends seems like it helps with the temp ban / throttling. This does slow it down a lot though. But not as much as a temp ban does.

             more = re.search(r'm_more_friends",href:"([^"]+)"', response.text)
             if more:
+                time.sleep(randrange(100)/10)
                 friend_url = utils.urljoin(FB_MOBILE_BASE_URL, more.group(1))
neon-ninja commented 2 years ago

This looks good to me, and I would have accepted it as a pull request. I would have preferred a pull request to a git patch, but I've committed it and attributed you in https://github.com/kevinzg/facebook-scraper/commit/7aaf33d0d5adfd3ab8c356963e7f9c4b7433fc25, plus one minor tweak in https://github.com/kevinzg/facebook-scraper/commit/8fb79bed8bd5b2696d7d618f66d03030f9d713c2.

As for adding sleeps, the length of the sleep should be configurable by the user. For larger friend extraction jobs, users can iterate through the get_friends generator and sleep to their preference. Like so: https://github.com/kevinzg/facebook-scraper/issues/382#issuecomment-874369929
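A rough sketch of that caller-side approach, not the code from the linked comment. The helper name `throttled` and its parameters are hypothetical; it wraps any generator and pauses for a random, user-chosen interval before pulling the next item.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    min_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one at a time, sleeping a random interval after each.

    The sleep runs when the consumer asks for the next item, so it happens
    before the wrapped generator is advanced (and before any request that
    advancing it might trigger). One trailing sleep also occurs after the
    last item.
    """
    for item in items:
        yield item
        sleep(random.uniform(min_delay, max_delay))


# Demonstration with a fake sleep so nothing actually blocks.
delays = []
result = list(throttled([1, 2, 3], 0.5, 1.5, sleep=delays.append))
print(result, len(delays))

# Hypothetical usage with the library (not run here; assumes get_friends
# takes an account identifier as its first argument):
#
#   from facebook_scraper import get_friends
#   for friend in throttled(get_friends("some_account"), 2, 8):
#       print(friend)
```

Keeping the delay outside the scraper lets each user trade speed against the risk of a temporary ban without changing the library's defaults.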

jocejocejoe commented 2 years ago

Ok, thanks! Having just wandered into this project I really didn't know if it was fit for a pull req or just a rough hack. Oh, and the h1 thing is interesting. I may have stumbled across that as well but wasn't sure.