snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal

bring back Facebook by scraping m.facebook.com #886

Closed: snarfed closed this issue 5 years ago

snarfed commented 5 years ago

https://m.facebook.com/ is a little-known "lite" version of Facebook's full webapp with no JS and fairly simple HTML. it requires login, specifically c_user and xs cookies, but it's eminently scrapeable. https://facebook-atom.appspot.com/ already scrapes it to generate Atom feeds. apart from how distasteful it is to scrape with login cookies, we could scrape it like Instagram to bring back Facebook backfeed!

...sadly, FB's blocking is better than IG's. i actually implemented the scraping and extracted posts, comments, and likes/reactions, but i haven't been able to fetch users' timelines consistently. after one or two requests, FB consistently starts redirecting requests to /login.php, even with all cookies that m.facebook.com gives me, fully spoofed User-Agent, and fetching from the same IP I logged in from. maybe browser fingerprinting? got me. this is where i stop digging. scraping, ugh.
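For anyone picking this up, here is a minimal sketch of the failure mode described above: send a request with the `c_user` and `xs` cookies, refuse to follow redirects, and treat a redirect to `/login.php` as "session rejected". It is not Bridgy's actual code. The helper names (`cookie_header`, `is_login_redirect`, `fetch`) and the choice of `urllib` are assumptions made for illustration, and the cookie and User-Agent values are whatever you copy from your own browser.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

BASE_URL = "https://m.facebook.com/"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx instead.
        return None


def cookie_header(c_user, xs):
    """Build the Cookie header from the two login cookies named above."""
    return f"c_user={c_user}; xs={xs}"


def is_login_redirect(status, location):
    """True if a response is a redirect whose target path is /login.php."""
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    return urlsplit(location).path.rstrip("/") == "/login.php"


def fetch(path, c_user, xs, user_agent):
    """Hypothetical helper: fetch one page, failing loudly on a login redirect."""
    req = urllib.request.Request(
        BASE_URL + path.lstrip("/"),
        headers={"Cookie": cookie_header(c_user, xs), "User-Agent": user_agent},
    )
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(req, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        if is_login_redirect(e.code, e.headers.get("Location")):
            raise PermissionError("redirected to /login.php; session rejected") from e
        raise
```

Failing fast on the redirect, rather than parsing the login page as if it were a timeline, at least makes the blocking visible. It does nothing to avoid the blocking itself.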

Lewiscowles1986 commented 4 years ago

A way to tell whether https://mbasic.facebook.com is the same service as https://m.facebook.com:

ping

```
lewiscowles@Lewiss-MacBook-Pro torrents % ping mbasic.facebook.com
PING star.c10r.facebook.com (157.240.221.18): 56 data bytes
64 bytes from 157.240.221.18: icmp_seq=0 ttl=55 time=15.373 ms
64 bytes from 157.240.221.18: icmp_seq=1 ttl=55 time=16.342 ms
64 bytes from 157.240.221.18: icmp_seq=2 ttl=55 time=13.923 ms
64 bytes from 157.240.221.18: icmp_seq=3 ttl=55 time=12.656 ms
64 bytes from 157.240.221.18: icmp_seq=4 ttl=55 time=14.350 ms
64 bytes from 157.240.221.18: icmp_seq=5 ttl=55 time=18.655 ms
^C
--- star.c10r.facebook.com ping statistics ---
6 packets transmitted, 6 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 12.656/15.216/18.655/1.919 ms
lewiscowles@Lewiss-MacBook-Pro torrents % ping m.facebook.com
PING star-mini.c10r.facebook.com (157.240.221.35): 56 data bytes
64 bytes from 157.240.221.35: icmp_seq=0 ttl=55 time=15.494 ms
64 bytes from 157.240.221.35: icmp_seq=1 ttl=55 time=15.000 ms
64 bytes from 157.240.221.35: icmp_seq=2 ttl=55 time=11.809 ms
64 bytes from 157.240.221.35: icmp_seq=3 ttl=55 time=11.920 ms
64 bytes from 157.240.221.35: icmp_seq=4 ttl=55 time=11.081 ms
64 bytes from 157.240.221.35: icmp_seq=5 ttl=55 time=20.812 ms
64 bytes from 157.240.221.35: icmp_seq=6 ttl=55 time=12.071 ms
```

traceroute

```
lewiscowles@Lewiss-MacBook-Pro torrents % traceroute m.facebook.com
traceroute to star-mini.c10r.facebook.com (157.240.221.35), 64 hops max, 52 byte packets
...
 6  ae13.pr04.lhr3.tfbnw.net (157.240.65.124)  13.349 ms  23.875 ms  39.767 ms
 7  po131.asw01.lhr3.tfbnw.net (129.134.45.52)  16.781 ms
    po141.asw01.lhr3.tfbnw.net (129.134.45.56)  15.748 ms
    po131.asw02.lhr3.tfbnw.net (129.134.45.54)  21.754 ms
 8  po223.psw01.lhr8.tfbnw.net (129.134.50.143)  13.670 ms
    po243.psw03.lhr8.tfbnw.net (129.134.50.105)  14.629 ms
    po233.psw03.lhr8.tfbnw.net (129.134.50.81)  15.957 ms
 9  157.240.38.215 (157.240.38.215)  15.529 ms
    157.240.38.209 (157.240.38.209)  14.165 ms
    157.240.38.143 (157.240.38.143)  16.325 ms
10  edge-star-mini-shv-01-lhr8.facebook.com (157.240.221.35)  13.596 ms  14.736 ms  19.857 ms
lewiscowles@Lewiss-MacBook-Pro torrents % traceroute mbasic.facebook.com
traceroute to star.c10r.facebook.com (157.240.221.18), 64 hops max, 52 byte packets
...
 6  ae4.pr02.lhr7.tfbnw.net (157.240.66.192)  16.236 ms  19.098 ms  15.590 ms
 7  po121.asw02.lhr3.tfbnw.net (129.134.44.194)  15.165 ms
    po121.asw01.lhr3.tfbnw.net (129.134.44.190)  17.864 ms  18.253 ms
 8  po231.psw04.lhr8.tfbnw.net (129.134.50.89)  18.365 ms
    po213.psw01.lhr8.tfbnw.net (129.134.50.31)  15.419 ms
    po243.psw04.lhr8.tfbnw.net (129.134.50.117)  17.272 ms
 9  157.240.38.125 (157.240.38.125)  19.586 ms
    173.252.67.29 (173.252.67.29)  32.396 ms
    157.240.38.97 (157.240.38.97)  16.277 ms
10  edge-star-shv-01-lhr8.facebook.com (157.240.221.18)  19.260 ms  16.162 ms  17.667 ms
```

I think they are separate services hitting a single shared load balancer. Perhaps they are subtly different, or only occasionally and imperceptibly different.

Lewiscowles1986 commented 4 years ago

DNS records would be another way to go, but the endpoint could be a PaaS ingress/egress router, so the records can be opaque.
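A rough sketch of that DNS comparison, using only Python's standard `socket` module. The function names `resolve` and `shares_name_or_address` are made up for illustration. As noted, the result is only a hint: two hostnames can share a backend behind different addresses, or share an address while routing to different services.

```python
import socket


def resolve(host):
    """Return (canonical_name, sorted IPv4 addresses) for a hostname."""
    canonical, _aliases, addresses = socket.gethostbyname_ex(host)
    return canonical, sorted(addresses)


def shares_name_or_address(a, b):
    """True if two resolve() results share a canonical name or any address."""
    return a[0] == b[0] or bool(set(a[1]) & set(b[1]))


if __name__ == "__main__":
    # Needs network access; results vary by resolver and location.
    m = resolve("m.facebook.com")
    mbasic = resolve("mbasic.facebook.com")
    print(m, mbasic, shares_name_or_address(m, mbasic))
```

Fed the names and addresses from the ping output earlier in this thread (`star-mini.c10r.facebook.com` at 157.240.221.35 versus `star.c10r.facebook.com` at 157.240.221.18), `shares_name_or_address` returns `False`.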

Lewiscowles1986 commented 4 years ago

You could fake the user agent and full browser request signature from the browser, but surveillance capitalists have recently been checking for subtle timing differences in implementations

https://www.gamingonlinux.com/articles/if-you-cant-login-to-world-of-warcraft-or-wow-classic-on-linux-heres-a-quick-fix-for-now.14967

snarfed commented 4 years ago

right. having worked on infrastructure, networking, and end user applications at another big tech company, i can confirm that none of those are really conclusive in any direction. fortunately it didn't really matter in this case.

Lewiscowles1986 commented 4 years ago

So no net new information. Why ask me to comment or help if you are so sure you know better?