facebook: resolve post vs photo ids in syndication urls

snarfed commented 9 years ago

bridgy successfully found and used the twitter and instagram syndication urls for my last two posts, but not the facebook ones.

https://snarfed.org/2015-10-02_15601 https://snarfed.org/2015-10-02_15607

they were originally m.facebook.com urls like this, which it didn't seem to find at all.

https://m.facebook.com/story.php?story_fbid=10101911911520783&id=212038

i later changed them to www links like this, which it found but hasn't actually connected to the responses.

https://www.facebook.com/212038/posts/10101911911520783

i hate facebook.

snarfed commented 9 years ago

ugh, now this one too. oddly twitter, not FB.

https://snarfed.org/2015-10-03_15620 https://twitter.com/schnarfed/status/650319290500022272

snarfed commented 9 years ago

scratch that, it eventually found the twitter one after all.

snarfed commented 9 years ago

i investigated a bit more tonight, and i made some progress, but only some. the right SyndicatedPost entities are created, and the Response entities also looked ok. my current hunch is that it's specific to checkins. another more recent checkin didn't get responses backfed either, even though its syndication links were also found fine: https://www.facebook.com/212038/posts/10101921412066613 , https://snarfed.org/2015-10-08_15773

snarfed commented 9 years ago

i think i found it. FB posts with photos are actually (at least) two posts with separate ids, one for the post itsel and one for each photo. we've handled this for a while now by canonicalizing on the photo post id, since the normal post's object_id points to it, but there's no pointer from the photo object back to the post.

that's all fine. however, syndication urls are often the post, not the photo, and we don't canonicalize them all the way back to the photo id, so they don't match. we need to do that...specifically, in a new else clause here.

raw notes:

synd url is post: https://www.facebook.com/212038/posts/10101921412126493 (poll log at bottom)

fb api returns photo: https://www.facebook.com/photo.php?fbid=10101921412066613&set=a.725223993143.2289690.212038&type=3

fb api for post has object_id and link pointing to photo: https://developers.facebook.com/tools/explorer?method=GET&path=212038_10101921412126493&version=v2.2

  "link": "https://www.facebook.com/photo.php?fbid=10101921412066613&set=a.725223993143.2289690.212038&type=3",
  "object_id": "10101921412066613",

we canonicalize activity id to photo fbid: https://github.com/snarfed/bridgy/blob/master/facebook.py#L171-L173 https://github.com/snarfed/granary/blob/master/granary/facebook.py#L680

poll that found synd url:

https://www.brid.gy/log?start_time=1444404970&key=aglzfmJyaWQtZ3lyGAsSDEZhY2Vib29rUGFnZSIGMjEyMDM4DA

saving discovered relationship https://www.facebook.com/212038/posts/10101921412126493 -> https://snarfed.org/2015-10-08_15773
discovered relationships {u'https://www.facebook.com/212038/posts/10101921412126493': [SyndicatedPost(key=Key('FacebookPage', '212038', 'SyndicatedPost', 5690572962529280), created=datetime.datetime(2015, 10, 9, 15, 36, 17, 295480), original=u'https://snarfed.org/2015-10-08_15773', syndication=u'https://www.facebook.com/212038/posts/10101921412126493', updated=datetime.datetime(2015, 10, 9, 15, 36, 17, 295490))]}

snarfed commented 9 years ago

cc @dissolve since i know he can commiserate recently about facebook's many different kinds of post urls :P

snarfed commented 9 years ago

OPD shows this in its log, so i think canonicalizing to photo post id (ie object_id) should work:

starting posse post discovery with syndicated https://www.facebook.com/212038/posts/10101921412066613

snarfed commented 9 years ago

i think i was wrong about the root cause. the datastore actually does have SyndicatedPosts mapping the original posts to the photo object ids, e.g. https://snarfed.org/2015-10-08_15773 to https://www.facebook.com/212038/posts/10101921412126493 . entity link.

SyndicatedPosts query, Responses query.

grmph.

notable: the SyndicatedPost with the photo id was created after all four Responses for the photo id.

snarfed commented 9 years ago

scratch that, 10101921412126493 is the post id, not the photo id.

snarfed commented 9 years ago

more importantly, for photo posts, we currently get a response for the post id and a copy of that response on the photo id...which isn't ideal, but it also totally invalidates my theory that we needed to canonicalize syndication urls further.

i fully implemented that, so i'm going to stash it and include the patch here, and then forget about it.

diff --git a/facebook.py b/facebook.py
index b70070a..ba79829 100644
--- a/facebook.py
+++ b/facebook.py
@@ -35,6 +35,7 @@ from granary.source import SELF
 import models
 import util

+from google.appengine.api import memcache
 from google.appengine.ext import ndb
 from google.appengine.ext.webapp import template
 import webapp2
@@ -221,11 +222,26 @@ class FacebookPage(models.Source):

     parsed = urlparse.urlparse(url)
     params = urlparse.parse_qs(parsed.query)
+    url_id = self.gr_source.post_id(url)
+
     ids = params.get('story_fbid') or params.get('fbid')
     if ids:
       url = post_url(ids[0])
-    elif parsed.path.startswith('/notes/'):
-      url = post_url(parsed.path.split('/')[-1])
+    elif url_id:
+      if parsed.path.startswith('/notes/'):
+        url = post_url(url_id)
+      else:
+        # fetch this post from facebook (or memcache) to see if we should
+        # canonicalize to the object_id. corresponds to canonicalization for
+        # photo objects in get_activities() above.
+        cache_key = 'FO %s' % url_id
+        object_id = memcache.get(cache_key)
+        if object_id is None:
+          post = self.get_post(url_id)
+          object_id = post.get('object', {}).get('fb_object_id', '') if post else ''
+          memcache.set(cache_key, object_id)
+        if object_id:
+          url = post_url(object_id)

     username = self.username or self.inferred_username
     if username:
diff --git a/test/test_facebook.py b/test/test_facebook.py
index b6af8ef..6b28ac4 100644
--- a/test/test_facebook.py
+++ b/test/test_facebook.py
@@ -6,6 +6,7 @@ __author__ = ['Ryan Barrett <bridgy@ryanb.org>']
 import app
 import copy
 import datetime
+import logging
 import json
 import re
 import urllib
@@ -236,17 +237,32 @@ class FacebookPageTest(testutil.ModelsTest):
     self.assertEquals(400, cm.exception.code)
     self.assertEquals('not json', cm.exception.body)

+  def expect_canonicalize_syndurl_lookups(self, id, return_id):
+    for id in '212038_' + id, id:
+      self.expect_urlopen(
+        'https://graph.facebook.com/v2.2/%s?access_token=my_token' % id,
+        json.dumps({'id': '0', 'object_id': return_id} if return_id else {}))
+
   def test_canonicalize_syndication_url(self):
+    # should look up each object once, then cache it
+    self.expect_canonicalize_syndurl_lookups('314159', '222')
+    for id in 'snarfed.org', '444':
+      self.expect_canonicalize_syndurl_lookups(id, None)
+    self.mox.ReplayAll()
+
     for expected, input in (
-      ('https://www.facebook.com/212038/posts/314159',
+      ('https://www.facebook.com/212038/posts/222',
+       'http://facebook.com/snarfed.org/posts/314159'),
+      # second time should use memcache instead of fetching object from API
+      ('https://www.facebook.com/212038/posts/222',
        'http://facebook.com/snarfed.org/posts/314159'),
       ('https://www.facebook.com/212038/posts/314159',
        'https://facebook.com/snarfed.org/photos.php?fbid=314159'),
       # note. https://github.com/snarfed/bridgy/issues/429
       ('https://www.facebook.com/212038/posts/314159',
        'https://www.facebook.com/notes/ryan-b/title/314159'),
-      ('https://www.facebook.com/212038/posts/10101299919362973',
-       'https://www.facebook.com/photo.php?fbid=10101299919362973&set=a.995695740593.2393090.212038&type=1&theater'),
+      ('https://www.facebook.com/212038/posts/314159',
+       'https://www.facebook.com/photo.php?fbid=314159&set=a.456.2393090.212038&type=1&theater'),
       ('https://www.facebook.com/212038/posts/314159',
        'https://facebook.com/permalink.php?story_fbid=314159&id=212038'),
       ('https://www.facebook.com/212038/posts/314159',
@@ -255,18 +271,22 @@ class FacebookPageTest(testutil.ModelsTest):
        'https://m.facebook.com/story.php?id=212038&story_fbid=314159'),
       # make sure we don't touch user.name when it appears elsewhere in the url
       ('https://www.facebook.com/25624/posts/snarfed.org',
-       'http://www.facebook.com/25624/posts/snarfed.org')):
+       'http://www.facebook.com/25624/posts/snarfed.org'),
+      ):
+      logging.debug(input)
       self.assertEqual(expected, self.fb.canonicalize_syndication_url(input))

     # username should override inferred username
     self.fb.inferred_username = 'mr-disguise'
-    url = 'https://www.facebook.com/mr-disguise/posts/314159'
-    self.assertEqual(url, self.fb.canonicalize_syndication_url(url))
+    self.assertEqual('https://www.facebook.com/mr-disguise/posts/444',
+                     self.fb.canonicalize_syndication_url(
+                       'https://www.facebook.com/mr-disguise/posts/444'))

     # if no username, fall through
     self.fb.username = None
-    self.assertEqual('https://www.facebook.com/212038/posts/314159',
-                     self.fb.canonicalize_syndication_url(url))
+    self.assertEqual('https://www.facebook.com/212038/posts/444',
+                     self.fb.canonicalize_syndication_url(
+                       'https://www.facebook.com/mr-disguise/posts/444'))

   def test_photo_syndication_url(self):
     """End to end test with syndication URL with FB object id instead of post id.
@@ -297,6 +317,7 @@ class FacebookPageTest(testutil.ModelsTest):

     self.assertNotIn('222', gr_test_facebook.POST['id'])
     self.assertEquals('222', gr_test_facebook.POST['object_id'])
+    self.expect_canonicalize_syndurl_lookups('222', '222')
     self.expect_requests_get('http://my.orig/post', """
     <html class="h-entry">
       <a class="u-syndication" href="https://www.facebook.com/photo.php?fbid=222&set=a.995695740593.2393090.212038&type=1&theater'"></a>
@@ -332,6 +353,8 @@ class FacebookPageTest(testutil.ModelsTest):
     self.assertIsNone(fb.key.get().inferred_username)

     # should infer username
+    self.expect_canonicalize_syndurl_lookups('123', '123')
+    self.mox.ReplayAll()
     syndpost = models.SyndicatedPost.insert(
       self.fb, original='http://fin.al',
       syndication='http://facebook.com/fooey/posts/123')

snarfed commented 9 years ago

ok, maybe i have it this time. it's true that we don't canonicalize syndication urls to photo id. the key additional point is that we canonicalize Response.activity_json.url and .id to photo id, but not Response.key.id. that misled me at first. (it's also why we have duplicate responses.)

ok, so maybe we go with this patch after all, and also canonicalize Response.key.id to de-dupe responses. off to try that now.

snarfed commented 9 years ago

goddammit. that worked, but all the old responses came through with people's names as Bridgy Response. i hate facebook.

snarfed commented 9 years ago

inexplicable. those responses' markup is the same as responses that worked; only ids and the actual name differed. i'm going to ignore this for now.

diff 10101911886940043.html 10101913589088923_10101009518683263.html 
24c24
<   <span class="u-uid">tag:facebook.com,2013:10101911886940043_liked_by_10152298133881113</span>
---
>   <span class="u-uid">tag:facebook.com,2013:10101913589088923_liked_by_10101009518683263</span>
29,30c29,30
<     <div class="p-name"><a class="u-url" href="https://www.facebook.com/10152298133881113">Kristina Lillie</a></div>
<     <img class="u-photo" src="https://graph.facebook.com/v2.2/10152298133881113/picture?type=large" alt="">
---
>     <div class="p-name"><a class="u-url" href="https://www.facebook.com/10101009518683263">Edward Requenez</a></div>
>     <img class="u-photo" src="https://graph.facebook.com/v2.2/10101009518683263/picture?type=large" alt="">
33c33
<   <div class="p-name"><a class="u-url" href="https://www.facebook.com/212038/posts/10101911886940043">likes this.</a></div>
---
>   <div class="p-name"><a class="u-url" href="https://www.facebook.com/212038/posts/10101913589088923">likes this.</a></div>
42,43c42,43
< <a class="u-like u-like-of" href="https://www.facebook.com/212038/posts/10101911886940043"></a>
< <a class="u-like u-like-of" href="https://snarfed.org/2015-10-02_15601"></a>
---
> <a class="u-like u-like-of" href="https://www.facebook.com/212038/posts/10101913589088923"></a>
> <a class="u-like u-like-of" href="https://snarfed.org/2015-10-03_15609"></a>

snarfed commented 9 years ago

note to self : also need to skip non-facebook.com URLs in FacebookPage.canonicalize_syndication_urls.

snarfed commented 9 years ago

ok. i think this is all done except de-duping the Response key ids, and i think that will be easier to do during #404, so i'm going to close this. i think. fingers crossed! :beers: :sleeping:

snarfed commented 9 years ago

reopening. this introduces a lot of extra serial FB API calls for sites that regularly syndicate to FB. the memcaching helps, but only until the cache entries expire, which is currently every 6h or so. e.g. this poll spent ~60s on 230 (!) serial FB object fetches. 2 per syndication link, so evidently 115 syndication links. yow.

snarfed commented 9 years ago

options:

drop this feature. not ideal, obviously.
store resolved object_ids in the datastore
store both pre- and post-resolve syndication urls in SyndicatedPosts, and only resolve ones we haven't seen before
~~when we canonicalize urls from activities, reuse the activity's fb_object_id since we already have it.~~ damn, doesn't work since we don't keep fb_* fields when we prune stored activities. i've changed that, but it's somewhat too late. :/
...?

most of these depend on facebook ids being immutable once they're allocated to posts and associated between them, which i'm going to blindly assume is true.

snarfed commented 9 years ago

worth noting: those 230 fetches were all during reprocessing old responses. so if we had kept fb_object_id in the pruned activities, we could have used that to avoid them.

snarfed commented 9 years ago

next TODO: handle and ignore 4xxes from the FB object fetch.

snarfed commented 9 years ago

woo, finally got the caching working. i think i can finally close this now. yay.

snarfed commented 9 years ago

reopening. :/ @aaronpk's most recent photo post id has a colon in it, 10102229448035176:0, which we don't handle here yet. details on colon ids in #305.

FB post URLs:

https://www.facebook.com/aaronpk/posts/10102229448035176:0 (his syndication URL)
https://www.facebook.com/11500459/posts/10102229448035176:0 (our normalized syndication URL)
https://www.facebook.com/photo.php?fbid=10102229448035176&set=a.508859459516.2102581.11500459&type=3 (photo object)

original post: http://aaronparecki.com/notes/2015/10/23/1/mrpukey

aaronpk commented 9 years ago

haha sorry!

snarfed / bridgy

facebook: resolve post vs photo ids in syndication urls #498