Closed snarfed closed 9 years ago
ugh, now this one too. oddly twitter, not FB.
https://snarfed.org/2015-10-03_15620 https://twitter.com/schnarfed/status/650319290500022272
scratch that, it eventually found the twitter one after all.
i investigated a bit more tonight, and i made some progress, but only some. the right SyndicatedPost
entities are created, and the Response
entities also looked ok. my current hunch is that it's specific to checkins. another more recent checkin didn't get responses backfed either, even though its syndication links were also found fine: https://www.facebook.com/212038/posts/10101921412066613 , https://snarfed.org/2015-10-08_15773
i think i found it. FB posts with photos are actually (at least) two posts with separate ids, one for the post itsel and one for each photo. we've handled this for a while now by canonicalizing on the photo post id, since the normal post's object_id
points to it, but there's no pointer from the photo object back to the post.
that's all fine. however, syndication urls are often the post, not the photo, and we don't canonicalize them all the way back to the photo id, so they don't match. we need to do that...specifically, in a new else
clause here.
raw notes:
synd url is post: https://www.facebook.com/212038/posts/10101921412126493 (poll log at bottom)
fb api returns photo: https://www.facebook.com/photo.php?fbid=10101921412066613&set=a.725223993143.2289690.212038&type=3
fb api for post has object_id and link pointing to photo: https://developers.facebook.com/tools/explorer?method=GET&path=212038_10101921412126493&version=v2.2
"link": "https://www.facebook.com/photo.php?fbid=10101921412066613&set=a.725223993143.2289690.212038&type=3",
"object_id": "10101921412066613",
we canonicalize activity id to photo fbid: https://github.com/snarfed/bridgy/blob/master/facebook.py#L171-L173 https://github.com/snarfed/granary/blob/master/granary/facebook.py#L680
poll that found synd url:
https://www.brid.gy/log?start_time=1444404970&key=aglzfmJyaWQtZ3lyGAsSDEZhY2Vib29rUGFnZSIGMjEyMDM4DA
saving discovered relationship https://www.facebook.com/212038/posts/10101921412126493 -> https://snarfed.org/2015-10-08_15773
discovered relationships {u'https://www.facebook.com/212038/posts/10101921412126493': [SyndicatedPost(key=Key('FacebookPage', '212038', 'SyndicatedPost', 5690572962529280), created=datetime.datetime(2015, 10, 9, 15, 36, 17, 295480), original=u'https://snarfed.org/2015-10-08_15773', syndication=u'https://www.facebook.com/212038/posts/10101921412126493', updated=datetime.datetime(2015, 10, 9, 15, 36, 17, 295490))]}
cc @dissolve since i know he can commiserate recently about facebook's many different kinds of post urls :P
OPD shows this in its log, so i think canonicalizing to photo post id (ie object_id
) should work:
starting posse post discovery with syndicated https://www.facebook.com/212038/posts/10101921412066613
i think i was wrong about the root cause. the datastore actually does have SyndicatedPost
s mapping the original posts to the photo object ids, e.g. https://snarfed.org/2015-10-08_15773 to https://www.facebook.com/212038/posts/10101921412126493 . entity link.
SyndicatedPost
s query, Response
s query.
grmph.
notable: the SyndicatedPost
with the photo id was created after all four Response
s for the photo id.
scratch that, 10101921412126493 is the post id, not the photo id.
more importantly, for photo posts, we currently get a response for the post id and a copy of that response on the photo id...which isn't ideal, but it also totally invalidates my theory that we needed to canonicalize syndication urls further.
i fully implemented that, so i'm going to stash it and include the patch here, and then forget about it.
diff --git a/facebook.py b/facebook.py
index b70070a..ba79829 100644
--- a/facebook.py
+++ b/facebook.py
@@ -35,6 +35,7 @@ from granary.source import SELF
import models
import util
+from google.appengine.api import memcache
from google.appengine.ext import ndb
from google.appengine.ext.webapp import template
import webapp2
@@ -221,11 +222,26 @@ class FacebookPage(models.Source):
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
+ url_id = self.gr_source.post_id(url)
+
ids = params.get('story_fbid') or params.get('fbid')
if ids:
url = post_url(ids[0])
- elif parsed.path.startswith('/notes/'):
- url = post_url(parsed.path.split('/')[-1])
+ elif url_id:
+ if parsed.path.startswith('/notes/'):
+ url = post_url(url_id)
+ else:
+ # fetch this post from facebook (or memcache) to see if we should
+ # canonicalize to the object_id. corresponds to canonicalization for
+ # photo objects in get_activities() above.
+ cache_key = 'FO %s' % url_id
+ object_id = memcache.get(cache_key)
+ if object_id is None:
+ post = self.get_post(url_id)
+ object_id = post.get('object', {}).get('fb_object_id', '') if post else ''
+ memcache.set(cache_key, object_id)
+ if object_id:
+ url = post_url(object_id)
username = self.username or self.inferred_username
if username:
diff --git a/test/test_facebook.py b/test/test_facebook.py
index b6af8ef..6b28ac4 100644
--- a/test/test_facebook.py
+++ b/test/test_facebook.py
@@ -6,6 +6,7 @@ __author__ = ['Ryan Barrett <bridgy@ryanb.org>']
import app
import copy
import datetime
+import logging
import json
import re
import urllib
@@ -236,17 +237,32 @@ class FacebookPageTest(testutil.ModelsTest):
self.assertEquals(400, cm.exception.code)
self.assertEquals('not json', cm.exception.body)
+ def expect_canonicalize_syndurl_lookups(self, id, return_id):
+ for id in '212038_' + id, id:
+ self.expect_urlopen(
+ 'https://graph.facebook.com/v2.2/%s?access_token=my_token' % id,
+ json.dumps({'id': '0', 'object_id': return_id} if return_id else {}))
+
def test_canonicalize_syndication_url(self):
+ # should look up each object once, then cache it
+ self.expect_canonicalize_syndurl_lookups('314159', '222')
+ for id in 'snarfed.org', '444':
+ self.expect_canonicalize_syndurl_lookups(id, None)
+ self.mox.ReplayAll()
+
for expected, input in (
- ('https://www.facebook.com/212038/posts/314159',
+ ('https://www.facebook.com/212038/posts/222',
+ 'http://facebook.com/snarfed.org/posts/314159'),
+ # second time should use memcache instead of fetching object from API
+ ('https://www.facebook.com/212038/posts/222',
'http://facebook.com/snarfed.org/posts/314159'),
('https://www.facebook.com/212038/posts/314159',
'https://facebook.com/snarfed.org/photos.php?fbid=314159'),
# note. https://github.com/snarfed/bridgy/issues/429
('https://www.facebook.com/212038/posts/314159',
'https://www.facebook.com/notes/ryan-b/title/314159'),
- ('https://www.facebook.com/212038/posts/10101299919362973',
- 'https://www.facebook.com/photo.php?fbid=10101299919362973&set=a.995695740593.2393090.212038&type=1&theater'),
+ ('https://www.facebook.com/212038/posts/314159',
+ 'https://www.facebook.com/photo.php?fbid=314159&set=a.456.2393090.212038&type=1&theater'),
('https://www.facebook.com/212038/posts/314159',
'https://facebook.com/permalink.php?story_fbid=314159&id=212038'),
('https://www.facebook.com/212038/posts/314159',
@@ -255,18 +271,22 @@ class FacebookPageTest(testutil.ModelsTest):
'https://m.facebook.com/story.php?id=212038&story_fbid=314159'),
# make sure we don't touch user.name when it appears elsewhere in the url
('https://www.facebook.com/25624/posts/snarfed.org',
- 'http://www.facebook.com/25624/posts/snarfed.org')):
+ 'http://www.facebook.com/25624/posts/snarfed.org'),
+ ):
+ logging.debug(input)
self.assertEqual(expected, self.fb.canonicalize_syndication_url(input))
# username should override inferred username
self.fb.inferred_username = 'mr-disguise'
- url = 'https://www.facebook.com/mr-disguise/posts/314159'
- self.assertEqual(url, self.fb.canonicalize_syndication_url(url))
+ self.assertEqual('https://www.facebook.com/mr-disguise/posts/444',
+ self.fb.canonicalize_syndication_url(
+ 'https://www.facebook.com/mr-disguise/posts/444'))
# if no username, fall through
self.fb.username = None
- self.assertEqual('https://www.facebook.com/212038/posts/314159',
- self.fb.canonicalize_syndication_url(url))
+ self.assertEqual('https://www.facebook.com/212038/posts/444',
+ self.fb.canonicalize_syndication_url(
+ 'https://www.facebook.com/mr-disguise/posts/444'))
def test_photo_syndication_url(self):
"""End to end test with syndication URL with FB object id instead of post id.
@@ -297,6 +317,7 @@ class FacebookPageTest(testutil.ModelsTest):
self.assertNotIn('222', gr_test_facebook.POST['id'])
self.assertEquals('222', gr_test_facebook.POST['object_id'])
+ self.expect_canonicalize_syndurl_lookups('222', '222')
self.expect_requests_get('http://my.orig/post', """
<html class="h-entry">
<a class="u-syndication" href="https://www.facebook.com/photo.php?fbid=222&set=a.995695740593.2393090.212038&type=1&theater'"></a>
@@ -332,6 +353,8 @@ class FacebookPageTest(testutil.ModelsTest):
self.assertIsNone(fb.key.get().inferred_username)
# should infer username
+ self.expect_canonicalize_syndurl_lookups('123', '123')
+ self.mox.ReplayAll()
syndpost = models.SyndicatedPost.insert(
self.fb, original='http://fin.al',
syndication='http://facebook.com/fooey/posts/123')
ok, maybe i have it this time. it's true that we don't canonicalize syndication urls to photo id. the key additional point is that we canonicalize Response.activity_json.url
and .id
to photo id, but not Response.key.id
. that misled me at first. (it's also why we have duplicate responses.)
ok, so maybe we go with this patch after all, and also canonicalize Response.key.id
to de-dupe responses. off to try that now.
goddammit. that worked, but all the old responses came through with people's names as Bridgy Response. i hate facebook.
inexplicable. those responses' markup is the same as responses that worked; only ids and the actual name differed. i'm going to ignore this for now.
diff 10101911886940043.html 10101913589088923_10101009518683263.html
24c24
< <span class="u-uid">tag:facebook.com,2013:10101911886940043_liked_by_10152298133881113</span>
---
> <span class="u-uid">tag:facebook.com,2013:10101913589088923_liked_by_10101009518683263</span>
29,30c29,30
< <div class="p-name"><a class="u-url" href="https://www.facebook.com/10152298133881113">Kristina Lillie</a></div>
< <img class="u-photo" src="https://graph.facebook.com/v2.2/10152298133881113/picture?type=large" alt="">
---
> <div class="p-name"><a class="u-url" href="https://www.facebook.com/10101009518683263">Edward Requenez</a></div>
> <img class="u-photo" src="https://graph.facebook.com/v2.2/10101009518683263/picture?type=large" alt="">
33c33
< <div class="p-name"><a class="u-url" href="https://www.facebook.com/212038/posts/10101911886940043">likes this.</a></div>
---
> <div class="p-name"><a class="u-url" href="https://www.facebook.com/212038/posts/10101913589088923">likes this.</a></div>
42,43c42,43
< <a class="u-like u-like-of" href="https://www.facebook.com/212038/posts/10101911886940043"></a>
< <a class="u-like u-like-of" href="https://snarfed.org/2015-10-02_15601"></a>
---
> <a class="u-like u-like-of" href="https://www.facebook.com/212038/posts/10101913589088923"></a>
> <a class="u-like u-like-of" href="https://snarfed.org/2015-10-03_15609"></a>
note to self : also need to skip non-facebook.com URLs in FacebookPage.canonicalize_syndication_urls
.
ok. i think this is all done except de-duping the Response
key ids, and i think that will be easier to do during #404, so i'm going to close this. i think. fingers crossed! :beers: :sleeping:
reopening. this introduces a lot of extra serial FB API calls for sites that regularly syndicate to FB. the memcaching helps, but only until the cache entries expire, which is currently every 6h or so. e.g. this poll spent ~60s on 230 (!) serial FB object fetches. 2 per syndication link, so evidently 115 syndication links. yow.
options:
object_id
s in the datastoreSyndicatedPost
s, and only resolve ones we haven't seen beforefb_object_id
since we already have it.fb_*
fields when we prune stored activities. i've changed that, but it's somewhat too late. :/most of these depend on facebook ids being immutable once they're allocated to posts and associated between them, which i'm going to blindly assume is true.
worth noting: those 230 fetches were all during reprocessing old responses. so if we had kept fb_object_id
in the pruned activities, we could have used that to avoid them.
next TODO: handle and ignore 4xxes from the FB object fetch.
woo, finally got the caching working. i think i can finally close this now. yay.
reopening. :/ @aaronpk's most recent photo post id has a colon in it, 10102229448035176:0
, which we don't handle here yet. details on colon ids in #305.
FB post URLs:
original post: http://aaronparecki.com/notes/2015/10/23/1/mrpukey
haha sorry!
bridgy successfully found and used the twitter and instagram syndication urls for my last two posts, but not the facebook ones.
https://snarfed.org/2015-10-02_15601 https://snarfed.org/2015-10-02_15607
they were originally m.facebook.com urls like this, which it didn't seem to find at all.
https://m.facebook.com/story.php?story_fbid=10101911911520783&id=212038
i later changed them to www links like this, which it found but hasn't actually connected to the responses.
https://www.facebook.com/212038/posts/10101911911520783
i hate facebook.