search all silo posts for links to users' sites and send mentions

snarfed commented 9 years ago

spun out of #51. from https://github.com/snarfed/bridgy/issues/51#issuecomment-135816838:

an idea for expanding this: search silos for any posts, from anyone, that link to the user's domain(s), and send webmentions (wms) for them too. these are effectively mentions.

silo support for this is mixed:

Twitter: /search/tweets.json?q=
G+: /activities?query=
FB: no. search was removed in API v2.0. you can search over other things, including events, but mentions in event descriptions are a small minority case.
IG: no. can't find any full text search support in their API reference docs.
Flickr: ...?

snarfed commented 9 years ago

cc @kylewm in case you're interested in adding flickr search support... (see above)

snarfed commented 9 years ago

the remaining part here is to send mention posts themselves, not just their responses. this needs a new post response type connected to the post mf2 handler.

snarfed commented 8 years ago

finally soft launched this, and it worked well, but evidently has a memory leak, so i had to roll it back.

Exceeded soft private memory limit of 256 MB with 328 MB after servicing 2 requests total.

ugh.

there's FUD here and there about the sockets API maybe causing memory leaks due to badly handled range requests, but i can't tell how real it is or if it could be causing this. i suspect i've just been wasteful with memory, e.g. lots of string concatenations and copy.deepcopys, and it's finally time to pay the piper. whee, can't wait to heap profile. :sob:

silver lining: at least i know the window of commits where the leak was introduced!

snarfed commented 8 years ago

the little orange bump of 500s here is our instances flapping (OOMing, restarting, and OOMing again):

chart

here's a snippet of individual requests at peak flap. the red !!! ones are OOMs. not pretty!

snarfed commented 8 years ago

silver lining: it's working ok, at least! e.g. the top response here: https://www.brid.gy/twitter/kylewmahan#responses is this tweet: https://twitter.com/anarcho/status/643921641664200704 which propagated as a mention to https://kylewm.com/2015/09/repost-of-glenn-greenwald-the-new-revolving-door

kylewm commented 8 years ago

wow, that mention is hidden behind a redirect too, pretty cool!

snarfed commented 8 years ago

for the record: who's the dunce who sprinkled copy.deepcopys throughout poll, basically bridgy's inner loop, and then acted all surprised when it blew our memory budget? this guy!!! :P
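a tiny illustration of why that hurts, not the actual poll code: `copy.deepcopy` rebuilds every nested dict and list, while a shallow copy only rebuilds the top level and shares the rest. the `activity` dict and its fields here are made up for the example.

```python
import copy

# made-up stand-in for a silo activity; not bridgy's real data model
activity = {
    'id': 'tag:example.com,2015:1',
    'object': {'content': 'x' * 1000, 'replies': [{'id': 'r1'}, {'id': 'r2'}]},
}

deep = copy.deepcopy(activity)  # new copies of every nested dict and list
shallow = copy.copy(activity)   # new top-level dict, nested objects are shared

print(deep['object'] is activity['object'])     # False
print(shallow['object'] is activity['object'])  # True

# if only a top-level field changes, a shallow copy plus an override is enough
updated = dict(activity, verb='share')
print(updated['object'] is activity['object'])  # True
```

the catch with the shallow versions is that mutating a shared nested object changes it for everyone holding a reference, so this only works where the nested data is treated as read only.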

snarfed commented 8 years ago

ok, i think it might stick this time. monitoring graphs below. i turned it on for just @kylewm and me at the 1hr (ago) point, for 6 more accounts at 45m, and for everyone at 30m. ran out of memory once, largely due to polling @kevinmarks a few times in rapid succession (he's prolific), but that's it. and we hit that cap occasionally anyway, so i'm not too worried.

diplix commented 8 years ago

i love it. it actually collects all tweets containing links to my articles. looks great, too.

thanks a lot for this, it’s a great new feature!

snarfed commented 8 years ago

thanks for the kind words!

snarfed commented 8 years ago

this has noticeably increased our poll latency:

the poll task queue is now ~90m behind. not a big deal, but definitely not ideal. hrmph. time to profile i guess.

snarfed commented 8 years ago

some of this might be just because our slow poll frequency is once a day, so we're still working through the first set of search results for many users. that should be done by around noon PST. i'll revisit if latency is still consistently bad after that.

snarfed commented 8 years ago

scratch that, we'll be caught up by ~1:30pm PST today, since we're ~90m behind. math!

snarfed commented 8 years ago

poll latency is looking better now. averaging 5-10s, higher than ~4s before, but still reasonable.

snarfed commented 8 years ago

the poll queue is still behind by 45m :/, but i'm hoping some of that was due to #490. i pushed out a change there (1ebfe1cf0d3c6675d9f8291434dc50e3fba2c39a) a few hours ago that adds a bunch of shortlink generator domains to the blacklist and checks the blacklist before searching for a domain, so i'm hoping that will help some too.
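roughly what "check the blacklist before searching" could look like, as a sketch only. the `BLACKLIST` entries are sample shortener domains i picked for illustration, not the list from that commit, and the function names and the parent-domain matching are assumptions too.

```python
from urllib.parse import urlparse

# sample entries for illustration only; not the real blacklist
BLACKLIST = {'t.co', 'bit.ly', 'goo.gl'}


def in_blacklist(domain):
    """True if the domain or any of its parent domains is blacklisted."""
    parts = domain.lower().split('.')
    return any('.'.join(parts[i:]) in BLACKLIST for i in range(len(parts)))


def searchable_domains(urls):
    """Return the unique domains worth searching for, in order, skipping blacklisted ones."""
    domains = []
    for url in urls:
        domain = urlparse(url).netloc
        if domain and not in_blacklist(domain) and domain not in domains:
            domains.append(domain)
    return domains


print(searchable_domains([
    'https://snarfed.org/',
    'http://bit.ly/abc',
    'http://www.t.co/x',
]))
# prints: ['snarfed.org']
```

doing the check up front means a blacklisted domain never costs a search API call at all, instead of being searched and then thrown away.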

snarfed commented 8 years ago

tentatively closing. this has been running in prod and stable for a few days. I'm sure there are more bugs left to fix, but we can open new issues for them.

singpolyma commented 8 years ago

Does brid.gy also turn @-mentions of my twitter username into webmentions to my domain? That would be similar to this and very nice

snarfed commented 8 years ago

@singpolyma not right now, but that's an interesting feature request. just to confirm, you're proposing they'd be sent to your front page, e.g. target=https://singpolyma.net/?

singpolyma commented 8 years ago

@snarfed yes. or whatever URL is on my twitter profile

snarfed commented 8 years ago

i currently craft search queries by stripping scheme (ie http://), putting quotes around the remaining domain and path, and ORing all of those together, e.g. "snarfed.org" OR "instagram.com/snarfed". sadly, this has been returning both false positives and false negatives in both G+ and Twitter. :/
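a minimal sketch of that query construction, for illustration. the function names are made up, and dropping the trailing slash and the URL query string is my assumption about what "stripping" covers. the `/search/tweets.json?q=` path is the Twitter one listed at the top of this issue.

```python
from urllib.parse import urlencode, urlparse


def build_search_query(urls):
    """Strip the scheme from each URL, quote what's left, and OR them together."""
    terms = []
    for url in urls:
        parsed = urlparse(url)
        # keep domain and path; drop scheme, query string, and trailing slash
        term = (parsed.netloc + parsed.path).rstrip('/')
        if term:
            terms.append('"%s"' % term)
    return ' OR '.join(terms)


def twitter_search_path(urls):
    """Hypothetical helper: URL-encode the query onto the Twitter search path."""
    return '/search/tweets.json?' + urlencode({'q': build_search_query(urls)})


print(build_search_query(['http://snarfed.org/', 'https://instagram.com/snarfed']))
# prints: "snarfed.org" OR "instagram.com/snarfed"
```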

i added the scheme back to G+ searches in 485af7323352ef9840c962c090dd7598fe9f8d53, and it looks like that cut out the false positives but didn't add any false negatives.

still working on Twitter. here's some research so far for the example domain hypothes.is, including links to searches:

hypothes.is returns similar usernames and word variations, e.g. @hypothesis and hypothesis is
"hypothes.is" (our current approach) is better, but still returns @hypothesis
https://hypothes.is and https://hypothes.is/ (trailing slash) only return links to the home page. same with "https://hypothes.is" and "https://hypothes.is/"

hrmph.

snarfed commented 8 years ago

i'm now thinking about still using the "hypothes.is" style search for twitter and filtering out the false positives manually.
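one way the manual filtering could go, as a sketch under assumptions: keep a search result only if one of its links actually points at the domain. this reads `entities.urls[].expanded_url` from Twitter API tweet objects; the function names are hypothetical, and it doesn't follow redirects, so a link hidden behind a third party shortener would be dropped here.

```python
from urllib.parse import urlparse


def links_to_domain(tweet, domain):
    """True if any expanded link in the tweet is on domain or one of its subdomains."""
    for link in tweet.get('entities', {}).get('urls', []):
        host = urlparse(link.get('expanded_url') or '').netloc.lower()
        if host == domain or host.endswith('.' + domain):
            return True
    return False


def filter_results(tweets, domain):
    """Drop search results that only match on text, e.g. a similar @-username."""
    return [t for t in tweets if links_to_domain(t, domain)]


results = [
    # false positive: matches on the username only, no link
    {'id': 1, 'text': '@hypothesis is neat', 'entities': {'urls': []}},
    # real mention: links to the domain
    {'id': 2, 'text': 'annotate all the things',
     'entities': {'urls': [{'expanded_url': 'https://hypothes.is/blog/'}]}},
]
print([t['id'] for t in filter_results(results, 'hypothes.is')])
# prints: [2]
```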

snarfed commented 8 years ago

discussion in IRC.

singpolyma commented 8 years ago

Filtering false positives seems like an essential thing to do. Getting as many results as possible and then filtering afterward is probably best

snarfed commented 8 years ago

i wish! sadly many users' domains are common words, or have common words in them, so their false positive rate can be 1K:1 or even 1M:1 for domains with words like blog or web. :/ and bridgy is approaching 1k twitter users, so I'd like to try to cut down that workload (and cost) a bit.

singpolyma commented 8 years ago

filter out common words and only search for the unique part maybe?

snarfed commented 8 years ago

oh boy, and now i'm in the business of maintaining a stop word list and search query rewriter. :P you're definitely right, it's doable, i'm just not sure i want to take that plunge...

singpolyma commented 8 years ago

Sorry. Was a thought

snarfed commented 8 years ago

np! definitely appreciated. :two_men_holding_hands:

search all silo posts for links to users' sites and send mentions #456