URL-only posts from Twitter fail fuzzy matching and ricochet around social networks

stephensekula commented 7 years ago

A recent event exposed a serious flaw in the current architecture of the bridge: a message can evade both fuzzy text matching and message ID matching.

Let's say we have a URL-only message that appears in Twitter:

https://github.com/stephensekula/navierstokes

No other text. This is retrieved by NavierStokes using python-twitter, and the text comes over unicode-formatted but not in HTML format. TwitterTools now takes the text and "HTML-izes" the URLs in a post, so that when they go to other networks that respect HTML they are nicely formatted, with an active hyperlink, etc.
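To make the "HTML-izing" step concrete, here is a minimal sketch of what such a transformation could look like. The names `URL_PATTERN` and `htmlize_urls` are hypothetical and the regular expression is an assumption; this is not the actual TwitterTools code.

```python
import re
from html import escape

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def htmlize_urls(text):
    """Wrap each bare URL in an anchor tag (illustrative helper only)."""
    def to_anchor(match):
        url = escape(match.group(0))  # escape &, <, > and quotes for safe embedding
        return '<a href="{0}">{0}</a>'.format(url)
    return URL_PATTERN.sub(to_anchor, text)


tweet = "https://github.com/stephensekula/navierstokes"
html_tweet = htmlize_urls(tweet)
```

For the URL-only tweet above, `html_tweet` has the same anchor form as the `message.content` values shown in this report.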

However, this creates a flaw that allows messages to evade fuzzy text matching and message ID matching.

Let's focus on message ID matching. Each message on each social network has a unique identifier (UID) code assigned to it. It's a different format for each network, but that doesn't matter - the idea is that the combination of recording this ID and using fuzzy text matching should catch repeat posts and prevent them from being circulated over and over again between networks.

So the above URL-only message comes from Twitter, and let's say Twitter assigns it a unique ID of abcd1234. So here is what this message would look like if we stored only its content and id:

message.content = <a href="https://github.com/stephensekula/navierstokes">https://github.com/stephensekula/navierstokes</a>
message.id = abcd1234

So NavierStokes then goes and checks other networks (let's assume for now we're just bridging between Twitter, GNU Social, and Pump.io). Since this is the first time the message appears from Twitter into NavierStokes, its ID is not recorded as having already been written to these other networks.

Let's now consider fuzzy text matching. Different networks format URLs differently (e.g. Twitter sends things using its URL shortener, t.co, while there is no such nonsense from GNU Social, Pump.io, etc.). To keep those differences from creating a false negative match, URLs are stripped from messages before they are compared between two networks. So in the above case, our message is transformed from this:

<a href="https://github.com/stephensekula/navierstokes">https://github.com/stephensekula/navierstokes</a>

to this:

<a href=""></a>

This is obviously already not ideal. We'd really want messages like this to not even be considered for sharing between networks, since there is SO little to actually go on in matching. However, the string is not empty - it still contains HTML - and so it's considered for fuzzy matching. Let's assume it fails to match, so NavierStokes considers this a unique and new message that should be bridged between networks.
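The effect of stripping URLs from an HTML-ized message can be reproduced with a few lines. This is a sketch under assumptions: `URL_PATTERN` and `strip_urls` are hypothetical names, and the real stripping code in NavierStokes may use a different pattern.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def strip_urls(text):
    """Remove every URL from the text, leaving all other characters in place."""
    return URL_PATTERN.sub("", text)


plain_message = "https://github.com/stephensekula/navierstokes"
html_message = '<a href="{0}">{0}</a>'.format(plain_message)

stripped_plain = strip_urls(plain_message)  # nothing is left at all
stripped_html = strip_urls(html_message)    # the empty anchor tag is left behind

# The leftover markup is neither empty nor whitespace, so it passes "is there text?" checks.
leftover_is_blank = (stripped_html == "" or stripped_html.isspace())
```

The plain version strips down to an empty string, while the HTML version keeps its 15 characters of markup, which is why it is still handed to the fuzzy matcher.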

GNU Social gets a copy and assigns it an ID of 555, while Pump.io gets it and assigns an ID of eeee9999.

The next time NavierStokes is run, it now harvests these messages AGAIN from the three networks. From Twitter, we get this again:

message.content = <a href="https://github.com/stephensekula/navierstokes">https://github.com/stephensekula/navierstokes</a>
message.id = abcd1234

while from GNU Social we get:

message.content = <a href="https://github.com/stephensekula/navierstokes">https://github.com/stephensekula/navierstokes</a>
message.id = 555

and from Pump.io:

message.content = <a href="https://github.com/stephensekula/navierstokes">https://github.com/stephensekula/navierstokes</a>
message.id = eeee9999

Of course, the message from Twitter has already been recorded as having been written to the other two networks, so it is not passed on in this round. However, when the GNU Social version of the message is compared to the Pump.io version, NavierStokes concludes, based on ID, that Pump.io has not seen this message from GNU Social, and GNU Social has not seen this message from Pump.io.
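The gap in ID matching can be illustrated with a toy ledger. This is only a sketch: the `DeliveryLedger` class, its methods, and the network labels are hypothetical, and the way NavierStokes actually stores IDs may differ.

```python
from collections import defaultdict


class DeliveryLedger:
    """Hypothetical record of which source messages were already written to which network."""

    def __init__(self):
        # target network -> set of (source network, source message ID)
        self._delivered = defaultdict(set)

    def record(self, source, message_id, target):
        self._delivered[target].add((source, message_id))

    def already_delivered(self, source, message_id, target):
        return (source, message_id) in self._delivered[target]


ledger = DeliveryLedger()

# First run: the Twitter message is bridged to the other two networks.
ledger.record("twitter", "abcd1234", "gnusocial")
ledger.record("twitter", "abcd1234", "pumpio")

# Second run: the Twitter original is correctly blocked by its ID...
twitter_blocked = ledger.already_delivered("twitter", "abcd1234", "gnusocial")

# ...but the copies carry brand-new IDs that the ledger has never seen.
gnusocial_copy_known = ledger.already_delivered("gnusocial", "555", "pumpio")
pumpio_copy_known = ledger.already_delivered("pumpio", "eeee9999", "gnusocial")
```

Because the copies on GNU Social and Pump.io look unseen by ID alone, only the text comparison stands between them and another round of bridging.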

So here we are relying on fuzzy matching to save us.

But here it fails, and the failure mode is not yet clear. For instance, from the log files, I find this:

2017-07-17 03:31:03,652 https://t.co/GovEqhCJZZ
2017-07-17 03:31:03,652      BEST MATCH ON twitter: 0.000000  ____ <a href="https://t.co/GovEqhCJZZ">https://t.co/GovEqhCJZZ</a>

So despite the fact that the URL-only message did have a "best match" on another network, that match had a score of 0.000000 from fuzzywuzzy... which is super-bad, since this evades the required match threshold and appears to be a totally new message! But it clearly is not.

If I had NOT stripped URLs out of the message, then it definitely would have matched, but the fact that one has HTML in it and the other does not causes all the trouble.

This needs to be addressed.

stephensekula commented 7 years ago

One idea: a maximum value vote.

Fuzzy-match with the URLs left in, and fuzzy match without the URLs left in. Take the larger of the two scores, and see whether that lies above the threshold. If it does, stop the bridging of the message between the two networks and move on. I will try this.
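A rough sketch of the maximum value vote follows. To keep it self-contained it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, so scores run from 0.0 to 1.0. The names `similarity`, `max_vote_score`, `is_duplicate`, and the threshold value are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

URL_PATTERN = re.compile(r'https?://[^\s<>"]+')  # assumed URL pattern
MATCH_THRESHOLD = 0.7                            # hypothetical threshold


def similarity(a, b):
    """Similarity in [0, 1]; an empty side counts as no evidence of a match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def max_vote_score(a, b):
    """Score with URLs left in and with URLs stripped; keep the larger score."""
    with_urls = similarity(a, b)
    without_urls = similarity(URL_PATTERN.sub("", a).strip(),
                              URL_PATTERN.sub("", b).strip())
    return max(with_urls, without_urls)


def is_duplicate(a, b):
    """True when the better of the two scores reaches the threshold."""
    return max_vote_score(a, b) >= MATCH_THRESHOLD


url_only = "https://github.com/stephensekula/navierstokes"
same_url_is_duplicate = is_duplicate(url_only, url_only)
```

The two scores cover each other's blind spots: the URL-stripped comparison handles the same text carrying different link formats, and the URL-included comparison handles posts that have nothing but a link.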

stephensekula commented 7 years ago

A bugfix is available: https://github.com/stephensekula/navierstokes/commit/a0479d42199e78180e0d520f9945ac1564c83fa8

In short:

The fuzzy text matching code is now concentrated into a function in NavierStokes.py
BeautifulSoup is used to clean HTML out of the messages, to compare their pure text content
Two fuzzy matching scores are computed: one with HTML/URLs left in, one with those stripped. The larger of the two is used as the final matching score.
In addition, zero-length unicode strings were slipping past the isspace() check (in Python, isspace() returns False for an empty string, so an empty message was not treated as blank). This was probably the major contributor to this failure.
So now, the LENGTH of the unicode string is used as an additional fail-safe check. If length is < 3 characters, ABORT and never share this post.
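The points above about HTML cleaning, isspace(), and the length check can be sketched together. This example uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The helper names, the URL pattern, and the order of the steps are assumptions, not the code from the commit.

```python
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r"https?://\S+")  # assumed URL pattern
MIN_TEXT_LENGTH = 3                        # fewer than 3 characters: never share


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def comparable_text(message):
    """Strip HTML first, then URLs, then surrounding whitespace."""
    return URL_PATTERN.sub("", plain_text(message)).strip()


def is_shareable(message):
    """Length-based fail-safe: an isspace() test alone misses the empty string."""
    return len(comparable_text(message)) >= MIN_TEXT_LENGTH


empty_is_space = "".isspace()  # False: an empty string is not "whitespace"
url_only = '<a href="https://t.co/GovEqhCJZZ">https://t.co/GovEqhCJZZ</a>'
url_only_shareable = is_shareable(url_only)
```

With the length check in place, a URL-only post reduces to an empty string and is rejected, whereas a check of the form "skip if isspace()" would have let it through.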

stephensekula commented 7 years ago

I have run several tests, and the fix seems to hold. URL-only posts are NOT shared between networks now. I have tagged this (https://github.com/stephensekula/navierstokes/releases/tag/v1.1.5) and will close this issue for now.
