travisbrown / cancel-culture

Tools for fighting abuse on Twitter
Mozilla Public License 2.0

Investigate ways to improve deleted-tweets coverage #64

Open travisbrown opened 2 years ago

travisbrown commented 2 years ago

I'm doing a review of content that's missed by the deleted-tweets command provided in twcc, and it suggests some things that could be improved.

For example, twcc currently finds 861 deleted tweets for right-wing troll Andy Ngo, but other datasets I've collected (also derived from the Wayback Machine) turn up an additional 134 deleted tweets that twcc does not find. You can find the text of those additional tweets here.

There seem to be (at least) three reasons that these are missed. The first reason is that twcc is intended to be used as a standalone tool that generally takes a few minutes to run, so it can only look for content that is directly indexed in the Wayback Machine. For example, the earliest missing deleted Ngo tweet is a reply from this archived page. Ngo's tweet was never archived by the Wayback Machine directly before it was deleted, only as a reply on a page archived for another user (@/1lb_cake, who is now suspended), so twcc doesn't know it exists.

It's impossible to find much of this stray content without a lot of custom data collection and indexing, so twcc will always have some gaps. twcc does support using a local store as a cache for snapshots across invocations, though, and it doesn't currently search all content in that store (only the directly indexed snapshots). It wouldn't be too hard to have it check all of the contents of the local store, so that for example if the user had explicitly collected data for @/1lb_cake, the tool would find the missing Ngo tweets in a subsequent run for @/MrAndyNgo.
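To make the local-store idea concrete, here is a rough sketch in Python. It is not twcc's actual code, and the store layout is an assumption: `Tweet`, `find_in_store`, and the mapping from snapshot digest to parsed tweets are all hypothetical names made up for illustration. The point is only that the scan covers every snapshot in the store, not just the ones indexed under the target account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record for one tweet extracted from an archived page.
    id: int
    screen_name: str
    text: str


def find_in_store(store, screen_name, known_ids):
    """Return tweets by `screen_name` found anywhere in the local store.

    `store` is assumed to map a snapshot digest to the list of tweets parsed
    from that snapshot, whichever account the snapshot was collected for.
    Tweets whose IDs are already in `known_ids` are skipped.
    """
    target = screen_name.lower()
    found = {}
    for tweets in store.values():
        for tweet in tweets:
            # Screen names are compared case-insensitively.
            if tweet.screen_name.lower() != target:
                continue
            if tweet.id in known_ids:
                continue
            # The same tweet can appear in several snapshots; keep the first.
            found.setdefault(tweet.id, tweet)
    return sorted(found.values(), key=lambda t: t.id)
```

With a store like this, a run for @/MrAndyNgo would pick up his replies from snapshots that were collected for @/1lb_cake, as long as those snapshots are already in the store.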

The second reason for missing content is that the parser twcc uses to extract data from the HTML snapshots doesn't handle some pages. One example of a format it doesn't handle is tracked in #60, but there could be others. We need to look in more detail at e.g. the CSV list above to see if there are other examples of this issue.

The third reason is that in some cases a tweet may be archived multiple times, with some of the snapshots being broken, and twcc can fail to try the remaining snapshots after running into a broken one. For Ngo's most recent missing deleted tweet, for example, twcc hit this bad snapshot and didn't try either of the other two. This should be the easiest issue to fix, but my impression is that it accounts for only a very small part of the missing content.
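The fix for the third case would be something like the fallback loop below. Again this is only a Python sketch under assumptions, not how twcc is implemented: `BrokenSnapshot`, `first_valid_snapshot`, and the `parse` callback are hypothetical names. A broken snapshot is recorded and skipped instead of ending the search for that tweet.

```python
class BrokenSnapshot(Exception):
    """Raised when a snapshot cannot be parsed (hypothetical error type)."""


def first_valid_snapshot(snapshots, parse):
    """Try each snapshot in order and return the first successful parse.

    Returns a `(result, errors)` pair. `result` is None if every snapshot
    failed, and `errors` lists the `(snapshot, exception)` pairs for the
    snapshots that were skipped along the way.
    """
    errors = []
    for snapshot in snapshots:
        try:
            return parse(snapshot), errors
        except BrokenSnapshot as error:
            # Record the failure and move on to the next snapshot.
            errors.append((snapshot, error))
    return None, errors
```

Returning the skipped snapshots alongside the result would also make it easy to report which archived copies are bad.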