michenriksen / birdwatcher

Data analysis and OSINT framework for Twitter
https://michenriksen.com/blog/birdwatcher-twitter-osint-framework/
MIT License

Ideas for modules that help identify bot accounts #8

Open michael-myers opened 7 years ago

michael-myers commented 7 years ago

I had a few ideas for a Birdwatcher module or set of modules that might be used to automatically identify bot accounts, including the ones used for political propaganda and influence operations.

  1. Activity log analysis: whether the account conforms to a "normal" Twitter activity profile (an obvious wake/sleep schedule) or appears to operate continuously at any hour of the day; and, if there is an activity rhythm, whether it corresponds to the time zone of the location listed in the account's profile (or in its geotagged tweets).
  2. Reverse-image search (via a Google API?) on profile photos: bots, if they use profile images at all, often use images lifted from public photos that others have posted. Google's API might be pretty good at matching cropped photos to the original content.
  3. Twitter client metadata: capture the metadata showing which Twitter client was used to post each tweet. Automated accounts often make use of automation services like Dlvr.it and smqueue. It's not conclusive proof of anything, but it's a factor for suspicion. Accounts within a network of bots will often share the same automation service.
  4. Likes/Favorites and Retweets: there are behavioral clues in how an account uses these features. Not using them at all would be a highly suspicious indicator. Bots also use Likes to boost the signal of the other bots in their network and to push a particular tweet up in the Trending algorithm, so a highly self-referential pattern of likes within a subset of Twitter accounts ought to be another indicator.
  5. Identical tweets: if tweet/status content could be hashed after being fetched, then all tweets with a matching hash could be shown easily. Bot networks often tweet the same messages from their botnet master simultaneously, without using the retweet feature. If two or more accounts post statuses with matching hashes and timestamps within a given range of each other, the accounts are likely controlled by the same individual.
  6. Language credibility analysis: a Georgia Tech researcher is analyzing language for credibility. Like the existing module for sentiment analysis, this would be experimental. It is definitely early work, but maybe he could be invited to integrate his work with Birdwatcher.
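
To make idea 5 a bit more concrete, here is a minimal sketch in Python (not Birdwatcher's own Ruby, and not based on any existing Birdwatcher module). The function names, the tuple layout of `tweets`, the 10-minute default window, and the normalization step (lowercasing, stripping URLs, collapsing whitespace) are all assumptions made for illustration; the sample tweets are made up.

```python
import hashlib
import re
from collections import defaultdict
from datetime import datetime, timedelta


def normalize(text):
    """Lowercase, strip URLs and collapse whitespace (an assumed normalization)."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return " ".join(text.split())


def content_hash(text):
    """SHA-256 hex digest of the normalized tweet text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def duplicate_clusters(tweets, window=timedelta(minutes=10)):
    """Return sorted lists of screen names that posted matching text within `window`.

    `tweets` is an iterable of (screen_name, text, posted_at) tuples.
    The window is measured from the earliest tweet in each group.
    """
    by_hash = defaultdict(list)
    for screen_name, text, posted_at in tweets:
        by_hash[content_hash(text)].append((posted_at, screen_name))

    clusters = []
    for posts in by_hash.values():
        posts.sort()
        group = [posts[0]]
        for post in posts[1:] + [None]:  # None is a sentinel that flushes the last group
            if post is not None and post[0] - group[0][0] <= window:
                group.append(post)
                continue
            names = sorted({name for _, name in group})
            if len(names) >= 2:  # one account repeating itself is not a network
                clusters.append(names)
            group = [post]
    return clusters


# Made-up sample data: "a" and "b" post the same text three minutes apart,
# "c" posts it six hours later and so falls outside the window.
tweets = [
    ("a", "Vote now!  https://example.com/1", datetime(2017, 1, 1, 12, 0)),
    ("b", "vote NOW! https://example.com/2", datetime(2017, 1, 1, 12, 3)),
    ("c", "Vote now!", datetime(2017, 1, 1, 18, 0)),
]
print(duplicate_clusters(tweets))  # [['a', 'b']]
```

Exact hashing only catches verbatim copies after normalization; bots that vary a word or two would need some kind of fuzzy matching instead, which this sketch does not attempt.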

michael-myers commented 7 years ago

Regarding bullet point 2 above, I looked into doing reverse image search using an API, and it does not look feasible. The existing services that offer reverse image search (Google, TinEye, Incandescent aka ImageRaider) charge a minimum of 4¢ per query, with a minimum order of $200 USD. Google also apparently detects and blocks automated queries of this kind.

Regarding bullet point 4, it seems there has been parallel work on this problem in the context of academic "citation cartels": groups of academics who publish papers and trade citations with one another in a non-obvious way designed to boost their rankings. Two proposed solutions exist that might translate to detecting social media bot behavior.
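
As a rough baseline for what "self-referential" could mean in practice, here is a small Python sketch. It is not either of the two citation-cartel proposals mentioned above, just a simple count-based illustration. The function names and the `(liker, author)` pair format are assumptions, and the sample data is made up.

```python
from collections import Counter


def insularity(likes, group):
    """Fraction of likes given by `group` members that land on other group members.

    `likes` is an iterable of (liker, author) pairs; self-likes are ignored.
    A value near 1.0 means the group mostly likes its own content.
    """
    group = set(group)
    given = [(liker, author) for liker, author in likes
             if liker in group and liker != author]
    if not given:
        return 0.0
    inside = sum(1 for _, author in given if author in group)
    return inside / len(given)


def reciprocal_pairs(likes, min_each_way=1):
    """Pairs of accounts that have each liked the other at least `min_each_way` times."""
    counts = Counter((liker, author) for liker, author in likes if liker != author)
    pairs = set()
    for (liker, author), n in counts.items():
        if n >= min_each_way and counts.get((author, liker), 0) >= min_each_way:
            pairs.add(tuple(sorted((liker, author))))
    return sorted(pairs)


# Made-up sample data: a, b and c mostly like each other; x is an outsider.
likes = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "x"), ("b", "c"), ("c", "a")]
print(reciprocal_pairs(likes))            # [('a', 'b'), ('a', 'c')]
print(insularity(likes, {"a", "b", "c"}))
```

Neither number is proof on its own, since groups of real friends also like each other's tweets; it would only be one more factor to weigh alongside the other indicators.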