ryanjgallagher / focalevents

Tools for collecting social media data around focal events
MIT License

Allow for multiple quotes of quotes searches #1

Open ryanjgallagher opened 3 years ago

ryanjgallagher commented 3 years ago

Currently, you can get quotes of quote tweets, but there's no efficient way to continue iterating that process because everything gets labeled as from_quote_search. Two changes can be made:

  1. Add a quote_level column to the database so that you can subset quote tweets by the iteration of the quote search that returned them. For example, tweets retrieved from a search are quote_level 0. Quotes of those tweets are quote_level 1. Quotes of quote_level 1 tweets are quote_level 2. And so on. This involves updating config.py and the insertions into the tweets database in search.py and helper.py.
  2. Allow a user to either iterate on the previous quote level (identified automatically) or iterate up to a certain depth, e.g. the user specifies something like up_to_quote_level=6 and the search automatically gets quotes from levels 1 to 6.
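
A minimal sketch of how the two changes could fit together. It is not the repo's actual code: it uses an in-memory SQLite table as a stand-in for the real tweets database, and the table layout, `fetch_quotes`, `max_quote_level`, and `search_quotes` are all hypothetical names made up for illustration. Only `quote_level` and `up_to_quote_level` come from the proposal above.

```python
import sqlite3

# Hypothetical stand-in for the Twitter API: tweet ID -> IDs of tweets quoting it.
FAKE_QUOTES = {
    "t1": ["q1", "q2"],
    "q1": ["q3"],
    "q3": ["q4"],
}


def fetch_quotes(tweet_id):
    """Return IDs of tweets that quote `tweet_id` (placeholder for a real API call)."""
    return FAKE_QUOTES.get(tweet_id, [])


def max_quote_level(conn):
    """Deepest quote_level stored so far, used to find the 'previous' level automatically."""
    row = conn.execute("SELECT MAX(quote_level) FROM tweets").fetchone()
    return row[0] if row[0] is not None else 0


def search_quotes(conn, up_to_quote_level=None):
    """Collect quotes one level past the deepest stored level, or up to a given depth."""
    start = max_quote_level(conn) + 1
    stop = up_to_quote_level if up_to_quote_level is not None else start
    for level in range(start, stop + 1):
        parents = conn.execute(
            "SELECT id FROM tweets WHERE quote_level = ?", (level - 1,)
        ).fetchall()
        n_new = 0
        for (parent_id,) in parents:
            for quote_id in fetch_quotes(parent_id):
                # OR IGNORE skips tweets that are already stored, so a tweet
                # keeps the first level it was seen at.
                cur = conn.execute(
                    "INSERT OR IGNORE INTO tweets (id, quoted_id, quote_level) "
                    "VALUES (?, ?, ?)",
                    (quote_id, parent_id, level),
                )
                n_new += cur.rowcount
        if n_new == 0:
            break  # nothing new at this level, so deeper levels would be empty too


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id TEXT PRIMARY KEY, quoted_id TEXT, quote_level INTEGER NOT NULL)"
)
conn.execute("INSERT INTO tweets VALUES ('t1', NULL, 0)")  # a tweet from the original search
search_quotes(conn, up_to_quote_level=6)
levels = dict(conn.execute("SELECT id, quote_level FROM tweets"))
```

Calling `search_quotes(conn)` with no argument would only collect the next level after the deepest one stored. Because already-stored tweets are skipped, the loop also stops once a level adds nothing new, even if `up_to_quote_level` has not been reached.
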
ryanjgallagher commented 3 years ago

One thing to keep in mind: a tweet can end up in multiple quote levels. For example, say a search returns a quote of a tweet where both the original tweet and the quote match the given query. Then the quote tweet is quote_level 0 because it's in the original search. However, when getting quotes (quote_level 1), that quote tweet will be returned again because it quotes the original tweet (which was a query match itself). Further, because the quote tweet is also quote_level 0, any tweets that quote it will be returned in the same pass. This leads to an inefficiency: the quote tweet becomes quote_level 1 along with all of the tweets that quoted it, even though those tweets should be one level up. If this isn't handled, it can defeat the purpose of adding a quote_level column for efficiency.

At the time of doing the search, you can't efficiently check whether a quoted tweet has already been seen; similarly, during a stream you don't know whether the quoted or the quoting tweet is the match. So before starting a quote search (even just a regular one), you should first update the quote levels of any quote tweets whose quoted tweet also came directly from the search or stream.
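
One possible reading of that update step, sketched in plain Python rather than against the real database. It assumes the fix is to place a quote tweet one level above the tweet it quotes whenever both came directly from the search or stream. The function name `reconcile_levels` and the `quoted_by_id` mapping are hypothetical, and the rule for breaking quote loops (the last tweet reached in a loop stays at level 0) is an arbitrary choice made here so the walk always terminates.

```python
def reconcile_levels(quoted_by_id):
    """Assign quote levels to tweets that all came directly from a search or stream.

    `quoted_by_id` maps each collected tweet ID to the ID of the tweet it
    quotes, or None if it is not a quote. A tweet whose quoted tweet is also
    in the collection is placed one level above that tweet; all others are 0.
    """
    levels = {}
    for start in quoted_by_id:
        chain = []
        current = start
        # Walk up the quote chain until reaching a tweet with a known level,
        # a tweet whose quoted tweet was not collected, or a loop.
        while current not in levels and current not in chain:
            chain.append(current)
            parent = quoted_by_id.get(current)
            if parent is None or parent not in quoted_by_id:
                break
            current = parent
        # If the walk stopped at a known level, build on it; otherwise the
        # last tweet in the chain is treated as a level 0 root.
        base = levels[current] if current in levels else -1
        for tweet_id in reversed(chain):
            base += 1
            levels[tweet_id] = base
    return levels


# "q" quotes "t" and both matched the query, so "q" moves up to level 1.
# "x" quotes a tweet that was never collected, so it stays at level 0.
collected = {"t": None, "q": "t", "r": "q", "x": "outside"}
fixed = reconcile_levels(collected)

# A circular pair still terminates.
looped = reconcile_levels({"a": "b", "b": "a"})
```

The resulting levels could then be written back to the quote_level column before the next quote search runs.
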

It's possible you might run into similar issues depending on how people do circular replies and quotes. So think about whether that could lead to circular or infinite queries.

ryanjgallagher commented 3 years ago

Can get rid of get_quotes_of_quotes and just let the user specify the quote_level, defaulting to quote_level 1.