gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Improve performance of (custom) dataset extraction #84

Open · dolsysmith opened this issue 3 years ago

dolsysmith commented 3 years ago

Some notes:

dolsysmith commented 3 years ago

Possible future workflow:

- If the dataset extract consists of fewer than N Tweets:
- Otherwise:
  - Write the full Tweet CSV

lwrubel commented 3 years ago

Deferred in favor of #89.

dolsysmith commented 3 years ago

One possible approach to making custom extracts more fault tolerant would be to use the search_after functionality in Elasticsearch to retrieve results (as opposed to the Scroll API). This approach requires that results be sorted on a unique identifier -- I think tweet_id would work for this -- and it allows for pagination based on the last-seen (sorted) result. Reading the description, I wonder if we could use this to resume extracts that get interrupted, e.g., by setting the search_after parameter to the sort value of the last result written to disk before the job stopped.

FWIW, this functionality is recommended for "deep pagination" in lieu of the Scroll API in more recent versions of Elasticsearch.
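
A minimal sketch of what that might look like with the elasticsearch-py client; the endpoint, index layout, and `extract_tweets` helper here are illustrative assumptions, not the current TweetSets code, and they presume tweet_id is indexed as a sortable field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def extract_tweets(index, query, resume_after=None, page_size=1000):
    """Yield all matching tweets, sorted on the unique tweet_id field.

    resume_after is the sort value of the last tweet written to disk,
    letting an interrupted extract pick up where it left off.
    """
    body = {
        "size": page_size,
        "query": query,
        # search_after requires sorting on a unique field.
        "sort": [{"tweet_id": "asc"}],
    }
    if resume_after is not None:
        body["search_after"] = [resume_after]
    while True:
        page = es.search(index=index, body=body)
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]
        # Each hit carries its sort values; feeding the last one back in
        # fetches the next page.
        body["search_after"] = hits[-1]["sort"]
```

Persisting that last sort value alongside the partial CSV is what would let a restarted job pass it back in as resume_after.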

dolsysmith commented 3 years ago

On random sampling in Elasticsearch: it is possible to generate a random score, which seems like it might be useful: the random_score function produces a uniformly distributed score between 0 and 1. To sample a fraction N of the documents, we could take all documents with a score >= 1 - N. (For example, to sample 25% of the documents, we would take those with a randomized score >= 0.75.)
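
A sketch of that threshold expressed as a function_score query with a min_score cutoff, reusing the assumed client from above; the seed, field, and index name are illustrative:

```python
sample_fraction = 0.25  # keep roughly 25% of matching documents

body = {
    # Drop any document whose (random) score falls below 1 - N.
    "min_score": 1 - sample_fraction,
    "query": {
        "function_score": {
            "query": {"match_all": {}},  # stand-in for the dataset's actual query
            # Seeding against a field makes the sample reproducible across runs.
            "random_score": {"seed": 42, "field": "tweet_id"},
            # Use the random score on its own rather than multiplying it
            # into the relevance score.
            "boost_mode": "replace",
        }
    },
}
results = es.search(index="tweets", body=body)
```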