gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Improve performance of (custom) dataset extraction #84

Open · dolsysmith opened this issue 3 years ago

dolsysmith commented 3 years ago

Some notes:

dolsysmith commented 3 years ago

Possible future workflow:

- If the dataset extract consists of fewer than N Tweets:
- Otherwise:
  - Write the full Tweet CSV

lwrubel commented 3 years ago

Deferred in favor of #89.

dolsysmith commented 3 years ago

One possible approach to making custom extracts more fault tolerant would be to use the search_after functionality in Elasticsearch to retrieve results (as opposed to the Scroll API). This approach requires that results be sorted on a unique identifier -- I think tweet_id would work for this -- and it allows for pagination based on the last-seen (sorted) result. Reading the description, I wonder if we could use this to resume extracts that get interrupted, e.g., by setting the search_after parameter to the sort value of the last result written to disk before the job stopped.

FWIW, this functionality is recommended for "deep pagination" in lieu of the Scroll API in more recent versions of Elasticsearch.
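
A minimal sketch of what that might look like with the elasticsearch-py client; the endpoint, index layout, and `extract_tweets` helper here are illustrative assumptions, not the current TweetSets code, and they presume tweet_id is indexed as a sortable field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def extract_tweets(index, query, resume_after=None, page_size=1000):
    """Yield all matching tweets, sorted on the unique tweet_id field.

    resume_after is the sort value of the last tweet written to disk,
    letting an interrupted extract pick up where it left off.
    """
    body = {
        "size": page_size,
        "query": query,
        # search_after requires sorting on a unique field.
        "sort": [{"tweet_id": "asc"}],
    }
    if resume_after is not None:
        body["search_after"] = [resume_after]
    while True:
        page = es.search(index=index, body=body)
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]
        # Each hit carries its sort values; feeding the last one back in
        # fetches the next page.
        body["search_after"] = hits[-1]["sort"]
```

Persisting that last sort value alongside the partial CSV is what would let a restarted job pass it back in as resume_after.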

dolsysmith commented 3 years ago

On random sampling in Elasticsearch: it is possible to generate a random score, which seems like it might be useful: the random_score function produces a uniformly distributed score between 0 and 1. To sample a fraction N of the documents, we could take all documents with a score >= 1 - N. (For example, to sample 25% of the documents, we would take those with a randomized score >= 0.75.)
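
A sketch of that threshold expressed as a function_score query with a min_score cutoff, reusing the assumed client from above; the seed, field, and index name are illustrative:

```python
sample_fraction = 0.25  # keep roughly 25% of matching documents

body = {
    # Drop any document whose (random) score falls below 1 - N.
    "min_score": 1 - sample_fraction,
    "query": {
        "function_score": {
            "query": {"match_all": {}},  # stand-in for the dataset's actual query
            # Seeding against a field makes the sample reproducible across runs.
            "random_score": {"seed": 42, "field": "tweet_id"},
            # Use the random score on its own rather than multiplying it
            # into the relevance score.
            "boost_mode": "replace",
        }
    },
}
results = es.search(index="tweets", body=body)
```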