dolsysmith opened 3 years ago
- If the dataset extract consists of fewer than N Tweets
- Otherwise, write the full Tweet CSV
Deferred in favor of #89.
One possible approach to making custom extracts more fault tolerant would be to use the `search_after` functionality in Elasticsearch to retrieve results (as opposed to the Scroll API). This approach requires that results be sorted on a unique identifier -- I think `tweet_id` would work for this -- and it allows for pagination based on the last-seen (sorted) result. Reading the description, I wonder if we could use this to resume extracts that get disrupted, e.g., by setting the `search_after` parameter to the last result written to disk before the job was interrupted.
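A minimal sketch of what resuming with `search_after` might look like via `elasticsearch_dsl` (the host, index name, and page size are placeholders; using `tweet_id` as the unique sort key follows the suggestion above):

```python
from elasticsearch_dsl import Search, connections

connections.create_connection(hosts=["http://localhost:9200"])  # placeholder host

def iter_tweets(index="tweets", page_size=1000, last_sort=None):
    """Yield (hit, sort_values) pairs, paging with search_after.

    last_sort is the sort value of the last document already written
    to disk; passing it back in resumes the extract from that point.
    """
    while True:
        s = Search(index=index).sort("tweet_id").extra(size=page_size)
        if last_sort is not None:
            s = s.extra(search_after=last_sort)  # resume after the checkpoint
        hits = s.execute().hits
        if not hits:
            return  # no more pages
        for hit in hits:
            last_sort = list(hit.meta.sort)  # checkpoint value for the caller
            yield hit, last_sort
```

The extract task would persist `last_sort` alongside each batch it writes; after an interruption, it passes the saved value back in and continues from the next unwritten Tweet.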
FWIW, this functionality is recommended for "deep pagination" in lieu of the Scroll API in more recent versions of Elasticsearch.
On random sampling in Elasticsearch: it is possible to assign each document a random score via the `random_score` function, which seems like it might be useful: the function provides a uniformly distributed score between 0 and 1. To get a fraction p of the documents, we could take all documents with a score >= 1 - p. (For example, if we want 25% of the documents, we should take documents with a randomized score >= 0.75.)
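As a sketch, that sampling could be expressed as a `function_score` query whose `random_score` value replaces the normal relevance score, filtered with `min_score` (the index name, seed, and field here are assumptions; the field just makes the seed reproducible):

```python
from elasticsearch_dsl import Search, SF
from elasticsearch_dsl.query import Q

# Score every document with a uniform random value in [0, 1); a fixed
# seed plus a field makes the same documents score the same way on
# repeated requests.
q = Q(
    "function_score",
    query=Q("match_all"),
    functions=[SF("random_score", seed=42, field="_seq_no")],
    boost_mode="replace",  # discard relevance; keep only the random value
)
# Keep documents scoring >= 0.75, i.e., roughly a 25% sample.
s = Search(index="tweets").query(q).extra(min_score=0.75)
```

Since each document clears the threshold independently, the result is approximately (not exactly) 25% of the matching documents.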
Some notes:
- `tasks.py` (line 42) uses the `search.scan` iterator from `elasticsearch_dsl`. The default `size` parameter retrieves batches of 1,000 results; maybe a larger value for `size` would result in greater efficiency.
- The `scan` API also takes a `scroll` parameter, set by default to `'5m'`. According to the documentation, this parameter "specif[ies] how long a consistent view of the index should be maintained for scrolled search." If our data extraction tasks are taking much longer than 5 minutes (as it seems they are), does this pose any problems? (See the sketch below.)
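If I'm reading `elasticsearch_dsl` correctly, keyword arguments set with `Search.params()` are forwarded to the underlying `elasticsearch.helpers.scan` call, so both knobs could be tuned without dropping out of the DSL. A sketch (the index name and values are placeholders):

```python
from elasticsearch_dsl import Search

s = Search(index="tweets")  # placeholder index

# size: hits fetched per scroll request. scroll: how long Elasticsearch
# keeps the scroll context alive between requests -- the keep-alive is
# renewed on every scroll call, so it only needs to outlast the
# processing of one batch, not the whole extract.
for hit in s.params(size=5000, scroll="15m").scan():
    print(hit.meta.id)  # stand-in for the real per-Tweet CSV writing
```

That also suggests an answer to the 5-minute question: because the window resets with every batch, a multi-hour extract should be fine as long as each individual batch is consumed within the scroll window.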