woolfg opened 1 year ago
One way to do it would be to introduce a `SamplingGenerator` or `DateGenerator`. It would yield the next date for sampling.
```python
import datetime


class DateGenerator:
    """
    Yields a sampled list of dates in the format YYYY-MM-DD.
    Takes a start date and an end date as arguments.
    Also takes a sampling rate, which determines the number of days to skip
    ahead by. For example, if the sampling rate is 2, the generator will
    yield every other date between the start and end dates.
    """

    def __init__(self, start_date, end_date, sampling_rate=1):
        self.start_date = start_date
        self.end_date = end_date
        self.sampling_rate = sampling_rate

    def __iter__(self):
        return self

    def __next__(self):
        if self.start_date > self.end_date:
            raise StopIteration
        date = self.start_date
        self.start_date += datetime.timedelta(days=self.sampling_rate)
        return date.strftime("%Y-%m-%d")


if __name__ == "__main__":
    start_date = datetime.date(2017, 1, 1)
    end_date = datetime.date(2017, 1, 10)
    for date in DateGenerator(start_date, end_date, 1):
        print(date)
```
This would yield every day between the two dates. With a sampling rate of 2, it would skip one day on each iteration. This way we can control the number of data points we want to fetch.
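If the goal is a fixed number of data points rather than a fixed rate, the sampling rate could be derived from the length of the range. A minimal sketch (the function name and `target_points` parameter are hypothetical, not part of any existing API):

```python
import datetime


def rate_for_target(start_date, end_date, target_points):
    """Pick a sampling rate so that at most `target_points` dates
    are yielded between start_date and end_date (inclusive)."""
    total_days = (end_date - start_date).days + 1
    # ceil division: skip just enough days to stay at or under the target
    return max(1, -(-total_days // target_points))


# e.g. reduce 31 days to at most 10 data points -> sample every 4th day
rate = rate_for_target(datetime.date(2017, 1, 1), datetime.date(2017, 1, 31), 10)
print(rate)  # 4
```

The resulting rate could then be passed straight to `DateGenerator` as `sampling_rate`.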
There are other options, too, of course.
The trickier part is the processing and distribution of the data. We also have to mark the sampled data in the DB somehow, and it shouldn't break any graphs, frontends, or API consumers (clients).
If we can reduce the number of requests needed to init a new podcast, it would speed things up a lot. Consider the example of fetching 50 days for 98 episodes:

5 podcast requests + 50 podcast aggregate + 98 × (3 requests + 50 aggregate) = 55 + 98 × 53 = 5249 requests × 2 s ≈ 3 h
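The estimate above can be reproduced in a couple of lines (numbers taken directly from the example: 98 episodes, 50 days, ~2 s per request):

```python
# request-count estimate from the example above
podcast_requests = 5 + 50           # 5 podcast requests + 50 aggregate days
per_episode = 3 + 50                # 3 requests + 50 aggregate days per episode
total = podcast_requests + 98 * per_episode
print(total)                        # 5249
print(round(total * 2 / 3600, 1))   # ~2.9 hours at 2 s per request
```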
The aggregate API fetches detailed gender/age information for one day. As streams, listeners, etc. are covered by other APIs, this is just the fine-grained gender/age data. The request count could be reduced by, e.g., fetching one week at a time and distributing the aggregated data of 7 days across the respective days.
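Distributing a weekly aggregate back onto individual days could look roughly like this. A sketch only: the even split across 7 days is an assumption, and the real distribution key (e.g. weighting by daily streams) would still have to be decided:

```python
import datetime


def distribute_weekly(week_start, weekly_counts):
    """Spread aggregated gender/age counts of one week across its 7 days.
    Assumption: a uniform split; a weighted split is equally possible."""
    days = [week_start + datetime.timedelta(days=i) for i in range(7)]
    return {
        day: {key: value / 7 for key, value in weekly_counts.items()}
        for day in days
    }


# hypothetical weekly aggregate for one gender/age bucket pair
week = distribute_weekly(
    datetime.date(2017, 1, 2),
    {"female_25_34": 140, "male_25_34": 70},
)
print(week[datetime.date(2017, 1, 5)])  # {'female_25_34': 20.0, 'male_25_34': 10.0}
```

Rows produced this way would need the "sampled/distributed" marker mentioned above so clients can tell them apart from exact daily data.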
This could be achieved by e.g.: