openpodcast / roadmap

πŸ—ΊπŸŽ™ Project Roadmap and Milestones for Open Podcast

allow sampling to fetch historical data #97

Open woolfg opened 1 year ago

woolfg commented 1 year ago

If we can reduce the number of requests needed to init a new podcast, it would speed things up a lot:

Consider the example of fetching 50 days of data for a podcast with 98 episodes:

DEFAULT_EPISODE_ENDPOINTS="detailedStreams,listeners,performance,aggregate"
DEFAULT_PODCAST_ENDPOINTS="metadata,detailedStreams,listeners,aggregate,followers,episodes"

5 podcast requests + 50 podcast aggregate requests + 98 episodes Γ— (3 requests + 50 aggregate requests) = 55 + 98 Γ— 53 = 5249 requests; at ~2 s per request that's roughly 3 hours.
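
For reference, the same back-of-the-envelope math as a snippet (the function name estimated_runtime is just for illustration; the endpoint counts come from the defaults above):

def estimated_runtime(days=50, episodes=98, secs_per_request=2):
    # Each non-aggregate endpoint is one request; the aggregate
    # endpoint is one request per day.
    podcast = 5 + days             # 5 podcast endpoints + daily aggregates
    per_episode = 3 + days         # 3 episode endpoints + daily aggregates
    total = podcast + episodes * per_episode
    return total, total * secs_per_request / 3600

total, hours = estimated_runtime()
print(f"{total} requests, ~{hours:.1f} h")  # 5249 requests, ~2.9 h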

The aggregate API fetches detailed gender/age information for a single day. As streams, listeners, etc. are covered by other APIs, this is just the fine-grained gender/age data. The number of requests could be reduced by, e.g., fetching one week at a time and distributing the aggregated data of those 7 days back onto the respective days.

This could be achieved in a few different ways; one is sketched below.
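
A minimal sketch of the even-split idea (the helper name distribute_aggregate and the key format are invented for the example):

import datetime

def distribute_aggregate(window_start, days, totals):
    # Spread aggregated counts (e.g. {"female_18-22": 70}) covering
    # `days` consecutive days evenly back onto the individual days.
    # Remainders go to the earliest days so the per-day values still
    # sum to the window total.
    per_day = []
    for offset in range(days):
        date = window_start + datetime.timedelta(days=offset)
        row = {"date": date.isoformat()}
        for key, total in totals.items():
            base, rem = divmod(total, days)
            row[key] = base + (1 if offset < rem else 0)
        per_day.append(row)
    return per_day

# Example: one weekly aggregate becomes seven daily rows.
for row in distribute_aggregate(datetime.date(2023, 1, 2), 7, {"female_18-22": 36}):
    print(row)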

mre commented 1 year ago

One way to do it would be to introduce a SamplingGenerator or DateGenerator. It would yield the next date to sample.

import datetime

class DateGenerator:
    """
    Yields sampled dates in the format YYYY-MM-DD.
    Takes a start date and an end date as arguments
    Also takes a sampling rate, which determines the number of days to skip
    ahead by. For example, if the sampling rate is 2, the generator will
    yield every other date between the start and end dates.
    """

    def __init__(self, start_date, end_date, sampling_rate=1):
        self.start_date = start_date
        self.end_date = end_date
        self.sampling_rate = sampling_rate

    def __iter__(self):
        return self

    def __next__(self):
        if self.start_date > self.end_date:
            raise StopIteration
        else:
            date = self.start_date
            self.start_date += datetime.timedelta(days=self.sampling_rate)
            return date.strftime("%Y-%m-%d")

if __name__ == "__main__":
    start_date = datetime.date(2017, 1, 1)
    end_date = datetime.date(2017, 1, 10)
    for date in DateGenerator(start_date, end_date, 1):
        print(date)

This would yield every day between the two dates. If sampling_rate is 2, it skips one day on each iteration. This way we can control the number of data points we want to fetch.
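
Continuing the snippet above, sampling_rate=2 over the same range prints every other day:

for date in DateGenerator(datetime.date(2017, 1, 1), datetime.date(2017, 1, 10), 2):
    print(date)
# 2017-01-01, 2017-01-03, 2017-01-05, 2017-01-07, 2017-01-09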

There are other options, too, of course.

woolfg commented 1 year ago

The trickier part is processing and distributing the data. We also have to mark the sampled data in the DB somehow, and it shouldn't break any graphs, frontends, or API consumers (clients).
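
One possible way to do the marking (purely a sketch; the table and column names are invented): store the width of the sampling window next to each aggregate row, so 1 means a real daily value and anything larger means a distributed one. Existing clients that don't know the column keep working unchanged.

# Hypothetical migration; aggregate_stats / sample_window_days are invented names.
# sample_window_days = 1 marks a real daily value, 7 a value distributed
# from a weekly aggregate. Clients that ignore the column are unaffected.
MIGRATION = """
ALTER TABLE aggregate_stats
    ADD COLUMN sample_window_days INTEGER NOT NULL DEFAULT 1;
"""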