olaven / nrss

Get RSS feeds for NRK's vendor-locked "podcasts".
GNU Affero General Public License v3.0

Caching feed data #22

Open olaven opened 5 months ago

olaven commented 5 months ago

Subtasks

Description

NRSS does not work as well as it should because it gets rate limited by NRK's servers. The current implementation hits NRK's servers every time a client fetches a feed through the NRSS API (which in turn calls NRK's API). @timharek suggested storing the data in Deno KV and refreshing it periodically with Deno Cron. While I believe this is a good idea, I suggest that we fetch the latest episodes on demand, not periodically. Additionally, we can use Deno's queueing system to fetch historical episode data without 1) blocking the user/podcatcher or 2) hitting NRK too hard.

Data requirements

This is the data we use to generate an entry in a podcast feed, which we'd need to store.

type Serie = {
  title: string,
  link: string,
  description: string,
  imageUrl: string,
}

type Episode = {
  title: string, 
  link: string, 
  description: string, 
  id: string, 
  date: Date, 
  durationInSeconds: number, 
}
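As a sketch of how the two types above could live in Deno KV (the key layout, helper names, and the one-hour freshness rule are my assumptions, nothing here is settled yet):

```typescript
// Hypothetical Deno KV key layout: the value at serieKey(slug) would be a
// Serie, and the value at episodeKey(slug, id) an Episode. Deno KV keys
// are arrays of key parts.
export const serieKey = (slug: string) => ["series", slug];
export const episodeKey = (slug: string, episodeId: string) => [
  "episodes",
  slug,
  episodeId,
];

// The "last fetched <= 1h" check from the flow below, as a pure function.
export const isFresh = (
  lastFetched: Date,
  now: Date,
  maxAgeMs = 60 * 60 * 1000, // one hour
) => now.getTime() - lastFetched.getTime() <= maxAgeMs;
```

Keeping episodes under their own keys (rather than one big array per serie) would let the background backfill write old episodes one at a time without rewriting the whole serie entry.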

Getting feed / on demand storing

This is the storage flow I imagine.

sequenceDiagram
    podcatcher ->> nrss: GET /feed
    alt feed in db AND last fetched <= 1h ago
        nrss -->> podcatcher: return the feed
    else
        nrss ->> nrk: GET _latest_ episodes
        nrk -->> nrss: episode data
        nrss ->> nrss: store unseen episodes
        note right of nrss: do in the background
        nrss ->> nrss: Deno Queue job to download historic episodes
        nrss -->> podcatcher: return the feed
    end
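The flow above could be sketched in TypeScript roughly like this. The `FeedStore` interface and all names are my assumptions for illustration; in practice the store would wrap Deno KV and `enqueueBackfill` would use Deno Queues:

```typescript
type StoredFeed = {
  xml: string;          // the rendered RSS feed
  lastFetched: number;  // epoch millis
};

// Abstraction over Deno KV + Deno Queues, so the flow is testable
// without the network or a real KV database.
interface FeedStore {
  get(slug: string): Promise<StoredFeed | null>;
  set(slug: string, feed: StoredFeed): Promise<void>;
  enqueueBackfill(slug: string): Promise<void>;
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// The GET /feed handler logic from the diagram.
async function getFeed(
  slug: string,
  store: FeedStore,
  fetchLatestFromNrk: (slug: string) => Promise<string>,
  now = Date.now(),
): Promise<string> {
  const cached = await store.get(slug);
  if (cached && now - cached.lastFetched <= ONE_HOUR_MS) {
    return cached.xml; // feed in db AND fresh: no NRK call at all
  }
  const xml = await fetchLatestFromNrk(slug);       // GET _latest_ episodes
  await store.set(slug, { xml, lastFetched: now }); // store unseen episodes
  await store.enqueueBackfill(slug);                // background: historic episodes
  return xml;
}
```

Injecting the NRK call as a parameter is just to keep the cache logic unit-testable; the real handler would call NRK's API directly.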

Getting historic data

sequenceDiagram
    note right of nrss: might have to spread this out over time, to avoid rate limits etc.
    nrss ->> nrk: GET _all_ episodes
    nrss ->> nrss: store old episodes in the feed

EDIT: added subtasks.

olaven commented 5 months ago

@timharek does this make sense to you? I'm very happy to continue the discussion before starting on an implementation!

timharek commented 5 months ago

Awesome, I like and appreciate the detailed summary!

If I understand you correctly:

  1. Use Deno KV as a storage system for NRSS.
  2. Fetch a serie/show on demand when a user requests the feed.
  3. Use Deno Queue to fetch episodes beyond the 20 included when requesting a serie/show.

A pseudo-implementation could look like the following then for each scenario:

1. New GET /feed/{showSlug}

const show = await kv.get(["shows", showSlug]); // Deno KV keys are arrays

if (!show.value) {
  // Do initial set up for new show in Deno KV
  return;
}

if (show.value.lastFetched >= oneHourAgo) {
  // Fetched within the last hour: serve straight from the cache
  return show.value;
}

// Stale: ask NRK for the latest episode
const latestEpFromNRK = await getLatestEpFromNRK(showSlug);
if (latestEpFromNRK.id === show.value.episodes.at(-1).id) {
  // Nothing new upstream
  return show.value;
}

show.value.episodes.push(latestEpFromNRK);
await kv.set(["shows", showSlug], { ...show.value, lastFetched: Date.now() });
return show.value;

2. Getting historic data (allowing for more than 20 episodes)

Iterate over all the episodes in a serie/show with Deno.Queue, with a delay between jobs that avoids NRK's rate limiting.

I don't have a clear pseudo-implementation for this.
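One way it could look: split the serie's full episode list into pages and enqueue one job per page with an increasing delay, so the fetches trickle out instead of bursting against NRK. The page size, spacing, and all names below are assumptions; with Deno Queues, `delayMs` would be passed as the `delay` option to `kv.enqueue`:

```typescript
type BackfillJob = {
  slug: string;
  page: number;     // which page of episodes to fetch from NRK
  delayMs: number;  // hypothetical: passed as { delay } to kv.enqueue
};

// Plan one queue job per page of historic episodes, spaced apart in time.
function planBackfill(
  slug: string,
  totalEpisodes: number,
  pageSize = 20,
  spacingMs = 60_000, // assumed spacing; tune against NRK's actual limits
): BackfillJob[] {
  const pages = Math.ceil(totalEpisodes / pageSize);
  return Array.from({ length: pages }, (_, page) => ({
    slug,
    page,
    delayMs: page * spacingMs,
  }));
}
```

A `listenQueue` handler would then fetch its page from NRK, store the episodes, and nothing more, so a failed page can be retried in isolation.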

Summary

Based on your diagrams and summary, @olaven, I think this is very doable! I think we can do this in two separate phases/branches, so that the new changes won't be too big for one "update".

Have I understood your plan correctly? 😊

olaven commented 5 months ago

Great! Thanks for the detailed response. The pseudocode was especially useful.

Have I understood your plan correctly? 😊

You have!

Based on your diagrams and summary, @olaven, I think this is very doable! I think we can do this in two separate phases/branches, so that the new changes won't be too big for one "update".

Good idea. In my mind it makes sense to do the new GET /feed/{slug} first and then tackle the queue-system afterwards. I'll update this issue so that it becomes an "epic" issue for the rewrite. Testing (like you mentioned in #18) is also a natural part of this, making the program more robust.

I believe I can fit in some time for the new feed endpoint this weekend! However, @timharek, how involved do you want to be? Since you're kind enough to contribute, it's important to me that you work on what you want, if you want. Which parts would you like to work on? "None" is also a perfectly good answer ☺️ You've helped a lot already!

Again, do whatever you want, but let me know so we don't work on the same thing.

olaven commented 5 months ago

(I've added #18, #23 and #24 as subtasks)

timharek commented 5 months ago

In my mind it makes sense to do the new GET /feed/{slug} first and then tackle the queue-system afterwards.

I agree! It will make a good foundation for the queue-system 😊

how involved do you want to be?

It varies how much time I have available right now (I'm on paternity leave 😅). But I think I might be able to sneak in a foundation/starter for unit tests for where the project is now – if that's helpful.

I don't want to commit to either #23 or #24 if you are able to do either (or both) this weekend. However, I'm more than willing to do a code review afterwards if you want me to have a look, and I can also help along the way if you get stuck or need sparring.

olaven commented 5 months ago

@timharek sounds great! I'll assign #18 to you and #23 to myself, and then we'll see when and where we go from there :) There's no time pressure on anything, naturally.

olaven commented 5 months ago

@timharek this is going well! Thanks for taking the initiative to improve stuff. I'll try to set aside some time to work on #24, but I'm not 100% sure on when.

Probably this weekend or the next one.

Keep up the good work for as long as you like! ☺️