storacha / w3name

IPNS client and service for generating, updating, and tracking immutable names with signed updates

Implement 24h complete rebroadcast of IPNS records to the DHT #5

Closed mbommerez closed 2 years ago

mbommerez commented 2 years ago

Context:

w3name will rebroadcast IPNS records to the DHT every 24 hours, so users don't have to do it themselves.

Scope of this ticket:

Acceptance criteria:

mbommerez commented 2 years ago

We might need to split this one into 2 parts:

adamalton commented 2 years ago

After some experimenting and exploring, here's the low-down and a plan:

The good news

The limitations

Possible solutions

1. External cron trigger which relays the cursor

We could create a cron job on the ipns-publisher side, or in GH Actions, which calls the CF Worker. The CF Worker would then process some of the DOs, and would return the cursor for the DO IDs query back so that ipns-publisher/GH could then call the CF Worker again to continue where it left off.
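
For illustration, that relay might look roughly like this (a sketch only - listRecordKeys and publishRecords are hypothetical stand-ins for the DO IDs query and the call out to ipns-publisher):

// Sketch only: listRecordKeys and publishRecords are hypothetical stand-ins
// for the DO IDs query and the call out to ipns-publisher.
interface Env {}

declare function listRecordKeys (env: Env, cursor?: string): Promise<{ keys: string[], nextCursor?: string }>
declare function publishRecords (env: Env, keys: string[]): Promise<void>

export default {
  async fetch (request: Request, env: Env): Promise<Response> {
    // The external cron (ipns-publisher or GH Actions) passes back the cursor
    // it got from the previous response, or nothing on the first call.
    const cursor = new URL(request.url).searchParams.get('cursor') ?? undefined

    const { keys, nextCursor } = await listRecordKeys(env, cursor)
    await publishRecords(env, keys)

    // Hand the cursor back out so the caller can call us again to continue.
    return new Response(JSON.stringify({ nextCursor }), {
      headers: { 'Content-Type': 'application/json' }
    })
  }
}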

I don't like this because (1) it's leaking the implementation detail of the DO IDs query out to another system, and (2) as it wouldn't use a CF cron job, the Worker would be limited to 30 seconds of CPU rather than the 15 minutes that a cron job has. Maybe that would be enough given the 499/998 limit, but it doesn't feel great.

2. "Tree cron"

We could have a CF cron job which fetches the DO IDs, and then rather than processing them directly, makes a fetch call to the worker passing a batch of (say 995) DO IDs for it to process. Assuming that the subrequests limit doesn't cascade down, this would allow the top-level cron job to do just under 1000 batches, and then each worker request could process almost 500/1000 DOs, giving us a total capacity of 995*1000= 995,000 records. We could even extend the tree to multiple levels, allowing for greater capacity.
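
Roughly, the fan-out could look like this (a sketch only - listAllDOIds, processDO and WORKER_URL are illustrative names, not real code):

// Sketch of the "tree cron" fan-out. listAllDOIds and processDO are
// hypothetical stand-ins; WORKER_URL is whatever URL routes back to this worker.
interface Env { WORKER_URL: string }

declare function listAllDOIds (env: Env): Promise<string[]>
declare function processDO (env: Env, id: string): Promise<void>

const BATCH_SIZE = 995 // stay just under the ~1,000 subrequest limit per invocation

export default {
  // Top level: the CF cron trigger. It only makes one subrequest per batch,
  // so it can dispatch just under 1,000 batches.
  async scheduled (controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    const ids = await listAllDOIds(env)
    for (let i = 0; i < ids.length; i += BATCH_SIZE) {
      const batch = ids.slice(i, i + BATCH_SIZE)
      await fetch(env.WORKER_URL, { method: 'POST', body: JSON.stringify(batch) })
    }
  },

  // Second level: each request processes one batch of DO IDs directly.
  async fetch (request: Request, env: Env): Promise<Response> {
    const batch: string[] = await request.json()
    for (const id of batch) {
      await processDO(env, id)
    }
    return new Response('ok')
  }
}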

Better, but not wonderful. Still feels like we're fighting the limits.

3. Store the cron state

We can get around all of these issues if we simply have something in CF which stores where we've got to. Using a DO would suffice. So we'd store something like:

{
    "last_ran_at": "2022-07-12T11:19:17.427Z",
    "reached_last_record": false,
    "cursor_of_next_batch": "abc...xyz",
    "worker_in_progress": false
}

We could then have a CF cron job which runs every 5 minutes/1 hour/whatever, which fetches that state object, has a look at it, and decides what to do (e.g. process next batch). This would allow us to avoid being bitten by any of the limits and without having to pass query cursors out to an external system or have a convoluted tree of tasks of tasks.
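
As a sketch (fetchState, saveState and processBatch are hypothetical helpers wrapping the state DO and the batch publishing), the scheduled handler could look something like:

// Sketch of option 3: a cron trigger that reads the shared state object,
// decides what to do, and records where it got to.
interface CronState {
  last_ran_at: string
  reached_last_record: boolean
  cursor_of_next_batch?: string
  worker_in_progress: boolean
}

interface Env { CRON_STATE: DurableObjectNamespace }

declare function fetchState (env: Env): Promise<CronState>
declare function saveState (env: Env, state: CronState): Promise<void>
// Processes one batch of records and returns the cursor for the next batch,
// or undefined once we've reached the last record.
declare function processBatch (env: Env, cursor?: string): Promise<string | undefined>

export default {
  async scheduled (controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    const state = await fetchState(env)

    // Don't start a second pass while one is still running.
    if (state.worker_in_progress) return

    await saveState(env, { ...state, worker_in_progress: true, last_ran_at: new Date().toISOString() })

    const nextCursor = await processBatch(env, state.cursor_of_next_batch)

    await saveState(env, {
      last_ran_at: new Date().toISOString(),
      reached_last_record: nextCursor === undefined,
      cursor_of_next_batch: nextCursor,
      worker_in_progress: false
    })
  }
}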

The main potential pitfalls I can see here are:

4. Durable Object Alarms

This is probably easier to implement, but also easier to get ourselves stuck in a corner with.

When we create each DO we would call setAlarm() on it, scheduling it for 12 hours' time. Each time the alarm is run, we would publish the record to ipns-publisher and then schedule the alarm for another 12 hours. Easy. The downsides are:

Decision

IMO it's between 3 and 4. The ease of solution 4 is quite compelling, but I think there's a high chance we might end up implementing 3 at some point in order to solve one of the mentioned pitfalls. So I think it comes down to whether we want to take the quick win now with the risk of doing more work later, or go for the more time consuming but more robust solution now.

Side note: We should add a lastRepublished attribute to the IPNSRecord DO to keep track, which will be handy regardless of which solution we choose.
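
For example (a sketch - the function names and storage key are illustrative), whatever code publishes a record to ipns-publisher could also do:

// Sketch only: inside the IPNSRecord DO, after a successful republish,
// store when it happened so a later pass can skip recently-published records.
async function recordRepublish (state: DurableObjectState): Promise<void> {
  await state.storage.put('lastRepublished', new Date().toISOString())
}

async function getLastRepublished (state: DurableObjectState): Promise<string | undefined> {
  return state.storage.get<string>('lastRepublished')
}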

alanshaw commented 2 years ago

I talked this over with François; alarms are IMO a bit risky - I don't know what happens when you migrate the DO to a new version - all existing instances and (I assume) alarms would be cleared out and we'd need a way to reinstantiate them.

I'm not super keen on cron state, but I do believe a cron is the simplest thing we can do to get this working, and it would allow us to publish around 5,000,000 IPNS keys (which should keep us going for a while). So specifically:

  1. Create a cron that triggers every 12 hours
  2. Query DO storage to fetch a page of 10,000 items
  3. POST to a batch ipns-publisher endpoint for publishing records
  4. Goto 2

You get 1,000 subrequests, so you can do ~5,000,000 records (1,000 * 10,000 / 2). We could squeeze more out by batching up more than the page size.

Ok, so scratch this - I think you can't access Durable Object storage in this way, and it's also not shared by all objects.

alanshaw commented 2 years ago

Ok, so why don't we do 4 (alarms)? But first we need to:

  1. Check that when a DO migration occurs that alarms remain intact.

If they don't then we need:

francois-potato commented 2 years ago

I can confirm:

I tested this with a DO that updates a counter every minute. After a class migration and deploying new versions, the counter was never reset to zero.

It was also possible to cancel an alarm by deploying a new alarm handler that performed no operation and did not reschedule the alarm.

// Types like DurableObjectState come from @cloudflare/workers-types.
// The SECONDS constant and jsonResponse helper weren't included in the
// original snippet, so minimal versions are assumed here:
const SECONDS = 1000 // milliseconds per second

interface Env {}

function jsonResponse (body: string, status: number): Response {
  return new Response(body, {
    status,
    headers: { 'Content-Type': 'application/json' }
  })
}

export class AlarmCounter2 {
  state: DurableObjectState
  env: Env

  constructor (state: DurableObjectState, env: Env) {
    this.state = state
    this.env = env
  }

  async fetch (request: Request) {
    // Report whether an alarm is currently scheduled, and schedule one if not.
    const currentAlarm = await this.state.storage.getAlarm()
    const alarmIsActive = Boolean(currentAlarm)
    if (currentAlarm === null) {
      await this.state.storage.setAlarm(Date.now() + 60 * SECONDS)
    }

    const value: number | undefined = await this.state.storage.get('value')

    const data = {
      alarmIsActive,
      value,
      version: 3
    }

    return jsonResponse(JSON.stringify(data), 200)
  }

  async alarm () {
    // Increment the counter once a minute so we can tell whether alarms
    // (and storage) survive a class migration / new deployment.
    const value: number | undefined = await this.state.storage.get('value')
    if (value === undefined) {
      await this.state.storage.put('value', 0)
    } else {
      await this.state.storage.put('value', value + 1)
    }

    // Reschedule ourselves for another minute from now.
    await this.state.storage.deleteAlarm()
    await this.state.storage.setAlarm(Date.now() + 60 * SECONDS)
  }
}