storacha / w3name

IPNS client and service for generating, updating, and tracking immutable names with signed updates

Implement 24h complete rebroadcast of IPNS records to the DHT #5

Closed mbommerez closed 2 years ago

mbommerez commented 2 years ago

Context:

w3name will rebroadcast IPNS records to the DHT every 24 hours, so users don't have to do it themselves.

Scope of this ticket:

Acceptance criteria:

mbommerez commented 2 years ago

We might need to split this one into 2 parts:

adamalton commented 2 years ago

After some experimenting and exploring, here's the low-down and a plan:

The good news

The limitations

Possible solutions

1. External cron trigger which relays the cursor

We could create a cron job on the ipns-publisher side, or in GH Actions, which calls the CF Worker. The CF Worker would then process some of the DOs, and would return the cursor for the DO IDs query back so that ipns-publisher/GH could then call the CF Worker again to continue where it left off.
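
For illustration, that relay might look roughly like this (a sketch only - listRecordKeys and publishRecords are hypothetical stand-ins for the DO IDs query and the call out to ipns-publisher):

// Sketch only: listRecordKeys and publishRecords are hypothetical stand-ins
// for the DO IDs query and the call out to ipns-publisher.
interface Env {}

declare function listRecordKeys (env: Env, cursor?: string): Promise<{ keys: string[], nextCursor?: string }>
declare function publishRecords (env: Env, keys: string[]): Promise<void>

export default {
  async fetch (request: Request, env: Env): Promise<Response> {
    // The external cron (ipns-publisher or GH Actions) passes back the cursor
    // it got from the previous response, or nothing on the first call.
    const cursor = new URL(request.url).searchParams.get('cursor') ?? undefined

    const { keys, nextCursor } = await listRecordKeys(env, cursor)
    await publishRecords(env, keys)

    // Hand the cursor back out so the caller can call us again to continue.
    return new Response(JSON.stringify({ nextCursor }), {
      headers: { 'Content-Type': 'application/json' }
    })
  }
}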

I don't like this because (1) it's leaking the implementation detail of the DO IDs query out to another system, and (2) as it wouldn't use a CF cron job, the Worker would be limited to 30 seconds of CPU rather than the 15 minutes that a cron job has. Maybe that would be enough given the 499/998 limit, but it doesn't feel great.

2. "Tree cron"

We could have a CF cron job which fetches the DO IDs, and then rather than processing them directly, makes a fetch call to the worker passing a batch of (say 995) DO IDs for it to process. Assuming that the subrequests limit doesn't cascade down, this would allow the top-level cron job to do just under 1000 batches, and then each worker request could process almost 500/1000 DOs, giving us a total capacity of 995*1000= 995,000 records. We could even extend the tree to multiple levels, allowing for greater capacity.
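
Roughly, the fan-out could look like this (a sketch only - listAllDOIds, processDO and WORKER_URL are illustrative names, not real code):

// Sketch of the "tree cron" fan-out. listAllDOIds and processDO are
// hypothetical stand-ins; WORKER_URL is whatever URL routes back to this worker.
interface Env { WORKER_URL: string }

declare function listAllDOIds (env: Env): Promise<string[]>
declare function processDO (env: Env, id: string): Promise<void>

const BATCH_SIZE = 995 // stay just under the ~1,000 subrequest limit per invocation

export default {
  // Top level: the CF cron trigger. It only makes one subrequest per batch,
  // so it can dispatch just under 1,000 batches.
  async scheduled (controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    const ids = await listAllDOIds(env)
    for (let i = 0; i < ids.length; i += BATCH_SIZE) {
      const batch = ids.slice(i, i + BATCH_SIZE)
      await fetch(env.WORKER_URL, { method: 'POST', body: JSON.stringify(batch) })
    }
  },

  // Second level: each request processes one batch of DO IDs directly.
  async fetch (request: Request, env: Env): Promise<Response> {
    const batch: string[] = await request.json()
    for (const id of batch) {
      await processDO(env, id)
    }
    return new Response('ok')
  }
}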

Better, but not wonderful. Still feels like we're fighting the limits.

3. Store the cron state

We can get around all of these issues if we simply have something in CF which stores where we've got to. Using a DO would suffice. So we'd store something like:

{
    "last_ran_at": "2022-07-12T11:19:17.427Z",
    "reached_last_record": false,
    "cursor_of_next_batch": "abc...xyz",
    "worker_in_progress": false
}

We could then have a CF cron job which runs every 5 minutes/1 hour/whatever, which fetches that state object, has a look at it, and decides what to do (e.g. process next batch). This would allow us to avoid being bitten by any of the limits and without having to pass query cursors out to an external system or have a convoluted tree of tasks of tasks.
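
As a sketch (fetchState, saveState and processBatch are hypothetical helpers wrapping the state DO and the batch publishing), the scheduled handler could look something like:

// Sketch of option 3: a cron trigger that reads the shared state object,
// decides what to do, and records where it got to.
interface CronState {
  last_ran_at: string
  reached_last_record: boolean
  cursor_of_next_batch?: string
  worker_in_progress: boolean
}

interface Env { CRON_STATE: DurableObjectNamespace }

declare function fetchState (env: Env): Promise<CronState>
declare function saveState (env: Env, state: CronState): Promise<void>
// Processes one batch of records and returns the cursor for the next batch,
// or undefined once we've reached the last record.
declare function processBatch (env: Env, cursor?: string): Promise<string | undefined>

export default {
  async scheduled (controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    const state = await fetchState(env)

    // Don't start a second pass while one is still running.
    if (state.worker_in_progress) return

    await saveState(env, { ...state, worker_in_progress: true, last_ran_at: new Date().toISOString() })

    const nextCursor = await processBatch(env, state.cursor_of_next_batch)

    await saveState(env, {
      last_ran_at: new Date().toISOString(),
      reached_last_record: nextCursor === undefined,
      cursor_of_next_batch: nextCursor,
      worker_in_progress: false
    })
  }
}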

The main potential pitfalls I can see here are:

4. Durable Object Alarms

This is probably easier to implement, but also easier to get ourselves stuck in a corner with.

When we create each DO we would call setAlarm() on it, scheduling it for 12 hours' time. Each time the alarm is run, we would publish the record to ipns-publisher and then schedule the alarm for another 12 hours. Easy. The downsides are:

Decision

IMO it's between 3 and 4. The ease of solution 4 is quite compelling, but I think there's a high chance we might end up implementing 3 at some point in order to solve one of the mentioned pitfalls. So I think it comes down to whether we want to take the quick win now with the risk of doing more work later, or go for the more time consuming but more robust solution now.

Side note: We should add a lastRepublished attribute to the IPNSRecord DO to keep track, which will be handy regardless of which solution we choose.
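
For example (a sketch - the function names and storage key are illustrative), whatever code publishes a record to ipns-publisher could also do:

// Sketch only: inside the IPNSRecord DO, after a successful republish,
// store when it happened so a later pass can skip recently-published records.
async function recordRepublish (state: DurableObjectState): Promise<void> {
  await state.storage.put('lastRepublished', new Date().toISOString())
}

async function getLastRepublished (state: DurableObjectState): Promise<string | undefined> {
  return state.storage.get<string>('lastRepublished')
}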

alanshaw commented 2 years ago

I talked this over with François; alarms are IMO a bit risky - I don't know what happens when you migrate the DO to a new version - all existing instances and (I assume) alarms would be cleared out and we'd need a way to reinstantiate them.

I'm not super keen on cron state, but I do believe a cron is the simplest thing we can do to get this working, and it would allow us to publish around 5,000,000 IPNS keys (which should keep us going for a while). So specifically:

  1. Create a cron that triggers every 12 hours
  2. Query DO storage to fetch a page of 10,000 items
  3. POST to a batch ipns-publisher endpoint for publishing records
  4. Goto 2

You get 1,000 subrequests, so you can do ~5,000,000 records (1,000 * 10,000 / 2). We could squeeze more out by batching up more than the page size.

Ok, so scratch this - I think you can't access Durable Object storage in this way, and it's also not shared by all objects.

alanshaw commented 2 years ago

Ok, so why don't we do 4 (alarms)? But first we need to:

  1. Check that when a DO migration occurs that alarms remain intact.

If they don't then we need:

francois-potato commented 2 years ago

I can confirm:

I tested this with a DO that updates a counter every minute. After a class migration and deploying new versions, the counter was never reset to zero.

It was also possible to cancel an alarm by deploying a new alarm handler that performed no operation and did not reschedule the alarm.

// Types like DurableObjectState come from @cloudflare/workers-types.
// The SECONDS constant and jsonResponse helper weren't included in the
// original snippet, so minimal versions are assumed here:
const SECONDS = 1000 // milliseconds per second

interface Env {}

function jsonResponse (body: string, status: number): Response {
  return new Response(body, {
    status,
    headers: { 'Content-Type': 'application/json' }
  })
}

export class AlarmCounter2 {
  state: DurableObjectState
  env: Env

  constructor (state: DurableObjectState, env: Env) {
    this.state = state
    this.env = env
  }

  async fetch (request: Request) {
    // Report whether an alarm is currently scheduled, and schedule one if not.
    const currentAlarm = await this.state.storage.getAlarm()
    const alarmIsActive = Boolean(currentAlarm)
    if (currentAlarm === null) {
      await this.state.storage.setAlarm(Date.now() + 60 * SECONDS)
    }

    const value: number | undefined = await this.state.storage.get('value')

    const data = {
      alarmIsActive,
      value,
      version: 3
    }

    return jsonResponse(JSON.stringify(data), 200)
  }

  async alarm () {
    // Increment the counter once a minute so we can tell whether alarms
    // (and storage) survive a class migration / new deployment.
    const value: number | undefined = await this.state.storage.get('value')
    if (value === undefined) {
      await this.state.storage.put('value', 0)
    } else {
      await this.state.storage.put('value', value + 1)
    }

    // Reschedule ourselves for another minute from now.
    await this.state.storage.deleteAlarm()
    await this.state.storage.setAlarm(Date.now() + 60 * SECONDS)
  }
}