nodejs / release-cloudflare-worker

Infra for serving Node.js downloads and documentation.
https://nodejs.org/dist
MIT License

Optimizing directory listing #114

Open flakey5 opened 2 months ago

flakey5 commented 2 months ago

Issue

Directory listing through R2's S3 API isn't very performant, especially compared to nginx on the DO server. For /download/release/, R2 takes ~3 sec uncached while nginx takes ~1 sec uncached. Most requests will be cached, so there shouldn't be any noticeable impact, but it's still not great imo.

Proposal

Cache every path in the bucket in a JSON file that we can use similarly to how we use redirectLinks.json.

The structure of the JSON file would look something like this:

interface Directory {
  // Directories within this directory
  directories?: Record<string, Directory>;
  // Files within this directory
  files?: string[];
}

So, a path like nodejs/release/vX.X.X/node.exe would be stored as:

{
  "directories": {
    "nodejs": {
      "directories": {
        "release": {
          "directories": {
            "vX.X.X": {
              "files": ["node.exe"]
            }
          }
        }
      }
    }
  }
}
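As a rough illustration of how absolute paths could be folded into this structure and then looked up, here is a minimal sketch. The helper names `addPath` and `findDirectory` are hypothetical and are not taken from the worker's codebase or from the script mentioned in this issue.

```typescript
interface Directory {
  // Directories within this directory
  directories?: Record<string, Directory>;
  // Files within this directory
  files?: string[];
}

// Hypothetical helper: insert one file path into the tree,
// creating intermediate directories as needed.
function addPath(root: Directory, path: string): void {
  const segments = path.split("/").filter((s) => s.length > 0);
  const fileName = segments.pop();
  if (fileName === undefined) return; // empty path, nothing to add

  let current = root;
  for (const segment of segments) {
    const children = (current.directories ??= {});
    current = (children[segment] ??= {});
  }
  const files = (current.files ??= []);
  if (!files.includes(fileName)) files.push(fileName);
}

// Hypothetical helper: walk the tree to the directory at `path`.
// Returns undefined when the directory is not in the tree.
function findDirectory(root: Directory, path: string): Directory | undefined {
  let current: Directory | undefined = root;
  for (const segment of path.split("/").filter((s) => s.length > 0)) {
    current = current?.directories?.[segment];
    if (current === undefined) return undefined;
  }
  return current;
}

// Usage: build the tree from the example path above.
const exampleTree: Directory = {};
addPath(exampleTree, "nodejs/release/vX.X.X/node.exe");
```

A lookup costs one object access per path segment, so it depends on the depth of the path rather than on the total number of paths in the bucket.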

I made a script that generated ~29,000 absolute paths and converted them to the data structure shown above. I searched for three different paths and timed the results with console.time: [image: console.time results]

So, from 3 seconds down to 0.1 seconds for a cold start and ~0.01 seconds when hot.

There is a drawback, however: 29,000 paths isn't the full number of paths in the bucket, and the JSON file is already 1.5 MB. According to the Cloudflare docs the worker should have a size limit of 10 MB, but I don't know how big the final tree will be. It will also only grow with each new release.

One alternative is to do this only for the most popular directories as a fast path; if a directory doesn't exist in the tree, we send the listing request to the S3 API like we do currently.
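The fast-path idea could look roughly like the sketch below. It assumes the existing S3 listing can be wrapped in a function that takes a path and resolves to a list of entry names; `ListFromS3` and `listDirectory` are hypothetical names for illustration, not the worker's actual API.

```typescript
interface Directory {
  // Directories within this directory
  directories?: Record<string, Directory>;
  // Files within this directory
  files?: string[];
}

// Assumed shape of the existing (slow) S3 listing call; hypothetical.
type ListFromS3 = (path: string) => Promise<string[]>;

// Try the in-memory tree first; fall back to S3 when the
// requested directory is not part of the tree.
async function listDirectory(
  tree: Directory,
  path: string,
  listFromS3: ListFromS3,
): Promise<string[]> {
  let node: Directory | undefined = tree;
  for (const segment of path.split("/").filter((s) => s.length > 0)) {
    node = node?.directories?.[segment];
    if (node === undefined) break;
  }

  if (node !== undefined) {
    // Fast path: subdirectories (with a trailing slash) followed by files.
    const dirs = Object.keys(node.directories ?? {}).map((name) => `${name}/`);
    return [...dirs, ...(node.files ?? [])];
  }

  // Slow path: directory is not cached in the tree.
  return listFromS3(path);
}
```

With this shape, the tree only needs to hold the popular directories, which keeps the JSON file small; everything else behaves as it does today.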

MoLow commented 2 months ago

SGTM. I wonder if this isn't just another cache layer, but if the implementation isn't too complex it can be OK.

flakey5 commented 1 month ago

Holding off on this till the provider concept is fully implemented so I can have a better idea of how it can be implemented nicely