nodejs / build

Better build and test infra for Node.

Integrity checks for R2 migration #3469

Open UlisesGascon opened 1 year ago

UlisesGascon commented 1 year ago

TL;DR

We will change the way we serve the binaries, so we want to ensure that the binaries are properly migrated. Additionally, we can take this opportunity to add some scripts (potentially GitHub Actions) that we can use to check that the binaries are intact and the releases are correct.

Historical Context

We have been suffering from cache problems for a while:

It seems like the long-term solution will be to relocate the binaries to R2:

Implementation

I started building a simple GitHub Action that collects all the releases and generates the URLs for all the available binaries. It then performs a basic HTTP request using curl to check the response headers. After that, it generates some metrics from the results and presents a simple report in Markdown format.
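For illustration, a minimal sketch of this kind of header check (the base URL, version, and file names below are assumptions for the example, not the actual workflow code, which derives the asset list from the releases index):

```ts
// check-headers.ts -- minimal sketch of a HEAD-based availability check (Node 18+).
// BASE_URL and the asset list are illustrative assumptions, not the real workflow code.
const BASE_URL = 'https://nodejs.org/dist';

// In the real action the version/asset list is generated from the releases index;
// a couple of hard-coded examples stand in for it here.
const assets = [
  'v20.9.0/node-v20.9.0-linux-x64.tar.gz',
  'v20.9.0/SHASUMS256.txt',
];

async function checkAsset(path: string) {
  const url = `${BASE_URL}/${path}`;
  const res = await fetch(url, { method: 'HEAD' });
  return {
    url,
    status: res.status,
    contentLength: res.headers.get('content-length'),
    // Cache-related headers are what the cache-purge investigation cares about.
    cacheStatus: res.headers.get('cf-cache-status') ?? res.headers.get('x-cache'),
  };
}

async function main() {
  const results: Array<Awaited<ReturnType<typeof checkAsset>>> = [];
  for (const asset of assets) {
    results.push(await checkAsset(asset)); // sequential on purpose, to keep server load low
  }
  // The real report is rendered as Markdown; console.table is enough for a sketch.
  console.table(results);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```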

While presenting this proof of concept in Slack, the collaborators provided super useful feedback and suggested features that we can implement.

Current approach

The idea of using a cron job to collect availability metrics may not be very effective for the cache-issues scenario, but there are many features that can still be valuable to us.

Features requested/ideas

I will request to transfer the repo to the Node.js org when the code is stable and documented; currently the code is quite hacky.

Next steps

I have started to consolidate the feedback into issues:

Discovery

Some things bubbled to the surface while implementing the systematic checks:

richardlau commented 1 year ago

While I appreciate the effort, I have some concerns.

I think you're trying to check two separate issues:

  1. The integrity of the files. e.g. are the SHASUMS properly signed and do the files match the SHAs?
  2. Whether the URL(s)/webserver is responding.

We currently do a very limited version of 1. in validate-downloads, which only checks the binaries for the most recent versions of Node.js 16, 18 and 20 using jenkins/download-test.sh. It runs once per day (or on demand if manually run in Jenkins).
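For context, that kind of check boils down to recomputing a digest and comparing it against the published SHASUMS256.txt entry. A rough sketch of the idea (this is not the actual jenkins/download-test.sh; the version and file name are assumptions):

```ts
// verify-sha.ts -- rough sketch of checking one asset against SHASUMS256.txt (Node 18+).
// Not the actual jenkins/download-test.sh; the version and file name are assumptions.
import { createHash } from 'node:crypto';

const DIST = 'https://nodejs.org/dist/v20.9.0'; // illustrative only
const FILE = 'node-v20.9.0-linux-x64.tar.gz';   // illustrative only

async function main() {
  // SHASUMS256.txt lines look like: "<sha256>  <filename>"
  const shasums = await (await fetch(`${DIST}/SHASUMS256.txt`)).text();
  const line = shasums.split('\n').find((l) => l.trim().endsWith(FILE));
  if (!line) throw new Error(`${FILE} not listed in SHASUMS256.txt`);
  const expected = line.trim().split(/\s+/)[0];

  // Download the asset and recompute its SHA-256.
  const body = Buffer.from(await (await fetch(`${DIST}/${FILE}`)).arrayBuffer());
  const actual = createHash('sha256').update(body).digest('hex');

  if (actual !== expected) {
    throw new Error(`SHA mismatch for ${FILE}: expected ${expected}, got ${actual}`);
  }
  console.log(`${FILE}: SHA-256 matches SHASUMS256.txt`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```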

Cases where the files do not match the SHAs published in the SHASUMS:

For 2., we already know that we have cache purge issues that can affect any number of the download URLs. Checking every existing asset URL over HTTP would add extra monitoring that contributes negatively to the server load (even if we retrieve just the headers, connections still need to be made to the server).

I started building a simple GitHub Action that collects all the releases and generates the URLs for all the available binaries. It then performs a basic HTTP request using curl to check the response headers.

I hope this has rate limiting implemented -- this will be hundreds of files/HTTP requests.
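One possible way to keep the request volume manageable is to group the HEAD checks into small batches with a pause between them. A sketch only; the batch size and delay are assumptions, not values agreed in this thread:

```ts
// throttle.ts -- small illustrative helper for spacing out HTTP checks (Node 18+).
// BATCH_SIZE and DELAY_MS are assumptions, not figures from this discussion.
const BATCH_SIZE = 10;
const DELAY_MS = 1000;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function throttledHead(urls: string[]) {
  const results: { url: string; status: number }[] = [];
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    const responses = await Promise.all(
      batch.map(async (url) => ({ url, status: (await fetch(url, { method: 'HEAD' })).status })),
    );
    results.push(...responses);
    // Pause between batches so the server is never hit with a burst of requests.
    if (i + BATCH_SIZE < urls.length) await sleep(DELAY_MS);
  }
  return results;
}
```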

UlisesGascon commented 1 year ago

Thanks a lot for the feedback @richardlau! :)

We currently do a very limited version of 1. in validate-downloads, which only checks the binaries for the most recent versions of Node.js 16, 18 and 20 using jenkins/download-test.sh. It runs once per day (or on demand if manually run in Jenkins).

I was not aware of this job, and it basically covers a lot of the things that I was expecting to cover, so that's fewer things on my to-do list. šŸ‘

Cases where the files do not match the SHAs published in the SHASUMS:

Only one case is relevant here: the infrastructure has been compromised and a malicious actor has tampered with the files.

We can check if the shasum files were modified. I already collect and update them when new releases are added. You can find them here. Then I can check if any of the checksums have changed and/or if the signatures are valid (in the case of additions, i.e. new releases).

This way, we ensure that immutability is still in place and there is no tampering with new additions. The number of HTTP requests is quite low because the binary checksums are collected from the SHASUMS files; the script only downloads the SHASUMS files.

This can be a weekly job, executed on the weekends.
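As a sketch of what such an immutability check could look like (the local path, the version, and the layout of the stored copies are assumptions; the real script keeps its own collection of SHASUMS files):

```ts
// shasums-immutability.ts -- sketch of comparing a stored SHASUMS256.txt copy
// against the one currently served (Node 18+). Paths and version are assumptions.
import { readFile } from 'node:fs/promises';

const VERSION = 'v20.9.0';                                   // illustrative only
const STORED_COPY = `./shasums/${VERSION}/SHASUMS256.txt`;   // hypothetical local archive

async function main() {
  const stored = await readFile(STORED_COPY, 'utf8');
  const live = await (await fetch(`https://nodejs.org/dist/${VERSION}/SHASUMS256.txt`)).text();

  if (stored !== live) {
    // Any difference for an already-released version is a red flag:
    // published SHASUMS files are expected to be immutable.
    console.error(`SHASUMS256.txt for ${VERSION} differs from the stored copy!`);
    process.exit(1);
  }
  console.log(`SHASUMS256.txt for ${VERSION} is unchanged.`);
  // For new releases, the detached signature (SHASUMS256.txt.asc) would also be
  // verified, e.g. by invoking gpg; omitted here to keep the sketch short.
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```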

For 2., we already know that we have cache purge issues that can affect any number of the download URLs. Checking every existing asset URL over HTTP would add extra monitoring that contributes negatively to the server load (even if we retrieve just the headers, connections still need to be made to the server).

I hope this has rate limiting implemented -- this will be hundreds of files/HTTP requests.

It ran during the weekends for a while, and I have already removed the cron trigger. However, it can still be executed manually, either on a local machine or by triggering the workflow in GitHub. I believe we can use this script for the R2 migration to ensure that all the binaries are transferred and that all the URLs are functioning correctly. Please note that the script only checks the headers and closes the connection; it does not attempt to download the binaries.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

MattIPv4 commented 4 months ago

I don't believe this is stale, these checks will still be crucial once the R2 migration is completed.

flakey5 commented 4 months ago

I don't believe this is stale, these checks will still be crucial once the R2 migration is completed.

+1