grpc / grpc-web

gRPC for Web Clients
https://grpc.io
Apache License 2.0

ci: add support for linux-arm64 build #1214

Closed strophy closed 2 years ago

strophy commented 2 years ago

This PR (my first here!) adds a GitHub Actions workflow to build linux-arm64 artifacts in CI, partially addressing some of the issues raised in #1159.

While writing this I was unable to use the (deprecated) docker-compose, the newer docker compose, or the docker bake system to build a compose file with depends_on. I think this may be due to different behaviour between Docker classic builds and buildx, but buildx is required for multi-arch images. As a result, I had to build and push each Dockerfile explicitly and in sequence, so running the action will now also push updated prereq and protoc_plugin images to Docker Hub. Together with running in an emulator, this is quite slow (although caching improves subsequent runs), so I'm open to suggestions for improvements. It might also make sense to separate building the Docker images from building a release, depending on how often you release.
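A minimal sketch of that sequential approach as a GitHub Actions job is below. The action names (docker/setup-qemu-action, docker/setup-buildx-action, docker/build-push-action) are the standard Docker-maintained actions; the job name, Dockerfile paths, and tags are illustrative assumptions, not the PR's actual workflow.

```yaml
# Hypothetical sketch: build and push each image explicitly, in order,
# so protoc_plugin sees the freshly built prereqs image on Docker Hub.
jobs:
  build-arm64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2    # arm64 emulation on an x86 runner
      - uses: docker/setup-buildx-action@v2  # buildx is required for multi-arch
      - name: Build and push prereqs first
        uses: docker/build-push-action@v3
        with:
          file: ./net/grpc/gateway/docker/prereqs/Dockerfile  # assumed path
          platforms: linux/arm64
          push: true
          tags: grpcweb/prereqs:latest
      - name: Build protoc_plugin only after prereqs is pushed
        uses: docker/build-push-action@v3
        with:
          file: ./net/grpc/gateway/docker/protoc_plugin/Dockerfile  # assumed path
          platforms: linux/arm64
          push: true
          tags: grpcweb/protoc-plugin:latest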

Several underlying versions and dependencies had to be bumped in order to build successfully on arm64, and I switched from bazel to bazelisk, which was also necessary for the build.

Dependency Updates:

linux-foundation-easycla[bot] commented 2 years ago

CLA Signed

The committers listed above are authorized under a signed CLA.

strophy commented 2 years ago

I believe this PR can now be improved with the techniques described here: https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/

Before I do this, is this project still maintained? Is this feature even wanted or likely to be merged? Just checking before I do any more work, since I haven't seen any interaction here.

sampajano commented 2 years ago

@strophy Thanks so much for the contrib! Apologies on the delay -- i saw this PR a while ago but i was unfamiliar with many of the concepts / tools used here so i didn't get around to it until now. Sorry again!

Pardon my ignorance, but i wonder how popular linux-arm64 is? What are its most common use cases?

In general, we're aiming to provide pre-built binaries for the most popular platforms, so i'm not sure we're willing to pay the cost of maintaining less popular ones, especially if the complexity is high.

There are a few things in this PR that i'm not too sure about.. and i wonder if they can be addressed somehow:

  1. It complicates the workflow by not using docker compose -- that might be addressed by creating a completely new workflow rather than reusing the existing one (and making it an experimental one).

  2. It uses some seemingly less official packages (e.g. tonistiigi/binfmt) and requires temporary hacks like https://github.com/bazelbuild/bazel/issues/11379 -- again suggesting the support is maybe too cutting-edge.

  3. It updates many dependency versions (e.g. buildpack-deps, node, buildifier), which i'm not sure are all necessary and desired... (they may be, but again it shows that by supporting linux-arm64 we're moving a lot closer to the cutting edge.. :))

Also, i'm curious how this PR relates to https://github.com/grpc/grpc-web/issues/1159 as you've mentioned in the PR description. Do you think this will be a progressive step toward getting us M1 support? (and if so, how big is the gap there?)

In any case, thanks so much for the PR again!


Btw, just ran CI and it's failing early. Could you take a look at the failure (link)? Thanks!

strophy commented 2 years ago

Thanks for taking a look at this! I agree with all of the points you raised; this PR currently has too many hacks to merge. The point you raise in (1) above is the key issue though: in my testing, the current docker-compose or docker compose based build system also does not work, so I'm curious how you used it to build successfully in the past. I am seeing the following problem:

  1. Clone the project and run docker compose build prereqs protoc-plugin
  2. We would expect Docker Compose to respect the depends_on statements in docker-compose.yml, but this is not the case: depends_on is only respected by the up and stop commands and is ignored by the build command. Documentation. Discussion.
  3. The result is that instead of waiting for grpcweb/prereqs to be built, Compose immediately pulls the existing old version from Docker Hub and uses it to start the grpcweb/protoc-plugin build in parallel with grpcweb/prereqs. This causes my build to fail, because the Docker Hub version of the image is incompatible with the changes my build requires.
  4. Instead, if I "manually" target the steps in two commands like docker compose build prereqs && docker compose build protoc-plugin then the build is successful, because grpcweb/prereqs will exist in the locally available images when the second command runs.
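The shape of the compose file being discussed is roughly the following sketch. The service and image names come from the thread; the Dockerfile paths and other details are illustrative assumptions.

```yaml
# Condensed, hypothetical docker-compose.yml fragment.
services:
  prereqs:
    image: grpcweb/prereqs
    build:
      context: .
      dockerfile: net/grpc/gateway/docker/prereqs/Dockerfile  # assumed path
  protoc-plugin:
    image: grpcweb/protoc-plugin
    depends_on:
      - prereqs   # honored by `up`/`stop`, ignored by `build`
    build:
      context: .
      dockerfile: net/grpc/gateway/docker/protoc_plugin/Dockerfile  # assumed path
```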

This is why I separated the build CI into steps and stopped using Docker Compose. Can you clarify if this is expected behavior for you? Have you been able to get docker compose build to respect the depends_on statements somehow? If not, what do you think about using the following approach as a way forward: https://github.com/docker/buildx/blob/master/docs/reference/buildx_bake.md#using-a-result-of-one-target-as-a-base-image-in-another-target
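The buildx bake approach linked above expresses the build-time dependency directly: one target's result can be named as a build context for another. A minimal sketch, with target names and Dockerfile paths assumed for illustration:

```hcl
# Hypothetical docker-bake.hcl fragment. The `contexts` entry tells bake
# to build the prereqs target first and use its result wherever the
# protoc-plugin Dockerfile references the grpcweb/prereqs image.
target "prereqs" {
  dockerfile = "net/grpc/gateway/docker/prereqs/Dockerfile"
  tags       = ["grpcweb/prereqs:latest"]
}

target "protoc-plugin" {
  dockerfile = "net/grpc/gateway/docker/protoc_plugin/Dockerfile"
  contexts = {
    "grpcweb/prereqs" = "target:prereqs"
  }
  tags = ["grpcweb/protoc-plugin:latest"]
}
```

Unlike depends_on, this ordering is enforced at build time, so the stale Docker Hub image is never pulled.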

strophy commented 2 years ago

Regarding the usefulness of linux/arm64: this will make it possible to run images that depend on grpc-web binaries on M1 Macs in a Docker environment without emulation or Rosetta. It is also frequently used in AWS datacentres, where the Graviton 1, 2 and 3 families of CPUs are available for Linux workloads at roughly a 40% price saving over x86. Finally, devices like the Raspberry Pi and other low-power ARM boards are increasingly prevalent.


strophy commented 2 years ago

Hello, I have refactored the PR to use docker bake and an .hcl file instead of docker compose and a docker-compose.yml file, since I believe the build logic used in the past incorrectly assumed that Docker Compose would respect depends_on when building (it does not). The build logic now actually waits for the prereqs stage to complete, instead of pulling a version based on a previous build from Docker Hub. We also now build in a matrix, so x86_64 and arm64 builds take place on separate runners, which greatly simplifies the GitHub Actions workflow.
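The matrix split described above might look roughly like this sketch. The job name, matrix values, and bake target are illustrative assumptions, not the PR's actual workflow.

```yaml
# Hypothetical sketch: one runner per architecture, each invoking bake
# for its own platform, instead of one runner building both.
jobs:
  build-plugin:
    strategy:
      matrix:
        platform: [linux/amd64, linux/arm64]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2    # only needed for the arm64 leg
      - uses: docker/setup-buildx-action@v2
      - name: Build one architecture on this runner
        run: docker buildx bake --set "*.platform=${{ matrix.platform }}" protoc-plugin
```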

I have documented the various build errors that occur if I do not bump the dependency versions. It might be possible to resolve some of these in another way, but I am tempted to just use newer versions that take non-x86 targets into account rather than debug dependencies.

I still occasionally face the following issues:

sampajano commented 2 years ago

Hi thanks for the reply!

> Thanks for taking a look at this! I agree with all of the points you raised; this PR currently has too many hacks to merge. The point you raise in (1) above is the key issue though: in my testing, the current docker-compose or docker compose based build system also does not work, so I'm curious how you used it to build successfully in the past. I am seeing the following problem:
>
> 1. Clone the project and run docker compose build prereqs protoc-plugin
> 2. We would expect Docker Compose to respect the depends_on statements in docker-compose.yml, but this is not the case: depends_on is only respected by the up and stop commands and is ignored by the build command. Documentation. Discussion.

Oh aha i see!! Thanks for providing the context!

Actually, historically i've always been using docker-compose (instead of docker compose) for both build and up, and as i tried it just now, it seems docker compose behaves significantly differently in that it no longer respects the depends_on statement, as you've pointed out.

This is actually an interesting point, because i've noticed the mismatch between build-time vs. runtime dependencies before, but i didn't know of an alternative way of specifying a build-time dependency other than depends_on.. (guess that won't work now either 🙃)

> 1. The result is that instead of waiting for grpcweb/prereqs to be built, Compose immediately pulls the existing old version from Docker Hub and uses it to start the grpcweb/protoc-plugin build in parallel with grpcweb/prereqs. This causes my build to fail, because the Docker Hub version of the image is incompatible with the changes my build requires.
> 2. Instead, if I "manually" target the steps in two commands like docker compose build prereqs && docker compose build protoc-plugin then the build is successful, because grpcweb/prereqs will exist in the locally available images when the second command runs.
>
> This is why I separated the build CI into steps and stopped using Docker Compose. Can you clarify if this is expected behavior for you? Have you been able to get docker compose build to respect the depends_on statements somehow? If not, what do you think about using the following approach as a way forward: https://github.com/docker/buildx/blob/master/docs/reference/buildx_bake.md#using-a-result-of-one-target-as-a-base-image-in-another-target

I really appreciate you finding an alternative solution using buildx here! Although i'm a bit hesitant to do this across the board (rather than just in this workflow), because we have documentation on various sites that we'd need to update too (example) if we decide to do that, and i'm not yet convinced that docker buildx bake + HCL is broadly adopted enough for that purpose.. (maybe it is? :))

For that reason, may I ask you to "fork" the workflow into a separate make-plugin-linux-arm64.yml config? So that it remains a bit "experimental" until we eventually decide to converge? Thanks!

strophy commented 2 years ago

Thanks for reviewing! I've made the requested changes, but testing is incredibly frustrating due to random failures when running Bazelisk. I can't reproduce it reliably: it only occurs when building the whole file, while stopping the build just before the problematic step and then running the command directly inside the partially built image always works perfectly, so there is no easy way for me to read the contents of the error log. I think it is best if we pause here and wait for the release of Bazel 5.2.0, which should contain some fixes related to running in an emulator and allow us to remove some of the hacks in this PR.

I also think it's best if we stop working on caching solutions here and do that in a separate PR; since we can now build arm64 in a separate experimental workflow, speed is no longer a priority.

List of updated dependencies:

I'm unable to spend much more time working on this, and I don't know how your CI system (kokoro?) works. From the log output I can see it is using the Docker classic builder to process a Dockerfile that requires buildx, so some changes will be necessary here.

It's also interesting to note how building from a compose file will likely be handled in the future here; tl;dr: it will parse the compose YAML file, generate .hcl files, and delegate everything to buildx bake anyway.

sampajano commented 2 years ago

> Thanks for reviewing! I've made the requested changes, but testing is incredibly frustrating due to random failures when running Bazelisk. I can't reproduce it reliably: it only occurs when building the whole file, while stopping the build just before the problematic step and then running the command directly inside the partially built image always works perfectly, so there is no easy way for me to read the contents of the error log. I think it is best if we pause here and wait for the release of Bazel 5.2.0, which should contain some fixes related to running in an emulator and allow us to remove some of the hacks in this PR.

Oh ok i see!! Very sorry to hear about the frustrations! I'm perfectly fine to pause here and wait for the 5.2.0 release! 😃

> I also think it's best if we stop working on caching solutions here and do that in a separate PR; since we can now build arm64 in a separate experimental workflow, speed is no longer a priority.

Sounds great to me too! Speed is indeed not an issue here at all (since we only do this every x months really..) 😃

> List of updated dependencies:
>
> • All node base images now use node:16.15.0-bullseye
> • prepare stage now uses buildpack-deps:bullseye for build toolchain consistency, although I think this bump is not strictly necessary
> • Buildifier now at 5.1.0 because 1.0.0 does not support arm64
> • Bazel 4.1.0 replaced with Bazelisk 1.11.0 (installs Bazel 5.1.1, currently broken, wait for 5.2.0)
> • google-closure-compiler JS dependency bumped to ~20220301.0.0 to avoid build error
> • google-protobuf JS dependency bumped to 3.19.4 to match existing dependency used in prereqs image
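The base-image bumps in the list above would surface in the Dockerfiles roughly as follows. This is an illustrative fragment only: the stage names, paths, and the Bazelisk download URL layout are assumptions, not the PR's actual Dockerfile contents.

```dockerfile
# Hypothetical fragment showing the bumped bases and the bazel → bazelisk switch.
FROM node:16.15.0-bullseye AS node-base        # was an older node tag
FROM buildpack-deps:bullseye AS prepare        # toolchain consistency bump

# Bazelisk replaces a pinned Bazel 4.1.0 install; it fetches the Bazel
# version selected by .bazelversion (or USE_BAZEL_VERSION) at run time.
RUN curl -fsSL -o /usr/local/bin/bazelisk \
      https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-arm64 \
 && chmod +x /usr/local/bin/bazelisk
```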

Thanks so much for the list of deps! I'll edit the PR summary to add these 😃

> I'm unable to spend much more time working on this, and I don't know how your CI system (kokoro?) works. From the log output I can see it is using the Docker classic builder to process a Dockerfile that requires buildx, so some changes will be necessary here.

Do you think the Bazel 5.2.0 will help resolve the issue here?

I think our kokoro CI essentially just runs the basic and interop scripts. Do they work locally for you?

> It's also interesting to note how building from a compose file will likely be handled in the future here; tl;dr: it will parse the compose YAML file, generate .hcl files, and delegate everything to buildx bake anyway.

Oh wow! That's awesome! Would probably get a great improvement when it happens!

strophy commented 2 years ago

Hi @sampajano Bazel 5.2.0 was released with a lot of ARM fixes, but I am still seeing the same error (Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/94ede7e4be7ee112c06d5210902f7b7d/server/jvm.out')) when running in an emulator. I see you have already completed arm64 CI using Zig, which comes with an impressive speedup, so I think it's probably best to give up on Bazel for building arm64 for now! You can merge this as an alternative approach, or feel free to close this PR 😃

sampajano commented 2 years ago

> Hi @sampajano Bazel 5.2.0 was released with a lot of ARM fixes, but I am still seeing the same error (Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/94ede7e4be7ee112c06d5210902f7b7d/server/jvm.out')) when running in an emulator. I see you have already completed arm64 CI using Zig, which comes with an impressive speedup, so I think it's probably best to give up on Bazel for building arm64 for now! You can merge this as an alternative approach, or feel free to close this PR 😃

@strophy Thanks so much for checking again! Indeed the Zig solution is already checked in and seems to be a more self-contained change overall (and seems easy to apply to all existing platforms too). 😃

Thank you so much for the contributions here and iterations again! 😃 I'll close this PR for now but i think whatever discussion/change we have here would help provide a great context should we ever need to reevaluate our build method! 😃