golang/go

x/build/env/linux-mips: we have no Linux MIPS builders #31217

Closed — bradfitz closed this issue 4 years ago

bradfitz commented 5 years ago

All four of our MIPS variants have lost their builders and the email address of the old owner now bounces. (changed jobs, presumably)

So, we need to find a new owner, or remove the port per policy (https://github.com/golang/go/wiki/PortingPolicy#removing-a-port)

/cc @randall77 @cherrymui @ianlancetaylor @dmitshur

gopherbot commented 5 years ago

Change https://golang.org/cl/170444 mentions this issue: dashboard: update stale ownership info for now-dead MIPS builders

cherrymui commented 5 years ago

I have a MIPS64LE machine that I used to run as a builder (old-style builder, not buildlet). The machine is quite old and slow, so I stopped running it once the new builders were up. I could probably revive that machine, if desired. (But probably no earlier than next month.)

bradfitz commented 5 years ago

Alternatively, I could run it for you if you want to ship or inter-office mail it to me.

Do you remember which board it is?

cherrymui commented 5 years ago

It's a Loongson 2E box.

If I remember correctly it needs an awkward boot process, something like PXE with TFTP from another machine, because I screwed up something (probably part of the hard drive?) a long time ago. Once it boots, it should run OK.

I also have a Loongson 2F laptop, which is probably a little better. I think it can boot by itself. But the disk space is pretty small.

bradfitz commented 5 years ago

I see https://www.ebay.com/itm/VL-62851-Creator-Ci20-Linux-Android-MIPS-board/264216586949?hash=item3d84892ac5:g:3DIAAOSww35cdpsa&frcectupt=true on ebay that'll run Debian 8: https://www.elinux.org/CI20_Distros#Debian

So that's another option.

willglynn commented 5 years ago

@bradfitz Same board is available at a lower price and higher quantity elsewhere on eBay or from Mouser, a reputable electronics distributor.

minux commented 5 years ago

I used to run a linux/mips64 (BE) builder on EdgeRouter Lite. (For spec, see https://dl.ubnt.com/datasheets/edgemax/EdgeRouter_Lite_DS.pdf )

But it is slow and doesn't have enough RAM (only 512MB for dual core), and it eventually broke due to overheating.

bradfitz commented 5 years ago

@willglynn, thanks for the links! I've now bought all the ones I found on eBay. We'll see how those work out before I buy any more from Mouser.

omac777 commented 5 years ago

Alternatively, I could run it for you if you want to ship or inter-office mail it to me. Do you remember which board it is?

I have a six-core Loongson 3A board with 8GB RAM that used to run an old Fedora, but I blew that away while trying to bring Debian unstable up on it using mediatemple MIPS binaries. I didn't get much support from Loongson themselves because I'm not in China, and the Debian Loongson experts within QEMU never got back to me. I would need a new hard drive with bootable Loongnix installed to bring it back up far enough to run Go on it. It's not a typical board where you can just boot from a CD or a USB drive. I'm not willing to part with this $$$ board, but if you could help me bring the box up again, I'll give you root access to it via fiber. That would be fair.

CetinSert commented 5 years ago

Linux/MIPS is business-critical for us! I can get all the necessary hardware and set up a new repository for automated builds with unreleased versions of Go. Currently we use it for a fleet of IIoT devices with GOOS=linux GOARCH=mips, always with the latest release of Go (1.12 as of today). The hardware is all OpenWrt routers / router boards.

CetinSert commented 5 years ago

@bradfitz I have a doctor's appointment for the next few hours but am ready to take on the role of builder for all 4 MIPS variants. Please let me know what to do next; otherwise I will review the posts and documents and get in touch with a hardware/time plan. I expect getting all the necessary and future hardware (for the other 3 variants; we only use GOARCH=mips in automated builds) will take mere days, because I live in Korea, relatively close to Shenzhen.

bradfitz commented 5 years ago

@cetinsert, great to hear! How much RAM and what type of storage do those devices have? We tend to require 1GB of RAM and pretty decent storage (SD cards fail very quickly, so putting its working directory on a USB SSD disk is better). They also need to have moderately okay network, but if you're in South Korea you're probably good. :-)

https://golang.org/wiki/DashboardBuilders has more info. I can get you the necessary keys over email.

CetinSert commented 5 years ago

@bradfitz ok so we were not talking about cross-compiling for mips o__O (because that's what we do from linux-amd64, darwin-amd64 and windows-amd64)?

Let me review https://golang.org/wiki/DashboardBuilders!

bradfitz commented 5 years ago

@cetinsert, we kind of have support for that, at least partially. The builders support cross-compiling the make.bash step, but when running tests they still do on-device compiles of each test, which can still be big. We don't yet have support for cross-compiling the test binaries. Even if we did, that'd increase the network requirements a bit.
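As background on the cross-compiling being discussed: the Go toolchain cross-compiles purely via environment variables, no cross C toolchain needed (cgo is disabled for cross builds by default). A minimal sketch, not the builders' actual code, that writes a trivial program to a temp dir and builds it for linux/mips from whatever host runs it:

```go
// Sketch: cross-compiling a trivial program for linux/mips via GOOS/GOARCH.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func crossBuild(goos, goarch string) error {
	dir, err := os.MkdirTemp("", "xcompile")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "hello.go")
	prog := "package main\n\nimport \"fmt\"\n\nfunc main() { fmt.Println(\"hello, mips\") }\n"
	if err := os.WriteFile(src, []byte(prog), 0o644); err != nil {
		return err
	}

	cmd := exec.Command("go", "build", "-o", filepath.Join(dir, "hello"), src)
	// GOOS and GOARCH select the target; the host toolchain does the rest.
	cmd.Env = append(os.Environ(), "GOOS="+goos, "GOARCH="+goarch)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("build failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := crossBuild("linux", "mips"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cross-compiled linux/mips binary OK")
}
```

The asymmetry in the thread is exactly this: make.bash can run on a fast amd64 host this way, but the per-test compiles still happen on the MIPS device.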

CetinSert commented 5 years ago

@bradfitz I see VMs listed here: https://farmer.golang.org/builders#host-linux-mips. If VMs are OK, please send me the keys over email (mips@shin.asia)!

I have several Scaleway x86-64 machines (2 amd64 cores, 2GB RAM) with good network conditions and can set up QEMU and other essentials for MIPS emulation, or take over existing work in the form of VM images and host them from then on.

9nut commented 5 years ago

@bradfitz I can contribute a Ci20; I'll order one today. If it's time-critical, I have a MikroTik RB450G (256MB) and a VoCore2 (128MB) that I can send your way; probably only useful for testing. Incidentally, the RB450G runs Plan 9 via tftp/bootp -- if/when plan9/mips32 is available.

FlyGoat commented 5 years ago

@omac777

Alternatively, I could run it for you if you want to ship or inter-office mail it to me. Do you remember which board it is? I have a six-core Loongson 3a board 8GB RAM that used to have an old fedora running on it, but I

I assume you have a Loongson-3B1500, since the Loongson-3A only has four cores. The 3B1500 is an octa-core model, but two cores are disabled on some boards due to a hardware erratum.

blew it away while trying to bring debian unstable up onto it using mediatemple mips binaries onto it. I didn't get much support from loongson themselves because I'm not in China. I would need a new hard-drive with the bootable loongnix installed onto it to bring it back up and running enough to run golang on it. the debian loongson experts within qemu never got back to me. It's not a typical board where you can just boot with a cd or a usb drive. I'm not willing to part with this $$$ board, but if you could help me bring the box up again, I'll give you root access via fiber to it. That would be fair.

Could you please tell me the model of your board? It should be displayed in the firmware, or printed on the circuit board. Probably you just need to update PMON or klBIOS to make it boot from a USB stick.

And now, we have a Fedora28 Port: http://mirror.lemote.com:8000/fedora/fedora28-live/ You can try to write it to a USB stick and boot or even write it directly to your hard drive.

Anyway, though I'm not a Loongson employee, I'm familiar with Loongson devices and Loongson developers, email me if you need any help.

bradfitz commented 5 years ago

@9nut, thanks, Skip, but I have a few Ci20s on the way already. No need to buy another. The rb450g could be interesting, though, as that's a 32-bit BE CPU it seems? Not much RAM, but it'd give us a MIPS CPU variant I don't think we have.

I don't think there are any plans for more plan9 ports (@0intro?). But really the first priority is getting Linux back.

CetinSert commented 5 years ago

@bradfitz would QEMU or another mips emulator work?
If yes, I can dedicate an x86-64 host with 2GB RAM and SSD storage.

For a true mips host, we will be reviewing hardware options shortly.
Our mips-based target devices have nowhere near 1GB of RAM.

bradfitz commented 5 years ago

@cetinsert, QEMU is generally too slow and sometimes I hear it's too forgiving (e.g. accepting unaligned instructions that real hardware would reject). So we try to use real hardware when possible. If we do decide to go the emulation route we can run that on GCE where we have tons of x86 capacity.
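A sketch of the kind of discrepancy bradfitz alludes to: an unaligned 32-bit load. Go's compiler doesn't emit these for ordinary code, but unsafe code (or a codegen bug) can produce one; amd64 and lenient emulators tolerate it, while classic MIPS hardware faults with SIGBUS unless the kernel emulates the access. The helper names here are illustrative, not from any real test:

```go
// Sketch: an unaligned load vs. its portable byte-wise equivalent.
package main

import (
	"fmt"
	"unsafe"
)

// unalignedLoad reads a uint32 directly from b[off], even when the
// address is not 4-byte aligned. Tolerated on amd64; may SIGBUS on
// real MIPS hardware.
func unalignedLoad(b []byte, off int) uint32 {
	return *(*uint32)(unsafe.Pointer(&b[off]))
}

// alignedLoad copies the same bytes into an aligned word in native
// byte order, which is legal on every architecture.
func alignedLoad(b []byte, off int) uint32 {
	var v uint32
	copy((*[4]byte)(unsafe.Pointer(&v))[:], b[off:off+4])
	return v
}

func main() {
	buf := []byte{10, 20, 30, 40, 50, 60, 70, 80}
	// Offset 1 is odd, so the first load is unaligned.
	fmt.Println(unalignedLoad(buf, 1) == alignedLoad(buf, 1))
}
```

An emulator that silently performs the first load passes code that real hardware would kill, which is why real hardware is preferred for builders.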

0intro commented 5 years ago

Currently, Plan 9 runs on MikroTik RB450G (MIPS32 BE) and Lemote Yeeloong (MIPS64 LE). OCTEON (MIPS64 BE) and CI20 (MIPS32 LE) ports were in progress.

However, there is currently no plan to port Go to plan9/mips, though it could be interesting.

Anyway, I think these boards would likely be a little tight for running a (Linux) Go builder.

Ideally, you could try to get your hands on an OCTEON board, which usually has multiple cores and multiple gigabytes of memory.

CetinSert commented 5 years ago

@bradfitz which hardware was used by the former builders for the 4 variants for Linux?

bradfitz commented 5 years ago

@cetinsert, I don't actually know. It was run by somebody at MIPS who no longer appears to be employed there.

CetinSert commented 5 years ago

I have just asked MIPS and its parent company for review of this issue via their public contact forms.

0intro commented 5 years ago

There are currently OCTEON XL NICPro boards (16 cores @ 500 MHz, 2 GB memory) available on eBay for $40. For more money, you could probably find something better, like the 750 MHz variant of this board or newer models.

CetinSert commented 5 years ago

Can MIPS64 CPUs execute MIPS32 binaries for same endianness?

0intro commented 5 years ago

Yes, you can boot a 32-bit Linux kernel on a MIPS64 board.

cherrymui commented 5 years ago

Can MIPS64 CPUs execute MIPS32 binaries for same endianness?

Generally yes, but it depends. If you want one kernel to run both, you'd need to configure the kernel accordingly (I think it does so by default). Also, for cgo support I think you'd need to install both 32-bit and 64-bit C toolchains and libraries. Of course, you can also use two separate kernels and two separate userspaces, and switch between the two by rebooting.

Also note that the Go toolchain generates the MIPS32R1 ISA for MIPS32 and the MIPS III ISA for MIPS64. MIPS32R1 is newer, so a 64-bit MIPS III machine may not actually run the 32-bit Go port.
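The "same endianness" constraint above can be checked from inside a running binary. A minimal sketch (the helper name is illustrative): mips/mips64 are big-endian, mipsle/mips64le are little-endian, and a 64-bit kernel can only host 32-bit userland of its own byte order.

```go
// Sketch: determining the byte order the current build runs with.
package main

import (
	"fmt"
	"unsafe"
)

// byteOrder inspects the in-memory layout of a multi-byte integer:
// the low-order byte comes first on little-endian machines.
func byteOrder() string {
	x := uint16(0x0102)
	if *(*byte)(unsafe.Pointer(&x)) == 0x01 {
		return "big-endian"
	}
	return "little-endian"
}

func main() {
	fmt.Println(byteOrder())
}
```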

omac777 commented 5 years ago

On 4/3/19 8:01 AM, Jiaxun Yang wrote:

Alternatively, I could run it for you if you want to ship or inter-office mail it to me. Do you remember which board it is? I have a six-core Loongson 3a board 8GB RAM that used to have an old fedora running on it, but I

I assume you have a Loongson-3B1500 since Loongson-3A only has four cores. 3B1500 is an octa-core model but two cores are disabled on some boards due to a hardware erratum.

That's probably correct: 3B1500. I just tried booting it this morning, but no display on DVI or VGA. I bought a new lithium battery for the motherboard this morning and tried again, but still no display. I tried pulling out and re-inserting all the different cable connectors on the motherboard, but again no display.

I get one beep at bootup. That's it. I don't have the manual so I don't know what that one beep describes.

I think it would be best to place this motherboard in the hands of someone more capable than me. I am truly finding this motherboard a huge waste of my time.


daniel-santos commented 5 years ago

FYI, 1.12.4 builds for mipsel-openwrt-linux-musl, cross-compiling with GCC 7.3.0. I will be running tests today.

daniel-santos commented 5 years ago

Not bad for flying blind; far fewer failures than 1.10. Aside from fixedbugs/issue10607.go failing because gcc isn't installed and fixedbugs/issue10607.go running out of memory, the only failure is the one that also fails in 1.10. This is a MIPS 24KEc with 256MiB RAM.

# go run run.go -- nilptr.go
exit status 1
signal: segmentation fault

FAIL    nilptr.go       5.088s

Also, have you considered running them in qemu? Not sure if you would be testing Go or qemu at that point though. :)
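For context on the nilptr.go failure reported above: that test checks that Go turns nil-pointer dereferences into faults that surface as runtime panics (the real test also probes large offsets from nil, which requires low memory to be unmapped). A minimal sketch of just the basic mechanism, with an illustrative helper name:

```go
// Sketch: a nil dereference must fault and become a recoverable panic,
// never read garbage.
package main

import "fmt"

// deref dereferences p, converting the runtime panic from a nil
// pointer into an error via recover.
func deref(p *int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return *p, nil
}

func main() {
	if _, err := deref(nil); err != nil {
		fmt.Println("nil dereference panicked as expected")
	}
}
```

A segfault inside this test, as in the log above, means the fault was delivered but not translated into a Go panic on that platform.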

gopherbot commented 5 years ago

Change https://golang.org/cl/177918 mentions this issue: all: add linux-mips64le qemu builder, cross-compiling from a fast VM

gopherbot commented 5 years ago

Change https://golang.org/cl/178399 mentions this issue: cmd/dist: support using cross-compiled std test binaries for slow builders

mengzhuo commented 5 years ago

I have a Loongson 3B1500 box. I will try to add it to the build farm ASAP.

bcmills commented 5 years ago

The Go 1.14 development window is opening this week. If we're going to maintain support for the MIPS port in 1.14, it needs to have working builders.

mengzhuo commented 5 years ago

I have submitted CL https://go-review.googlesource.com/c/build/+/191577

Hope it can help.

gopherbot commented 5 years ago

Change https://golang.org/cl/193017 mentions this issue: dashboard: remove nonexistent linux-mips builders

gopherbot commented 5 years ago

Change https://golang.org/cl/191577 mentions this issue: dashboard: add linux-mipsle-mengzhuo builder

dmitshur commented 5 years ago

@mengzhuo I've submitted CL 191577, then redeployed the coordinator and the build.golang.org dashboard. Your builder appears to be running for go, net, and sys repositories as configured. It's getting this far:

Building Go cmd/dist using /usr/lib/golang.
Building Go toolchain1 using /usr/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.

Error: error copying response: unexpected EOF

mengzhuo commented 5 years ago

@dmitshur Hi, I'm sorry. However, it seems to be a network issue in China. I've changed to another proxy server in Hong Kong. Can you stop all pending mipsle builds?

dmitshur commented 5 years ago

No problem, it's expected that new builders take some time to work out the issues before they're fully operational.

Can you stop all mipsle pending builds?

We don't have an easy way of doing this, other than to remove the builder in code and redeploy. (As far as I know? /cc @bradfitz)

If it's not a problem for you, it might be better to keep it on. Once you resolve the networking issues, you can immediately start seeing what the next issue is, if any, without waiting to re-add the builder.

If you're worried about incorrect "FAIL" results, we can easily clear those when the builder is in a working state.

Let me know how you'd prefer to proceed.

mengzhuo commented 5 years ago

@dmitshur OK, I will fix this today.

(I have to go to work now :(

dmitshur commented 5 years ago

There's no immediate rush, this is a 5 month old issue. Take the time you need. Thank you for helping out with this!

mengzhuo commented 5 years ago

@dmitshur The log shows

Error: writeSnapshot: local error: tls: bad record MAC

I can only find the writeSnapshot function at https://github.com/golang/build/blob/01fd29966998a0a3ecd5d721de6bcde3ea9b9a6f/cmd/coordinator/coordinator.go#L2161. Is something wrong with the snapshot server?


      rev: e7e2b1c2b91320ef0ddf025d330061d56115dd53
 buildlet: http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  started: 2019-09-11 23:53:42.184067314 +0000 UTC m=+58.957360392
    ended: 2019-09-12 03:57:57.089056773 +0000 UTC m=+14713.862349853
  success: false

Events:
  2019-09-11T23:53:42Z checking_for_snapshot 
  2019-09-11T23:53:42Z finish_checking_for_snapshot after 191.2ms
  2019-09-11T23:53:42Z get_buildlet 
  2019-09-11T23:53:42Z wait_static_builder host-linux-mipsle-mengzhuo
  2019-09-11T23:53:42Z waiting_machine_in_use 
  2019-09-12T03:48:32Z finish_wait_static_builder after 3h54m49.4s; host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z clean_buildlet http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_clean_buildlet after 443ms; http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_get_buildlet after 3h54m50s
  2019-09-12T03:48:32Z using_buildlet ls-3a3k
  2019-09-12T03:48:32Z write_version_tar 
  2019-09-12T03:48:32Z get_source 
  2019-09-12T03:48:33Z finish_get_source after 0s
  2019-09-12T03:48:33Z write_go_src_tar 
  2019-09-12T03:50:00Z finish_write_go_src_tar after 1m27.4s
  2019-09-12T03:50:00Z make_and_test 
  2019-09-12T03:50:00Z make src/make.bash
  2019-09-12T03:56:10Z finish_make after 6m10.2s; src/make.bash
  2019-09-12T03:56:10Z clean_for_snapshot 
  2019-09-12T03:56:11Z finish_clean_for_snapshot after 178.3ms
  2019-09-12T03:56:11Z write_snapshot_to_gcs 
  2019-09-12T03:56:11Z fetch_snapshot_reader_from_buildlet 
  2019-09-12T03:56:11Z finish_fetch_snapshot_reader_from_buildlet after 345.6ms
  2019-09-12T03:57:56Z finish_write_snapshot_to_gcs after 1m45.7s; err=local error: tls: bad record MAC
  2019-09-12T03:57:56Z finish_make_and_test after 7m56.3s; err=writeSnapshot: local error: tls: bad record MAC

Build log:
linux-mips64le-mengzhuo at e7e2b1c2b91320ef0ddf025d330061d56115dd53

:: Running /tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash with args ["/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash"] and env ["LANG=en_US.UTF-8" "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin" "GO_BUILDER_ENV=host-linux-mipsle-mengzhuo" "WORKDIR=/tmp/workdir-host-linux-mipsle-mengzhuo" "HTTPS_PROXY=http://127.0.0.1:8123" "HTTP_PROXY=http://127.0.0.1:8123" "USER=root" "HOME=/root" "GO_STAGE0_NET_DELAY=800ms" "GO_STAGE0_DL_DELAY=300ms" "GOROOT_BOOTSTRAP=/tmp/workdir-host-linux-mipsle-mengzhuo/go1.4" "GO_BUILDER_NAME=linux-mips64le-mengzhuo" "GO_BUILDER_FLAKY_NET=1" "GOROOT_BOOTSTRAP=/usr/lib/golang" "GOMIPS64=hardfloat" "GOBIN=" "TMPDIR=/tmp/workdir-host-linux-mipsle-mengzhuo/tmp" "GOCACHE=/tmp/workdir-host-linux-mipsle-mengzhuo/gocache"] in dir /tmp/workdir-host-linux-mipsle-mengzhuo/go/src

Building Go cmd/dist using /usr/lib/golang.
Building Go toolchain1 using /usr/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/mips64le.
---
Installed Go for linux/mips64le in /tmp/workdir-host-linux-mipsle-mengzhuo/go
Installed commands in /tmp/workdir-host-linux-mipsle-mengzhuo/go/bin

Error: writeSnapshot: local error: tls: bad record MAC

odeke-em commented 5 years ago

@mengzhuo I've replied to your report at https://github.com/googleapis/google-cloud-go/issues/1581#issuecomment-531614955 asking some questions.

odeke-em commented 5 years ago

@mengzhuo Actually, since this involves running lots of code on your builder, to quickly verify whether the problem is a bad interaction between Go's TLS 1.3 and Google's TLS 1.2, you can apply this patch to your builder code in cmd/coordinator/gce.go:

diff --git a/cmd/coordinator/gce.go b/cmd/coordinator/gce.go
index e4e702d..809c647 100644
--- a/cmd/coordinator/gce.go
+++ b/cmd/coordinator/gce.go
@@ -12,6 +12,7 @@ package main

 import (
    "context"
+   "crypto/tls"
    "encoding/json"
    "errors"
    "fmt"
@@ -44,6 +45,7 @@ import (
    "golang.org/x/oauth2/google"
    compute "google.golang.org/api/compute/v1"
    "google.golang.org/api/googleapi"
+   "google.golang.org/api/option"
 )

 func init() {
@@ -137,20 +139,34 @@ func initGCE() error {
    cfgDump, _ := json.MarshalIndent(buildEnv, "", "  ")
    log.Printf("Loaded configuration %q for project %q:\n%s", *buildEnvName, buildEnv.ProjectName, cfgDump)

+   opts := []option.ClientOption{
+       // Force TLS 1.2 in the HTTP client because of issues:
+       // * https://github.com/golang/go/issues/31217
+       // * https://github.com/googleapis/google-cloud-go/issues/1581
+       // in which there might be a bad interaction with Go's TLS v1.3 and Google's TLS v1.2.
+       option.WithHTTPClient(&http.Client{
+           Transport: &http.Transport{
+               TLSClientConfig: &tls.Config{
+                   MaxVersion: tls.VersionTLS12,
+               },
+           },
+       }),
+   }
+
    ctx := context.Background()
    if *mode != "dev" {
-       storageClient, err = storage.NewClient(ctx)
+       storageClient, err = storage.NewClient(ctx, opts...)
        if err != nil {
            log.Fatalf("storage.NewClient: %v", err)
        }

-       metricsClient, err = monapi.NewMetricClient(ctx)
+       metricsClient, err = monapi.NewMetricClient(ctx, opts...)
        if err != nil {
            log.Fatalf("monapi.NewMetricClient: %v", err)
        }
    }

-   dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName)
+   dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName, opts...)
    if err != nil {
        if *mode == "dev" {
            log.Printf("Error creating datastore client: %v", err)

or, for the full file, which is big: it is the upstream cmd/coordinator/gce.go with only the changes above applied — the new crypto/tls and google.golang.org/api/option imports, the opts slice forcing MaxVersion: tls.VersionTLS12, and opts... passed to storage.NewClient, monapi.NewMetricClient, and datastore.NewClient.
func (p *gceBuildletPool) putVMCountQuota(cpu int) { p.mu.Lock() defer p.mu.Unlock() p.cpuUsage -= cpu p.cpuLeft += cpu p.instLeft++ } func (p *gceBuildletPool) setInstanceUsed(instName string, used bool) { p.mu.Lock() defer p.mu.Unlock() if p.inst == nil { p.inst = make(map[string]time.Time) } if used { p.inst[instName] = time.Now() } else { delete(p.inst, instName) } } func (p *gceBuildletPool) instanceUsed(instName string) bool { p.mu.Lock() defer p.mu.Unlock() _, ok := p.inst[instName] return ok } func (p *gceBuildletPool) instancesActive() (ret []resourceTime) { p.mu.Lock() defer p.mu.Unlock() for name, create := range p.inst { ret = append(ret, resourceTime{ name: name, creation: create, }) } sort.Sort(byCreationTime(ret)) return ret } // resourceTime is a GCE instance or Kube pod name and its creation time. type resourceTime struct { name string creation time.Time } type byCreationTime []resourceTime func (s byCreationTime) Len() int { return len(s) } func (s byCreationTime) Less(i, j int) bool { return s[i].creation.Before(s[j].creation) } func (s byCreationTime) Swap(i, j int) { s[i], s[j] = s[j], s[i] } // cleanUpOldVMs loops forever and periodically enumerates virtual // machines and deletes those which have expired. // // A VM is considered expired if it has a "delete-at" metadata // attribute having a unix timestamp before the current time. // // This is the safety mechanism to delete VMs which stray from the // normal deleting process. VMs are created to run a single build and // should be shut down by a controlling process. Due to various types // of failures, they might get stranded. To prevent them from getting // stranded and wasting resources forever, we instead set the // "delete-at" metadata attribute on them when created to some time // that's well beyond their expected lifetime. 
func (p *gceBuildletPool) cleanUpOldVMs() { if *mode == "dev" { return } if computeService == nil { return } // TODO(bradfitz): remove this list and just query it from the compute API? // http://godoc.org/google.golang.org/api/compute/v1#RegionsService.Get // and Region.Zones: http://godoc.org/google.golang.org/api/compute/v1#Region for { for _, zone := range buildEnv.ZonesToClean { if err := p.cleanZoneVMs(zone); err != nil { log.Printf("Error cleaning VMs in zone %q: %v", zone, err) } } time.Sleep(time.Minute) } } // cleanZoneVMs is part of cleanUpOldVMs, operating on a single zone. func (p *gceBuildletPool) cleanZoneVMs(zone string) error { // Fetch the first 500 (default) running instances and clean // thoes. We expect that we'll be running many fewer than // that. Even if we have more, eventually the first 500 will // either end or be cleaned, and then the next call will get a // partially-different 500. // TODO(bradfitz): revist this code if we ever start running // thousands of VMs. gceAPIGate() list, err := computeService.Instances.List(buildEnv.ProjectName, zone).Do() if err != nil { return fmt.Errorf("listing instances: %v", err) } for _, inst := range list.Items { if inst.Metadata == nil { // Defensive. Not seen in practice. 
continue } var sawDeleteAt bool var deleteReason string for _, it := range inst.Metadata.Items { if it.Key == "delete-at" { if it.Value == nil { log.Printf("missing delete-at value; ignoring") continue } unixDeadline, err := strconv.ParseInt(*it.Value, 10, 64) if err != nil { log.Printf("invalid delete-at value %q seen; ignoring", *it.Value) continue } sawDeleteAt = true if time.Now().Unix() > unixDeadline { deleteReason = "delete-at expiration" } } } isBuildlet := strings.HasPrefix(inst.Name, "buildlet-") if isBuildlet && !sawDeleteAt && !p.instanceUsed(inst.Name) { createdAt, _ := time.Parse(time.RFC3339Nano, inst.CreationTimestamp) if createdAt.Before(time.Now().Add(-3 * time.Hour)) { deleteReason = fmt.Sprintf("no delete-at, created at %s", inst.CreationTimestamp) } } // Delete buildlets (things we made) from previous // generations. Only deleting things starting with "buildlet-" // is a historical restriction, but still fine for paranoia. if deleteReason == "" && sawDeleteAt && isBuildlet && !p.instanceUsed(inst.Name) { if _, ok := deletedVMCache.Get(inst.Name); !ok { deleteReason = "from earlier coordinator generation" } } if deleteReason != "" { log.Printf("deleting VM %q in zone %q; %s ...", inst.Name, zone, deleteReason) deleteVM(zone, inst.Name) } } return nil } var deletedVMCache = lru.New(100) // keyed by instName // deleteVM starts a delete of an instance in a given zone. // // It either returns an operation name (if delete is pending) or the // empty string if the instance didn't exist. func deleteVM(zone, instName string) (operation string, err error) { deletedVMCache.Add(instName, token{}) gceAPIGate() op, err := computeService.Instances.Delete(buildEnv.ProjectName, zone, instName).Do() apiErr, ok := err.(*googleapi.Error) if ok { if apiErr.Code == 404 { return "", nil } } if err != nil { log.Printf("Failed to delete instance %q in zone %q: %v", instName, zone, err) return "", err } log.Printf("Sent request to delete instance %q in zone %q. 
Operation ID, Name: %v, %v", instName, zone, op.Id, op.Name) return op.Name, nil } func hasScope(want string) bool { // If not on GCE, assume full access if !metadata.OnGCE() { return true } scopes, err := metadata.Scopes("default") if err != nil { log.Printf("failed to query metadata default scopes: %v", err) return false } for _, v := range scopes { if v == want { return true } } return false } func hasComputeScope() bool { return hasScope(compute.ComputeScope) || hasScope(compute.CloudPlatformScope) } func hasStorageScope() bool { return hasScope(storage.ScopeReadWrite) || hasScope(storage.ScopeFullControl) || hasScope(compute.CloudPlatformScope) } func readGCSFile(name string) ([]byte, error) { if *mode == "dev" { b, ok := testFiles[name] if !ok { return nil, &os.PathError{ Op: "open", Path: name, Err: os.ErrNotExist, } } return []byte(b), nil } r, err := storageClient.Bucket(buildEnv.BuildletBucket).Object(name).NewReader(context.Background()) if err != nil { return nil, err } defer r.Close() return ioutil.ReadAll(r) } // syncBuildStatsLoop runs forever in its own goroutine and syncs the // coordinator's datastore Build & Span entities to BigQuery // periodically. func syncBuildStatsLoop(env *buildenv.Environment) { ticker := time.NewTicker(5 * time.Minute) for { ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute) if err := buildstats.SyncBuilds(ctx, env); err != nil { log.Printf("buildstats: SyncBuilds: %v", err) } if err := buildstats.SyncSpans(ctx, env); err != nil { log.Printf("buildstats: SyncSpans: %v", err) } cancel() <-ticker.C } } // createBasepinDisks creates zone-local copies of VM disk images, to // speed up VM creations in the future. // // Other than a list call, this a no-op unless new VM images were // added or updated recently. 
func createBasepinDisks(ctx context.Context) { if !metadata.OnGCE() || (buildEnv != buildenv.Production && buildEnv != buildenv.Staging) { return } for { t0 := time.Now() bgc, err := buildgo.NewClient(ctx, buildEnv) if err != nil { log.Printf("basepin: NewClient: %v", err) return } log.Printf("basepin: creating basepin disks...") err = bgc.MakeBasepinDisks(ctx) d := time.Since(t0).Round(time.Second / 10) if err != nil { basePinErr.Store(err.Error()) log.Printf("basepin: error creating basepin disks, after %v: %v", d, err) time.Sleep(5 * time.Minute) continue } basePinErr.Store("") log.Printf("basepin: created basepin disks after %v", d) return } } ```
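The dead-reckoning quota bookkeeping in the coordinator (`tryAllocateQuota` / `putVMCountQuota` above) reduces to a simple mutex-guarded reserve/release pattern. Here is a minimal, self-contained sketch of that idea; the `quotaPool` type and its field values are hypothetical, not part of x/build:

```go
package main

import (
	"fmt"
	"sync"
)

// quotaPool is a hypothetical, trimmed-down version of the
// dead-reckoning bookkeeping in gceBuildletPool: the real pool polls
// actual GCE quota periodically, then optimistically subtracts and
// adds back as VMs are created and deleted.
type quotaPool struct {
	mu       sync.Mutex
	cpuLeft  int // CPUs believed to remain
	instLeft int // instances believed to remain
}

// tryAllocate reserves numCPU CPUs and one instance if the
// dead-reckoned quota permits it.
func (p *quotaPool) tryAllocate(numCPU int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cpuLeft >= numCPU && p.instLeft >= 1 {
		p.cpuLeft -= numCPU
		p.instLeft--
		return true
	}
	return false
}

// release returns a previously reserved allocation to the pool,
// as putVMCountQuota does when a VM is deleted.
func (p *quotaPool) release(numCPU int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cpuLeft += numCPU
	p.instLeft++
}

func main() {
	p := &quotaPool{cpuLeft: 8, instLeft: 2}
	fmt.Println(p.tryAllocate(4)) // true: 8 CPUs available
	fmt.Println(p.tryAllocate(8)) // false: only 4 CPUs left
	p.release(4)
	fmt.Println(p.tryAllocate(8)) // true again after release
}
```

Callers that must block (like `awaitVMCountQuota`) just retry `tryAllocate` on a ticker until it succeeds or the context is canceled.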
dmitshur commented 5 years ago

Hmm, if the issue is on the cmd/coordinator side (which I or someone else on @golang/osp-team would have to deploy), I wonder why it's seemingly not affecting other builder types.

I'll look more into it on Monday.

Edit: It seems many of the builder configurations have SkipSnapshot set to true, which means few builders actually make snapshots, and the ones that do might all be running into this error. We started deploying cmd/coordinator with Go 1.13 only recently, which is when TLS 1.3 started being on by default.

mengzhuo commented 5 years ago

@dmitris I found that the coordinator was restarted 13 hours ago, but my buildlet still can't upload snapshots to GCS. Can I set SkipSnapshot for my host?

dmitshur commented 5 years ago

I didn't get to this today, but I will try tomorrow morning.

If we can't resolve the snapshot error on the coordinator side, then disabling it sounds reasonable. But I suspect it should be easy to fix. I'll post an update here after trying TLS 1.2.

Also, as a heads up, my GitHub username is @dmitshur. You're pinging the wrong Dmitri. :)

dmitshur commented 5 years ago

@mengzhuo I've deployed coordinator version "318271009812bf18ef7ef1785ecc9dffdbe7ee78-dirty-dmitshur-20190917T083031" just now with TLS 1.2, so we should be able to tell if it makes a difference with the write_snapshot_to_gcs step.