golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
121.14k stars 17.37k forks source link

x/telemetry: audit upload process for 1.23 #65970

Open findleyr opened 4 months ago

findleyr commented 4 months ago

We were discussing the potential raciness of telemetry uploading in the context of a thundering herd of Go commands. I looked into this in more detail. Here's a paraphrased summary of the upload process:

  1. Collect all counter file names.
  2. Group by expiry. Ignore expiries that are not in the past (meaning at least a day in the past)
  3. For each group of counter files with the same expiry, compute both "local" and "uploadable" reports, using a random X value that is specific to this operations (i.e. not a hash of report content). This X determines both upload sampling and uploaded report names. Skip counter files that can't be read (e.g. because they've been deleted by one of the following steps...).
  4. Before writing <date>.json (the upload report) and local.<date>.json, check if they already exist. If either of them already exists, assume something else got there first, delete all the counter files, and exit.
  5. Then try to write the report files.
  6. Finally, delete all the counter files.
  7. POST the uploadable report (<date>.json) to telemetry.go.dev/upload/<date>/<X>.json

I'm not concerned about races involving user intervention, such as e.g. deleting the telemetry directory in the middle of an upload. However, it does look like there's a race between steps (4) and (7) that could theoretically lead to duplicate uploads: two processes overwrite each other's report with different X values, and each thinks they should upload. But it looks very unlikely given how much has to happen in the unsynchronized period, and we can probably avoid the race by replacing (4) and (5) with an exclusive write, and not deleting files if that write fails.

Nevertheless, I think we should consider this more.

Bugs discovered:

bcmills commented 4 months ago

I would also be concerned about races between steps (3) and (6), particularly on Windows (where a file normally can't be deleted while another process has an open handle to it).

findleyr commented 4 months ago

@bcmills indeed, if deleting the counter files fails, the error is logged but no other action is taken. However, given that step 6 happens after the report is written, I don't think we have to be worried about this leading to duplicate uploads--that would only be a problem if some other process deleted the local reports, but left the counter files in place. Otherwise, the counter files will "eventually" be deleted because of the logic in step (4): if a report already exists, we still try to delete the counter files.

Are you concerned about having stray counter files around? The race that you describe could lead to A situation where you have e.g.

local/
 go-1.22-linux-amd64-2024-02-26.v1.count
 local.2024-02-26.json

That .v1.count file shouldn't be there, but is "harmless" because it will never be uploaded, because the presence of local.2024-02-26.json will prevent the upload. Furthermore, the counter file would eventually be deleted by step (4). Is that a problem? Aside from cruft, the only downside I foresee is unnecessary work, because reports are needlessly produced during the upload process.

bcmills commented 4 months ago

Yeah — if the error in deleting the file doesn't cause any other ill effects, that's probably an ok failure mode.

findleyr commented 1 month ago

Repurposing this issue to track generally auditing the upload process. Assigning to myself as I'm working on it.

gopherbot commented 1 month ago

Change https://go.dev/cl/584400 mentions this issue: internal/upload: don't skip all uploads on the first report failure

gopherbot commented 1 month ago

Change https://go.dev/cl/584305 mentions this issue: internal/upload: add an end-to-end test for mode handling

gopherbot commented 1 month ago

Change https://go.dev/cl/584402 mentions this issue: internal/counter: remove fall-back logic in Parse for missing TimeBegin

gopherbot commented 1 month ago

Change https://go.dev/cl/584795 mentions this issue: internal/upload: migrate TestUploadBasic and TestUploadFailure

gopherbot commented 1 month ago

Change https://go.dev/cl/586141 mentions this issue: start,internal/upload: add two tests for concurrent upload

gopherbot commented 1 month ago

Change https://go.dev/cl/587197 mentions this issue: internal/upload: make upload.Run concurrency safe