Open findleyr opened 4 months ago
I would also be concerned about races between steps (3) and (6), particularly on Windows (where a file normally can't be deleted while another process has an open handle to it).
@bcmills indeed, if deleting the counter files fails, the error is logged but no other action is taken. However, given that step 6 happens after the report is written, I don't think we have to be worried about this leading to duplicate uploads--that would only be a problem if some other process deleted the local reports, but left the counter files in place. Otherwise, the counter files will "eventually" be deleted because of the logic in step (4): if a report already exists, we still try to delete the counter files.
Are you concerned about having stray counter files around? The race that you describe could lead to a situation where you have e.g.
local/
go-1.22-linux-amd64-2024-02-26.v1.count
local.2024-02-26.json
That .v1.count file shouldn't be there, but is "harmless" because it will never be uploaded, because the presence of local.2024-02-26.json will prevent the upload. Furthermore, the counter file would eventually be deleted by step (4). Is that a problem? Aside from cruft, the only downside I foresee is unnecessary work, because reports are needlessly produced during the upload process.
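To make the "harmless stray file" argument concrete, here is a minimal sketch of the behavior described above: if the local report already exists, skip the upload but still try to delete the counter files, logging (not failing on) deletion errors. The helper names reportExists, removeCounterFiles, and strayFileDemo are hypothetical and are not the actual internal/upload API; only the file names come from the example above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// reportExists reports whether local.<date>.json is already present in dir.
// Hypothetical helper, not part of the real upload package.
func reportExists(dir, date string) bool {
	_, err := os.Stat(filepath.Join(dir, "local."+date+".json"))
	return err == nil
}

// removeCounterFiles tries to delete each counter file. Failures are logged
// and counted but otherwise ignored, so a later run can retry the deletion.
func removeCounterFiles(paths []string) int {
	failed := 0
	for _, p := range paths {
		err := os.Remove(p)
		if err != nil && !errors.Is(err, fs.ErrNotExist) {
			log.Printf("removing counter file %s: %v", p, err)
			failed++
		}
	}
	return failed
}

// strayFileDemo recreates the directory layout from the discussion in a
// temporary directory: a stray counter file next to an existing local report.
func strayFileDemo() (skipUpload bool, leftover int, err error) {
	dir, err := os.MkdirTemp("", "telemetry-sketch")
	if err != nil {
		return false, 0, err
	}
	defer os.RemoveAll(dir)

	counter := filepath.Join(dir, "go-1.22-linux-amd64-2024-02-26.v1.count")
	report := filepath.Join(dir, "local.2024-02-26.json")
	for _, p := range []string{counter, report} {
		if err := os.WriteFile(p, []byte("{}"), 0o644); err != nil {
			return false, 0, err
		}
	}

	// The existing report means "someone else got there first":
	// do not upload, but still clean up the counter files.
	skipUpload = reportExists(dir, "2024-02-26")
	if skipUpload {
		removeCounterFiles([]string{counter})
	}
	rest, err := filepath.Glob(filepath.Join(dir, "*.v1.count"))
	return skipUpload, len(rest), err
}

func main() {
	skip, leftover, err := strayFileDemo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("skip upload:", skip, "leftover counter files:", leftover)
}
```

Note that the os.Stat check here is not atomic with respect to other processes; the sketch only illustrates why a leftover counter file does not by itself trigger a second upload.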
Yeah — if the error in deleting the file doesn't cause any other ill effects, that's probably an ok failure mode.
Repurposing this issue to track generally auditing the upload process. Assigning to myself as I'm working on it.
Change https://go.dev/cl/584400 mentions this issue: internal/upload: don't skip all uploads on the first report failure
Change https://go.dev/cl/584305 mentions this issue: internal/upload: add an end-to-end test for mode handling
Change https://go.dev/cl/584402 mentions this issue: internal/counter: remove fall-back logic in Parse for missing TimeBegin
Change https://go.dev/cl/584795 mentions this issue: internal/upload: migrate TestUploadBasic and TestUploadFailure
Change https://go.dev/cl/586141 mentions this issue: start,internal/upload: add two tests for concurrent upload
Change https://go.dev/cl/587197 mentions this issue: internal/upload: make upload.Run concurrency safe
We were discussing the potential raciness of telemetry uploading in the context of a thundering herd of Go commands. I looked into this in more detail. Here's a paraphrased summary of the upload process:
... X value that is specific to this operation (i.e. not a hash of report content). This X determines both upload sampling and uploaded report names. Skip counter files that can't be read (e.g. because they've been deleted by one of the following steps...).
... <date>.json (the upload report) and local.<date>.json, check if they already exist. If either of them already exists, assume something else got there first, delete all the counter files, and exit.
... (<date>.json) to telemetry.go.dev/upload/<date>/<X>.json
I'm not concerned about races involving user intervention, such as deleting the telemetry directory in the middle of an upload. However, it does look like there's a race between steps (4) and (7) that could theoretically lead to duplicate uploads: two processes overwrite each other's report with different X values, and each thinks it should upload. But it looks very unlikely given how much has to happen in the unsynchronized period, and we can probably avoid the race by replacing (4) and (5) with an exclusive write, and not deleting files if that write fails. Nevertheless, I think we should consider this more.
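As a rough illustration of the "exclusive write" idea, the sketch below creates the report with O_CREATE|O_EXCL, so that only one process can win, and treats "already exists" as a signal to back off without uploading or deleting anything. The names claimReport and raceTwice, the report contents, and the file permissions are assumptions made for this example; this is not the actual fix that landed in internal/upload.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// claimReport tries to create path exclusively and write data to it.
// It returns true only if this call created the file. If the file already
// exists it returns (false, nil): another process got there first, and the
// caller should neither upload nor delete counter files.
// Hypothetical helper, not part of the real upload package.
func claimReport(path string, data []byte) (bool, error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		if errors.Is(err, fs.ErrExist) {
			return false, nil
		}
		return false, err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return false, err
	}
	if err := f.Close(); err != nil {
		return false, err
	}
	return true, nil
}

// raceTwice simulates two uploaders with different X values claiming the
// same report name, one after the other, in a temporary directory.
func raceTwice() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "upload-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "local.2024-02-26.json")
	first, err := claimReport(path, []byte(`{"X":0.1}`))
	if err != nil {
		return false, false, err
	}
	second, err := claimReport(path, []byte(`{"X":0.9}`))
	return first, second, err
}

func main() {
	first, second, err := raceTwice()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("first claimed:", first, "second claimed:", second)
}
```

One caveat with this sketch: if the write fails partway, a truncated report is left behind under the final name. A more careful version might write to a temporary file first and then publish it under the final name with an operation that fails if the target exists.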
Bugs discovered: