Closed asdine closed 9 months ago
A summary of what I've been upto today:
I made a change to stop writing URL
and Argument
to the event_logs
table and naively "fixed" the tests that were still writing to it here: https://github.com/sourcegraph/sourcegraph/compare/ig/private-analytics?expand=1
That's when I realized from my failing builds (should have checked first) that the columns are not nulllable
. As a result I had to change my approach to writing an empty string instead. Here's what I suggest for a smooth zero-downtime approach:
We could maybe merge 1
, 2
and 3
in the same step but I fear this could lead to some write errors when the db migration is in progress while the code has already been updated to stop writing to these columns. I could be wrong about this as I do not have more insight into the deployment order. Also if these potential errors are not a huge concern then we should definitely merge these three steps into one.
Point number 1
above isn't without its pain points however. Here are the following list of touch points (AFAIK) which rely on reading data from the argument
column and my feeling is that
Here's what I have so far in this branch (different from the previous link since I had to change my approach): https://github.com/sourcegraph/sourcegraph/compare/ig/private-analytics-url-args?expand=1
Some tests are rightly failing here which I believe is a symptom of the above touch points relying on the argument
column for their queries while the change now writes an empty {}
to it.
4. eventLogsCountsQuery - Possibly owned by @sourcegraph/batchers
We don't use the URL
of event_logs
, but we do use the arguments
column to store "metadata" along with the event: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@40c5c3d665983f879465daa32f7d0154db024791/-/blob/enterprise/internal/batches/resolvers/resolver.go#L99-114
But I don't see how that is private data?
Code insights doesn't make use of the url
to my knowledge. The arguments
are important but contain non-sensitive data. We'll make sure to not add sensitive data to these for the metrics we add in the future as well.
I didn't actually know we tracked the real url
. I've seen in other cases we instead match the route pattern (like literally /insights/:insightId
) which would not be confidential, so that may be a good thing to replace any instances with that do track/make use of the URL.
aggregatedSearchEventsQuery - Not sure who owns this.
I'd guess @sourcegraph/search-core or @sourcegraph/search-product
GetExtensionsUsageStatistics - Not sure who owns this. Maybe @sourcegraph/code-intel ?
That would be @sourcegraph/extensibility (specifically @tjkandala worked on extension metrics I think)
aggregatedSearchEventsQuery - Not sure who owns this.
As someone who is catching up on what is going on in our metrics: No one owns this. It is an abandoned piece of our metrics code that should be owned and not abandoned.
We need to stop storing / sending the url and arguments column as they contain private user information and are not used for analytics anymore.
For search usage metrics, this argument
column is what tracks things considered non-sensitive statistics (use of filters like "repo"). So I am not sure that "are not used for analytics anymore" is accurate. It is the case that this data is not currently used because we stopped sending this data back in pings due to a regression that I'm still investigating: https://github.com/sourcegraph/sourcegraph/issues/20813
Summary: Is it accurate that argument
contains sensitive/private data? My reading is that this is an important place where we log non-sensitive data (and data should be sanitized). It should not be the case that this is "unused". If the proposal is that this should not be used, then we will invariably need some other place to store the non-sensitive statistics above. My guess is that it should stay and we should enforce the use of argument
to be non-sensitive data.
FYI, this front-end code sanitizes queries that end up in the argument
column for search queries. I can't speak for other uses of the argument
column for code-intel, etc.
To add the code-intel perspective to the eternal archive that is the issue tracker:
we currently actively use event_logs for 2 purposes (that I am aware of)
url
and argument
for this. I hacked on this today to no longer require the url
column anymore, instead bundling the repository ID in the argument
column. This change is not live yet and will be pending Erics review.languageId
field of the json data stored in argument
column is used to populate Looker to give us a language breakdown of the search based vs precise code intel results for analytics and product planning etc.
We need to stop storing / sending the url and arguments column as they contain private user information and are not used for analytics anymore. Ensure no other part of the codebase is using these columns, through the EventStore) or through reading the table directly.
@asdine can you please clarify whether this issue aims to strip/sanitize logged event data for argument
only for private code, or only for cloud, or entirely? I can't easily tell from RFC 358. As an FYI, if we strip argument
values for any of these, it will basically remove a subset of search "pings" data we care about as listed in RFC 358. This might be fine, or not, depending on what the reviewers in RFC 358 agree on.
@Strum355 about (2), because I happened to look at this, yeah, it flows to big query via this schema (language_id
): https://github.com/sourcegraph/sourcegraph/pull/15635
This seems like a huge hammer.
This seems like a huge hammer.
Could you clarify what you mean please? Are you referring to the issue as a whole or the PR linked above?
Sorry, the conversation about this was split in too many places. I'm still getting context on this.
The LogEvent method is used to send events to both BigQuery and to the
events
table. We need to stop ~storing~ /~ sending theurl
andarguments
column as they contain private user information and are not used for analytics anymore. ~Ensure no other part of the codebase is using these columns, through the EventStore) or through reading the table directly.~Ensure that we do not send these to BigQuery anymore.
Cf: RFC-358