sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.11k stars 1.29k forks source link

Strip private data before pushing to BigQuery #20628

Closed asdine closed 9 months ago

asdine commented 3 years ago

The LogEvent method is used to send events to both BigQuery and to the events table. We need to stop ~storing~ /~ sending the url and arguments column as they contain private user information and are not used for analytics anymore. ~Ensure no other part of the codebase is using these columns, through the EventStore) or through reading the table directly.~

Ensure that we do not send these to BigQuery anymore.

Cf: RFC-358

indradhanush commented 3 years ago

A summary of what I've been upto today:

I made a change to stop writing URL and Argument to the event_logs table and naively "fixed" the tests that were still writing to it here: https://github.com/sourcegraph/sourcegraph/compare/ig/private-analytics?expand=1

That's when I realized from my failing builds (should have checked first) that the columns are not nulllable. As a result I had to change my approach to writing an empty string instead. Here's what I suggest for a smooth zero-downtime approach:

  1. Write empty strings to these two columns and stop reading from the columns
  2. In the next release, we write a DB migration to make the fields nullable
  3. In the subsequent release, we permanently stop writing to the field
  4. Finally we drop the columns from the database

We could maybe merge 1, 2 and 3 in the same step but I fear this could lead to some write errors when the db migration is in progress while the code has already been updated to stop writing to these columns. I could be wrong about this as I do not have more insight into the deployment order. Also if these potential errors are not a huge concern then we should definitely merge these three steps into one.

Point number 1 above isn't without its pain points however. Here are the following list of touch points (AFAIK) which rely on reading data from the argument column and my feeling is that

  1. GetCodeInsightsUsageStatistics - Possibly owned by @sourcegraph/code-insights.
  2. aggregatedSearchEventsQuery - Not sure who owns this.
  3. GetExtensionsUsageStatistics - Not sure who owns this. Maybe @sourcegraph/code-intel ?
  4. eventLogsCountsQuery - Possibly owned by @sourcegraph/batchers
  5. aggregatedCodeIntelEventsQuery - Possibly owned by @sourcegraph/code-intel.

Here's what I have so far in this branch (different from the previous link since I had to change my approach): https://github.com/sourcegraph/sourcegraph/compare/ig/private-analytics-url-args?expand=1

Some tests are rightly failing here which I believe is a symptom of the above touch points relying on the argument column for their queries while the change now writes an empty {} to it.

mrnugget commented 3 years ago

4. eventLogsCountsQuery - Possibly owned by @sourcegraph/batchers

We don't use the URL of event_logs, but we do use the arguments column to store "metadata" along with the event: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@40c5c3d665983f879465daa32f7d0154db024791/-/blob/enterprise/internal/batches/resolvers/resolver.go#L99-114

But I don't see how that is private data?

felixfbecker commented 3 years ago

Code insights doesn't make use of the url to my knowledge. The arguments are important but contain non-sensitive data. We'll make sure to not add sensitive data to these for the metrics we add in the future as well.

I didn't actually know we tracked the real url. I've seen in other cases we instead match the route pattern (like literally /insights/:insightId) which would not be confidential, so that may be a good thing to replace any instances with that do track/make use of the URL.

aggregatedSearchEventsQuery - Not sure who owns this.

I'd guess @sourcegraph/search-core or @sourcegraph/search-product

GetExtensionsUsageStatistics - Not sure who owns this. Maybe @sourcegraph/code-intel ?

That would be @sourcegraph/extensibility (specifically @tjkandala worked on extension metrics I think)

rvantonder commented 3 years ago

aggregatedSearchEventsQuery - Not sure who owns this.

As someone who is catching up on what is going on in our metrics: No one owns this. It is an abandoned piece of our metrics code that should be owned and not abandoned.

We need to stop storing / sending the url and arguments column as they contain private user information and are not used for analytics anymore.

For search usage metrics, this argument column is what tracks things considered non-sensitive statistics (use of filters like "repo"). So I am not sure that "are not used for analytics anymore" is accurate. It is the case that this data is not currently used because we stopped sending this data back in pings due to a regression that I'm still investigating: https://github.com/sourcegraph/sourcegraph/issues/20813

Summary: Is it accurate that argument contains sensitive/private data? My reading is that this is an important place where we log non-sensitive data (and data should be sanitized). It should not be the case that this is "unused". If the proposal is that this should not be used, then we will invariably need some other place to store the non-sensitive statistics above. My guess is that it should stay and we should enforce the use of argument to be non-sensitive data.

FYI, this front-end code sanitizes queries that end up in the argument column for search queries. I can't speak for other uses of the argument column for code-intel, etc.

Strum355 commented 3 years ago

To add the code-intel perspective to the eternal archive that is the issue tracker:

we currently actively use event_logs for 2 purposes (that I am aware of)

  1. Determining autoindex candidate repos based on the ratio of search-based to precise code intel event rows in event_logs for the given repo.
    • Today, we use both url and argument for this. I hacked on this today to no longer require the url column anymore, instead bundling the repository ID in the argument column. This change is not live yet and will be pending Erics review.
  2. The languageId field of the json data stored in argument column is used to populate Looker to give us a language breakdown of the search based vs precise code intel results for analytics and product planning etc.
    • Not sure the data source of Looker, all I know is the data is exported periodically from postgres to somewhere. BigQuery?
rvantonder commented 3 years ago

We need to stop storing / sending the url and arguments column as they contain private user information and are not used for analytics anymore. Ensure no other part of the codebase is using these columns, through the EventStore) or through reading the table directly.

@asdine can you please clarify whether this issue aims to strip/sanitize logged event data for argument only for private code, or only for cloud, or entirely? I can't easily tell from RFC 358. As an FYI, if we strip argument values for any of these, it will basically remove a subset of search "pings" data we care about as listed in RFC 358. This might be fine, or not, depending on what the reviewers in RFC 358 agree on.

rvantonder commented 3 years ago

@Strum355 about (2), because I happened to look at this, yeah, it flows to big query via this schema (language_id): https://github.com/sourcegraph/sourcegraph/pull/15635

efritz commented 3 years ago

This seems like a huge hammer.

ryanslade commented 3 years ago

This seems like a huge hammer.

Could you clarify what you mean please? Are you referring to the issue as a whole or the PR linked above?

efritz commented 3 years ago

Sorry, the conversation about this was split in too many places. I'm still getting context on this.