sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0

[Telemetry] Add anonymous usage and profiling statistics collection #146

Closed (bastienboutonnet closed this 2 years ago)

bastienboutonnet commented 2 years ago

Goal

To better understand the usage, adoption, and performance of our tool, we should collect usage and performance statistics. What we want to achieve with those are the following goals:

Those statistics must be strictly anonymous, will not pertain to any private information, and will be limited to tracking events that emanate from the tool itself.

What do we consider private and will never track?

Target Release: v2.2.0

Requirements

Proposed Solution

Event Collection Framework

Two generally accepted frameworks come to mind:

Event Schema & Attributes

Names and implementations are just ideas. Let's challenge those in this issue and we can update this once we're set.

~~- run_status: an alternative name could be invocation_status; it would allow us to know whether a scan, analyze, etc. executed successfully or not (exception types should be logged as well).~~
~~- metrics_count: number of metrics collected/derived in a scan.~~
~~- metrics_invoked: which metrics people are using. NOTE: we should not log custom metric derivation code (i.e. the SQL query of a custom metric or any formula, as this is considered proprietary information). When users use custom_sql we should just know that they used a custom metric.~~
~~- tests_count: count of tests evaluated.~~
~~- tests_invoked: same reasoning as the metrics list: if the tests are built-in and do not reveal any proprietary logic we should have a list; if not, let's leave it out.~~

JCZuurmond commented 2 years ago

Is this going to include the time a specific test/metric took to execute? I do not see it in the schema.

I would like to track this in some way, but it is hard to define what exactly the time for a test/metric is, because we combine multiple metrics into one query or sometimes split one over multiple queries. So what exactly to track I do not know (yet). But it would be nice to investigate whether we can provide execution-time details at a more granular level than invocation_end - invocation_start.

bastienboutonnet commented 2 years ago

Great question @JCZuurmond! @vijaykiran is still looking into the specifics of the implementation, so he might have more certain or up-to-date information. As far as I know, the traces schema from OpenTelemetry allows creating what they call "spans", which I believe we could attach to specific functions or "subprocesses" of the invocation/orchestrator function. My feeling is that this is indeed a very important thing for us to look into.

Depending on the complexity of implementing this, however, we might take a first stab with just global timing and refine it later.
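To make that concrete, here is a minimal sketch (not the actual soda-sql code; the decorator and function names are hypothetical) of how an OpenTelemetry span could be attached to a specific function of the invocation:

```python
# Sketch only: assumes the opentelemetry-api package is installed.
import functools

from opentelemetry import trace

tracer = trace.get_tracer("soda-sql")

def traced(func):
    """Wrap a function in its own span so its duration shows up in the trace."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(func.__name__):
            return func(*args, **kwargs)
    return wrapper

@traced
def run_scan():
    ...  # hypothetical orchestrator function whose duration we want to record
```

Without a configured tracer provider this degrades to a no-op, so the wrapper costs nothing until telemetry is switched on.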

vijaykiran commented 2 years ago

It is currently not possible to get granular metric-calculation times because we aggregate metrics in a single query. We can add spans at different levels to record, for example, how much time each query takes, but that will not give individual metric-calculation times. If we need those, we would have to change the execution significantly, which might have a performance impact in terms of more queries running than needed.

For tests it is certainly possible to have one span per test execution, but tests are mostly Python expressions, so I don't think the span time/diff will be meaningful: we would just be measuring the time taken by "a + b > 4"-style expressions.
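For illustration, a per-query span along those lines could look like the sketch below (assuming queries run through a DB-API cursor; tracer, cursor, sql, and metric_names are illustrative names):

```python
# Sketch: one span per executed query. Individual metric timings remain
# unavailable because several metrics are aggregated into this one statement.
with tracer.start_as_current_span("execute_query") as span:
    span.set_attribute("query.metrics_count", len(metric_names))
    cursor.execute(sql)
```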

JCZuurmond commented 2 years ago

I agree with everything you said, @vijaykiran.

I think we can start with the execution time of the whole scan and maybe add the time of the individual queries later. Even if we cannot determine the exact execution time of a single metric, I would still like to know how long the SQL takes vs. the other parts.

bastienboutonnet commented 2 years ago
Here's an in-depth update on the tracking requirements and their derivation:

| attribute_name | derivation | remarks |
| --- | --- | --- |
| user_cookie_id | UUID generated once and stored in ~/.soda/config.yaml | identifies a unique user across invocations |
| command_name | captured from click's command or the API method name | |
| command_options | options (CLI flags) | not sure how this works for API usage; where can we get that from in that case? |
| version | version.py or wherever else we store that in the codebase | |
| datasource_type | name of the dialect/warehouse (think also about soda-spark) | |
| datasource_id | hash of: host (Postgres, Redshift, Spark), account (Snowflake), project (BigQuery), and so on for any other adapters | allows us to group several users of one core data source, to help define "usage in production" / identify a "team" |
| sql_metrics_count | count of custom SQL metrics defined | still a question mark, depending on whether we see a real business or technical need |
| historic_metrics_count | count of usages of the newly introduced historic_metrics | still a question mark, depending on whether we see a real business or technical need |
| architecture | machine architecture (can be obtained from the os or platform libs) | |
| operating_system | | |
| python_version | | |
| python_implementation | | |
| invocation_start | span start from the OT framework | will help us derive how long sections of the code take to run; a first stab at profiling |
| invocation_end | span end from the OT framework | see above |
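As one possible reading of the user_cookie_id and datasource_id rows, here is a hedged sketch of how they could be derived (helper names and file handling are illustrative, not the actual implementation):

```python
import hashlib
import uuid
from pathlib import Path

import yaml

CONFIG_PATH = Path.home() / ".soda" / "config.yaml"

def user_cookie_id() -> str:
    """Return the anonymous user id, generating and persisting it on first use."""
    config = (yaml.safe_load(CONFIG_PATH.read_text()) or {}) if CONFIG_PATH.exists() else {}
    if "user_cookie_id" not in config:
        config["user_cookie_id"] = str(uuid.uuid4())
        CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
        CONFIG_PATH.write_text(yaml.safe_dump(config))
    return config["user_cookie_id"]

def datasource_id(identifier: str) -> str:
    """Hash the host/account/project so the raw value never leaves the machine."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()
```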

One question still remains open (see the note in the remarks for invocation_id).

My understanding is that an invocation is a run of a command (i.e. the outermost level of anything below). However, unless I misunderstand how trace and span ids work, it would seem that hashing those would simply result in a surrogate key at the level of a span.

For me a span is a subset of an invocation. In that case, would trace_id be what we're looking for, @vijaykiran?

vijaykiran commented 2 years ago

I think we can just use trace_id to identify one "execution" of any soda command or a scan.execute. Each span in a trace will correspond to an individual step, e.g. for a soda analyze warehouse.yml:

```
trace_id: 1
 |--- span 1 (config file reading)
 |--- span 2 (connection to the database)
 |--- span 3 (query for tables)
 |--- span 4 (execute the analyze query)
 ...
```

Does this make sense?
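In code, that hierarchy falls out of nesting spans under one root; a sketch (span names taken from the diagram above, everything else illustrative):

```python
with tracer.start_as_current_span("soda_analyze"):               # the trace root
    with tracer.start_as_current_span("read_config_file"):       # span 1
        ...
    with tracer.start_as_current_span("connect_to_database"):    # span 2
        ...
    with tracer.start_as_current_span("query_for_tables"):       # span 3
        ...
    with tracer.start_as_current_span("execute_analyze_query"):  # span 4
        ...
```

Every span opened inside the root shares its trace_id, which is what makes the trace usable as an invocation identifier.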

bastienboutonnet commented 2 years ago

I think it does, but to be extra sure: does this represent the granularity well?

```
{
    "trace": {"span"}, {"span"}, ...
}
```

If so, then yes, I think trace_id is what we mean by invocation id, so we can do away with that hash logic. If that is correct I'll go ahead and update the earlier message.
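If we go that route, reading the trace id is straightforward with the opentelemetry API (a sketch; formatting the id as 32 hex characters is one common convention):

```python
from opentelemetry import trace

ctx = trace.get_current_span().get_span_context()
invocation_id = format(ctx.trace_id, "032x")  # identical for every span in this invocation
```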

vijaykiran commented 2 years ago

Yes, a trace can contain several (hierarchical) spans.

bastienboutonnet commented 2 years ago

Sounds good. I updated the table to reflect that change.

m1n0 commented 2 years ago

A couple of thoughts: a) I am not sure whether we need multiple spans at all. From my understanding, OT has the concept of spans to accommodate more complex scenarios, for example:

  1. a request is made to a web service X, trace A is started, together with span A1
  2. web service X talks to another web service Y, this starts another span within trace A, span A2
  3. database Z is invoked and a record is updated, this is represented by span A3 within trace A

In our case I do not see any added value in splitting different parts of the soda-sql logic into spans. If we had subsystems where one subsystem could fail while the whole process did not, it would make sense to distinguish these and have separate spans with a separate status for each, etc., but this is not our case.

I would suggest keeping things simple and using one span per soda-sql invocation. This will also make the data much easier to understand. We can of course expand the complexity later if we want.

b) command_options: I would suggest distinguishing between options and arguments.

IMHO it's easy to join these into one during analysis, but impossible to split them apart automatically.

c) OT does not support dict values, only lists, so I would suggest breaking command_options into separate span attributes, each with its respective suffix, i.e.:

"attributes": {
    "cli_command_name": "analyze",
    "command_argument_warehouse_file": "warehouse.yml",
    "command_option_include": "",
    "command_option_exclude": "test",
    "command_option_limit": "",
    "datasource_type": "postgres",
    "datasource_hash": "connection hash"
},
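One way to produce exactly those flattened, prefixed attributes from click is sketched below (the helper is hypothetical, but click.Option, click.Argument, Command.params, and Context.params are real click APIs):

```python
import click

def command_attributes(ctx: click.Context) -> dict:
    """Flatten the invoked command's parameters into prefixed span attributes."""
    attributes = {"cli_command_name": ctx.command.name}
    for param in ctx.command.params:
        prefix = "command_option_" if isinstance(param, click.Option) else "command_argument_"
        value = ctx.params.get(param.name)
        attributes[prefix + param.name] = "" if value is None else str(value)
    return attributes
```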

d) platform/os/python info: OT has a predefined set of names for these, and they are separated from the span attributes into a resource object. I think this separation makes sense, as platform info is not directly related to specific span attributes. This is what it looks like:

"resource": {
    "telemetry.sdk.language": "python",
    "telemetry.sdk.name": "opentelemetry",
    "telemetry.sdk.version": "1.6.2",
    "os.architecture": [
        "64bit",
        ""
    ],
    "python.version": "3.9.7",
    "python.implementation": "CPython",
    "os.type": "Darwin",
    "os.version": "Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:24 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T8101",
    "platform": "macOS-12.0.1-arm64-arm-64bit",
    "service.version": "2.1.0b18",
    "service.name": "soda",
    "service.namespace": "soda-sql"
}

Would you agree that this separation and naming make sense? We can of course easily change it and push this info as our own span attributes.
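For reference, the resource shown above could be assembled like this with the Python SDK (a sketch: Resource.create and TracerProvider are real opentelemetry-sdk APIs; the values come from the stdlib platform module):

```python
import platform

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "soda",
    "service.namespace": "soda-sql",
    "service.version": "2.1.0b18",
    "os.type": platform.system(),
    "os.architecture": platform.architecture(),  # e.g. ("64bit", "")
    "python.version": platform.python_version(),
    "python.implementation": platform.python_implementation(),
    "platform": platform.platform(),
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```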

e) sql_metrics_count and historic_metrics_count - ok, skipping for now

All of the above is implemented in sodadata/soda-core#563; it needs review, some testing, and polishing.

I attached an example trace from a CLI analyze command on Postgres:

trace.txt

bastienboutonnet commented 2 years ago

Great questions @m1n0

Here's what I think about those:

a) One vs. multiple spans: for what we are trying to solve now, the level of granularity at which we are working does not require more than one span. However (@vijaykiran correct me if I'm wrong), if we want to track specific parts of the invocation more granularly (say, differentiate between the compute and queries we are making, the parts where we're pushing data to the backend, etc., i.e. move toward a more profiling-oriented monitoring solution), then spans will make sense. I agree that for now we don't need them; if we want to cover more granular needs later on, we will introduce them. It will be something we agree on, and there will be some burden, mainly on the transformation side, to deal properly with multiple spans; that burden will make sense if we want to track more granularly, and it is not impossible to solve at all. TL;DR: if we don't need it now, we don't build it that way, because it does not make sense, and we adjust later as needs come up.

b) I think the main business value we have identified here is in knowing what people use (i.e. how they use the tool, with what options, etc.), and I agree that there is a difference between the things you list. The distinction I think we should make is: what users do as a choice of usage vs. what they must do but that does not really reflect usage (your example of the warehouse file path is more the second case). Behaviourally we don't care where they point to, but we might care that they potentially override the defaults (if that is possible). If it always needs to be there, we will derive different insights, and indeed it's more of an analytical decision.

c) If this is a limitation we have to work with, so be it. Analytics will take care of dealing with it. The thing we need to be really good at in that case is well-prefixed and consistent names (as you have in your example), so that at the transformation stage we can parse things in a simple way.

d) I think it makes sense to me too.

e) I would say probably yes. I'll make sure to ping @maartenmasschelein and @mathissedestrooper to help us decide on the business value; neither I nor @vijaykiran was able to decide on this confidently. I'll get back to you as soon as I know, but I wouldn't let it block us. We can always add those in a follow-up PR, which should be trivial since our framework will be there, and it will be a good test of its ease of extensibility.

Looking at the trace you sent, I think this format is really workable. The only thing that jumps out now is that we are not yet using the final names, but we'll get that sorted in the PR, I would say.