sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0

[Telemetry] Add anonymous usage and profiling statistics collection #146

Closed (bastienboutonnet closed this 2 years ago)

bastienboutonnet commented 2 years ago

Goal

To better understand the usage, adoption, and performance of our tool, we should collect usage and performance statistics. What we want to achieve with those are the following goals:

Those statistics must be strictly anonymous, will not pertain to any private information, and will be limited to tracking events that emanate from the tool itself.

What do we consider private and will never track?

Target Release: v2.2.0

Requirements

Proposed Solution

Event Collection Framework

Two generally accepted frameworks come to mind:

Event Schema & Attributes

Names and implementations are just ideas. Let's challenge those in this issue and we can update this once we're set.

~~- run_status: an alternative name could be invocation_status; it would allow us to know whether a scan, analyze, etc. executed successfully or not (exception types should be logged as well).~~
~~- metrics_count: number of metrics collected/derived in a scan.~~
~~- metrics_invoked: which metrics people are using. NOTE: we should not log custom metric derivation code (i.e. the SQL query of a custom metric or any formula, as this is considered proprietary information). When users use custom_sql we should just know that they used a custom metric.~~
~~- tests_count: count of tests evaluated.~~
~~- tests_invoked: same reasoning as the metrics list: if the tests are built-in and do not reveal any proprietary logic we should have a list; if not, let's leave it out.~~

JCZuurmond commented 2 years ago

Is this going to include the time a specific test/metric took to execute? I do not see it in the schema.

I would like to track this in some way, but it is hard to define what exactly the time for a test/metric is, because we combine multiple metrics into one query or sometimes split one over multiple queries. So what exactly to track I do not know (yet). But it would be nice to investigate whether we can provide execution-time details at a more granular level than invocation_end - invocation_start.

bastienboutonnet commented 2 years ago

Great question @JCZuurmond! @vijaykiran is still looking into the specifics of the implementation, so he might have more certain or up-to-date information. As far as I know, the traces schema from OpenTelemetry allows creating what they call "spans", which I believe we could attach to specific functions or "subprocesses" of the invocation/orchestrator function. My feeling is that this is indeed a very important thing for us to look into.

Depending on the complexity of implementing this, however, we might take a first stab with just global timing and refine it later.
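To make that concrete, here is a minimal sketch (not the actual soda-sql code; the decorator and function names are hypothetical) of how an OpenTelemetry span could be attached to a specific function of the invocation:

```python
# Sketch only: assumes the opentelemetry-api package is installed.
import functools

from opentelemetry import trace

tracer = trace.get_tracer("soda-sql")

def traced(func):
    """Wrap a function in its own span so its duration shows up in the trace."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(func.__name__):
            return func(*args, **kwargs)
    return wrapper

@traced
def run_scan():
    ...  # hypothetical orchestrator function whose duration we want to record
```

Without a configured tracer provider this degrades to a no-op, so the wrapper costs nothing until telemetry is switched on.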

vijaykiran commented 2 years ago

It is currently not possible to get granular metric-calculation times because we aggregate metrics in a single query. We can add spans at different levels to record, for example, how much time each query takes, but that will not give individual metric-calculation times. If we need those, we would have to change the execution significantly, which might have a performance impact in terms of more queries running than needed.

For tests it is certainly possible to have one span per test execution, but tests are mostly Python expressions, so I don't think the span time/diff will be meaningful: we would just be measuring the time taken by "a + b > 4"-style expressions.
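For illustration, a per-query span along those lines could look like the sketch below (assuming queries run through a DB-API cursor; tracer, cursor, sql, and metric_names are illustrative names):

```python
# Sketch: one span per executed query. Individual metric timings remain
# unavailable because several metrics are aggregated into this one statement.
with tracer.start_as_current_span("execute_query") as span:
    span.set_attribute("query.metrics_count", len(metric_names))
    cursor.execute(sql)
```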

JCZuurmond commented 2 years ago

I agree with everything you said, @vijaykiran.

I think we can start with the execution time of the whole scan and maybe add the time of the individual queries later. Even if we cannot determine the exact execution time of a single metric, I would still like to know how long the SQL takes vs. the other parts.

bastienboutonnet commented 2 years ago
Here's an in-depth update on the tracking requirements and their derivation:

| attribute_name | derivation | remarks |
| --- | --- | --- |
| user_cookie_id | UUID generated once and stored in ~/.soda/config.yaml | identifies a unique user across invocations |
| command_name | captured from click's command or the API method name | |
| command_options | options (CLI flags) | not sure how this works for API usage; where can we get that from in that case? |
| version | version.py or wherever else we store that in the codebase | |
| datasource_type | name of the dialect/warehouse (think also about soda-spark) | |
| datasource_id | hash of: host (Postgres, Redshift, Spark), account (Snowflake), project (BigQuery), and so on for any other adapters | allows us to group several users of one core data source, to help define "usage in production" / identify a "team" |
| sql_metrics_count | count of custom SQL metrics defined | still a question mark, depending on whether we see a real business or technical need |
| historic_metrics_count | count of usages of the newly introduced historic_metrics | still a question mark, depending on whether we see a real business or technical need |
| architecture | machine architecture (can be obtained from the os or platform libs) | |
| operating_system | | |
| python_version | | |
| python_implementation | | |
| invocation_start | span start from the OT framework | will help us derive how long sections of the code take to run; a first stab at profiling |
| invocation_end | span end from the OT framework | see above |
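As one possible reading of the user_cookie_id and datasource_id rows, here is a hedged sketch of how they could be derived (helper names and file handling are illustrative, not the actual implementation):

```python
import hashlib
import uuid
from pathlib import Path

import yaml

CONFIG_PATH = Path.home() / ".soda" / "config.yaml"

def user_cookie_id() -> str:
    """Return the anonymous user id, generating and persisting it on first use."""
    config = (yaml.safe_load(CONFIG_PATH.read_text()) or {}) if CONFIG_PATH.exists() else {}
    if "user_cookie_id" not in config:
        config["user_cookie_id"] = str(uuid.uuid4())
        CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
        CONFIG_PATH.write_text(yaml.safe_dump(config))
    return config["user_cookie_id"]

def datasource_id(identifier: str) -> str:
    """Hash the host/account/project so the raw value never leaves the machine."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()
```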

One question still remains open (see the note in the remarks for invocation_id).

My understanding is that an invocation is a run of a command (i.e. the outermost level of anything below). However, unless I misunderstand how trace and span ids work, it would seem that hashing those would simply result in a surrogate key at the level of a span.

For me a span is a subset of an invocation. In that case, would trace_id be what we're looking for, @vijaykiran?

vijaykiran commented 2 years ago

I think we can just use trace_id to identify one "execution" of any soda command or a scan.execute. Each span in a trace will correspond to an individual step, e.g. for a soda analyze warehouse.yml:

```
trace_id: 1
 |--- span 1 (config file reading)
 |--- span 2 (connection to the database)
 |--- span 3 (query for tables)
 |--- span 4 (execute the analyze query)
 ...
```

Does this make sense?
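In code, that hierarchy falls out of nesting spans under one root; a sketch (span names taken from the diagram above, everything else illustrative):

```python
with tracer.start_as_current_span("soda_analyze"):               # the trace root
    with tracer.start_as_current_span("read_config_file"):       # span 1
        ...
    with tracer.start_as_current_span("connect_to_database"):    # span 2
        ...
    with tracer.start_as_current_span("query_for_tables"):       # span 3
        ...
    with tracer.start_as_current_span("execute_analyze_query"):  # span 4
        ...
```

Every span opened inside the root shares its trace_id, which is what makes the trace usable as an invocation identifier.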

bastienboutonnet commented 2 years ago

I think it does, but to be extra sure: does this represent the granularity well?

```
{
    "trace": {"span"}, {"span"}, ...
}
```

If so, then yes, I think trace_id is what we mean by invocation id, so we can do away with that hash logic. If that is correct I'll go ahead and update the earlier message.
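If we go that route, reading the trace id is straightforward with the opentelemetry API (a sketch; formatting the id as 32 hex characters is one common convention):

```python
from opentelemetry import trace

ctx = trace.get_current_span().get_span_context()
invocation_id = format(ctx.trace_id, "032x")  # identical for every span in this invocation
```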

vijaykiran commented 2 years ago

Yes, a trace can contain several (hierarchical) spans.

bastienboutonnet commented 2 years ago

Sounds good. I updated the table to reflect that change.

m1n0 commented 2 years ago

A couple of thoughts: a) I am not sure whether we need multiple spans at all. From my understanding, OT has the concept of spans to accommodate more complex scenarios, for example:

  1. a request is made to a web service X, trace A is started, together with span A1
  2. web service X talks to another web service Y, this starts another span within trace A, span A2
  3. database Z is invoked and a record is updated, this is represented by span A3 within trace A

In our case I do not see any added value in splitting different parts of the soda-sql logic into spans. If we had subsystems where one subsystem could fail while the whole process did not, it would make sense to distinguish these and have separate spans with a separate status for each, etc., but this is not our case.

I would suggest keeping things simple and using one span per soda-sql invocation. This will also make the data much easier to understand. We can of course expand the complexity later if we want.

b) command_options: I would suggest distinguishing between options and arguments.

IMHO it's easy to join these into one during analysis, but impossible to split them apart automatically.

c) OT does not support dict values, only lists, so I would suggest breaking command_options into separate span attributes, each with its respective suffix, i.e.:

"attributes": {
    "cli_command_name": "analyze",
    "command_argument_warehouse_file": "warehouse.yml",
    "command_option_include": "",
    "command_option_exclude": "test",
    "command_option_limit": "",
    "datasource_type": "postgres",
    "datasource_hash": "connection hash"
},
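One way to produce exactly those flattened, prefixed attributes from click is sketched below (the helper is hypothetical, but click.Option, click.Argument, Command.params, and Context.params are real click APIs):

```python
import click

def command_attributes(ctx: click.Context) -> dict:
    """Flatten the invoked command's parameters into prefixed span attributes."""
    attributes = {"cli_command_name": ctx.command.name}
    for param in ctx.command.params:
        prefix = "command_option_" if isinstance(param, click.Option) else "command_argument_"
        value = ctx.params.get(param.name)
        attributes[prefix + param.name] = "" if value is None else str(value)
    return attributes
```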

d) platform/os/python info: OT has a predefined set of names for these, and they are separated from the span attributes into a resource object. I think this separation makes sense, as platform info is not directly related to specific span attributes. This is what it looks like:

"resource": {
    "telemetry.sdk.language": "python",
    "telemetry.sdk.name": "opentelemetry",
    "telemetry.sdk.version": "1.6.2",
    "os.architecture": [
        "64bit",
        ""
    ],
    "python.version": "3.9.7",
    "python.implementation": "CPython",
    "os.type": "Darwin",
    "os.version": "Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:24 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T8101",
    "platform": "macOS-12.0.1-arm64-arm-64bit",
    "service.version": "2.1.0b18",
    "service.name": "soda",
    "service.namespace": "soda-sql"
}

Would you agree that this separation and naming make sense? We can of course easily change it and push this info as our own span attributes.
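For reference, the resource shown above could be assembled like this with the Python SDK (a sketch: Resource.create and TracerProvider are real opentelemetry-sdk APIs; the values come from the stdlib platform module):

```python
import platform

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "soda",
    "service.namespace": "soda-sql",
    "service.version": "2.1.0b18",
    "os.type": platform.system(),
    "os.architecture": platform.architecture(),  # e.g. ("64bit", "")
    "python.version": platform.python_version(),
    "python.implementation": platform.python_implementation(),
    "platform": platform.platform(),
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```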

e) sql_metrics_count and historic_metrics_count - ok, skipping for now

All of the above is implemented in sodadata/soda-core#563; it needs review, some testing, and polishing.

I attached an example trace from a CLI analyze command on Postgres:

trace.txt

bastienboutonnet commented 2 years ago

Great questions @m1n0

Here's what I think about those:

a) One vs. multiple spans: for what we are trying to solve now, the level of granularity at which we are working does not require more than one span. However (@vijaykiran correct me if I'm wrong), if we want to track specific parts of the invocation more granularly (say, differentiate between the compute and queries we are making, the parts where we're pushing data to the backend, etc., i.e. move toward a more profiling-oriented monitoring solution), then spans will make sense. I agree that for now we don't need them; if we want to cover more granular needs later on, we will introduce them. It will be something we agree on, and there will be some burden, mainly on the transformation side, to deal properly with multiple spans; that burden will make sense if we want to track more granularly, and it is not impossible to solve at all. TL;DR: if we don't need it now, we don't build it that way, because it does not make sense, and we adjust later as needs come up.

b) I think the main business value we have identified here is in knowing what people use (i.e. how they use the tool, with what options, etc.), and I agree that there is a difference between the things you list. The distinction I think we should make is: what users do as a choice of usage vs. what they must do but that does not really reflect usage (your example of the warehouse file path is more the second case). Behaviourally we don't care where they point to, but we might care that they potentially override the defaults (if that is possible). If it always needs to be there, we will derive different insights, and indeed it's more of an analytical decision.

c) If this is a limitation we have to work with, so be it. Analytics will take care of dealing with it. The thing we need to be really good at in that case is well-prefixed and consistent names (as you have in your example), so that at the transformation stage we can parse things in a simple way.

d) I think it makes sense to me too.

e) I would say probably yes. I'll make sure to ping @maartenmasschelein and @mathissedestrooper to help us decide on the business value; neither I nor @vijaykiran was able to decide on this confidently. I'll get back to you as soon as I know, but I wouldn't let it block us. We can always add those in a follow-up PR, which should be trivial since our framework will be there, and it will be a good test of its ease of extensibility.

Looking at the trace you sent, I think this format is really workable. The only thing that jumps out now is that we are not yet using the final names, but we'll get that sorted in the PR, I would say.