Closed bastienboutonnet closed 2 years ago
Is this going to include the time a specific test/metric took to execute? I do not see it in the schema.
I would like to track this in some way, but it is hard to define what exactly is the time a test/metric takes because we combine multiple metrics into one query or sometimes split it over multiple queries. So what exactly to track I do not know (yet). But it would be nice to investigate if we can provide execution time details on a more granular level than invocation_end
- invocation_start
Great question @JCZuurmond ! @vijaykiran is still looking into the specifics of the implementation on this so he might have more certain or up-to-date information. As far as I know, the traces schema from Open Telemetry allows creating what they call "spans" which I believe we could attach to specific functions or "subprocesses" of the invocation/orchestrator function. My feeling is that this is indeed a very important thing for us to look into.
Depending on the complexity to implement this, however, we might have the first stab with just global timing and then refine this later.
it is currently not possible to have granular metric calculation time because we aggregate in a query, we can add spans at different levels to print for example how much time each query takes, but it will not give individual metric calculation. If we need that we need to change the execution significantly which might have performance impact in terms of more queries running than needed.
For tests it is certainly possible to have one span per test execution but mostly tests are python expressions so I don't think the span time/diff will be meaningful because we will be just measuring the time taken for "a + b > 4" sort of expressions
I agree with everything you said, @vijaykiran.
I think we can start with the execution time of the whole scan. Maybe add the time of the individual queries later. Even if we can not exactly determine the execution time of a single metric, I would still like to know how long the SQL takes vs the other parts.
Here's an in-depth update on the tracking requirements and their derivation: | attribute_name | derivation | remarks |
---|---|---|---|
user_cookie_id |
uuid generated once and stored in ~/.soda/config.yaml |
identifies a unique user across invocations | |
command_name |
captured from click's command or API method name | ||
command_options |
options (CLI flags) | not sure how this works for API usage, where can we get that from in that case | |
version |
version.py or other place where we store that in the codebase | ||
datasource_type |
name of the dialect/warehouse (think also about soda-spark) | ||
datasource_id |
hash of: - host (Postgres, Redshift, Spark) - account (Snowflake) - project (BigQuery) - and so on for any other adaptors | allows us to group by serveral users to a core data source in order to help defining "usage in production"/identifies a "team" | |
sql_metrics_count |
count of custom sql metrics defined | Still a question mark depending on whether we see a real business of technical need. | |
historic_metrics_count |
count of usage of the newly introduced historic_metrics |
Still a question mark depending on whether we see a real business of technical need. | |
architecture |
machine architecture (can be obtained from os or platform libs |
||
operating_system |
|||
python_version |
|||
python_implementation |
|||
invocation_start |
span start from the OT framework | Will help us derive how long sections of the code take to run. First stab into profiling | |
invocation_end |
span end from OT framework | see above |
One question still remains open (see note in the remark for invocation_id
.
My understanding is that an invocation
is a run of a command (i.e. the outermost level of anything below). However, unless I misunderstand how trace and span ids work, it would seem that hashing those would simply result in a surrogate key at the level of a span.
For me a span is a subset of an invocation. In that case would trace_id
be what we're looking for @vijaykiran ?
I think we can just use trace_id
to identify one "execution" of any soda command or a scan.execute. Each span in a trace will correspond to individual things e.g for a soda analyze warehouse.yml
trace_id :1
|--- span 1 (config file reading)
|---- span 2 (connection to the database)
|---- span 3 (query for tables)
|---- span 4 (execute the analyse query )
....
does this make sense ?
I think it does but to be extra sure does this represent the granularity well:
{
"trace": {"span"}, {"span"}, ...
}
if so then yes I think trace_id
is what we mean by invocation id so we can do away with that hash logic. If that is correct I'll go ahead and update the earlier message.
yes, trace can contain several (hierarchical) spans.
Sounds good I updated the table to reflect that change.
Couple of thoughts: a) I am not sure whether we need multiple spans at all. From my understanding OT has the concept of spans in order to accommodate for more complex scenarios, like for example:
In our case I do not see any added value in splitting different parts of soda-sql logic into spans. If we had some subsystems where a subsystem could fail while the whole process would not fail then it would make sense to distinguish these and have separate spans, with separate status for each etc, but this is not our case.
I would suggest to keep things simple, use one span per soda-sql invocation. This will make it much easier to understand the data as well. We can of course expand the complexity later if we want.
b) command_options
- I would suggest to distinguish between options and arguments
imho its easy to join these into one in analysis, but impossible to split automatically
c) OT does not support dict values, only lists, so I would suggest it would be better to break command_options
into separate span attributes with respective suffix, i.e.
"attributes": {
"cli_command_name": "analyze",
"command_argument_warehouse_file": "warehouse.yml",
"command_option_include": "",
"command_option_exclude": "test",
"command_option_limit": "",
"datasource_type": "postgres",
"datasource_hash": "connection hash"
},
d) platform/os/python info - OT has a predefined set of names and they are separated from span attributes into a resource
object. I think this separation does make sense, as platform info is not directly related to specific span arguments, this is what it looks like:
"resource": {
"telemetry.sdk.language": "python",
"telemetry.sdk.name": "opentelemetry",
"telemetry.sdk.version": "1.6.2",
"os.architecture": [
"64bit",
""
],
"python.version": "3.9.7",
"python.implementation": "CPython",
"os.type": "Darwin",
"os.version": "Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:24 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T8101",
"platform": "macOS-12.0.1-arm64-arm-64bit",
"service.version": "2.1.0b18",
"service.name": "soda",
"service.namespace": "soda-sql"
}
Would you agree that this separation and naming makes sense? We can easily change it of course and push this info as our own span arguments.
e) sql_metrics_count
and historic_metrics_count
- ok, skipping for now
All of the above is implemented in sodadata/soda-core#563 , it needs review, some testing and polishing.
I attached an example trace from a cli analyze command on postgres
Great questions @m1n0
Here's what I think about those: a) one vs multiple spans: I think that indeed for what we are trying to solve now, and the level of granularity at which we are working does not indeed require more than one span. However, @vijaykiran correct me if I'm wrong, if we want to track specific parts of the invocation more granularly (say differentiate between the compute and queries we are making, the parts where we're pushing data to the backend etc. --i.e. move into a more profiling oriented monitoring solution) then spans will make sense. I would agree that for now, we don't need it, and then later on if we want to cover more granular needs we will introduce them. It will be something we will agree on and there will be some "burden" mainly on the Transformation side that will be incurred to know how to properly deal with multiple spans etc. This burden will make sense if we want to track more granularly and is not impossible to solve at all. TLDR: I would say, if we don't need it now, we don't build in that way because it does not make sense and we adjust later as needs come up.
b) I think the main business value we have identified here is in knowing what people use (i.e. how they use the tool with what options etc.) and I agree that there is a difference between the things you list. The difference I think we should make are: what is stuff users do as a choice of usage vs. what they must do but that does not really reflect usage (your example of warehouse file path is more the second case. Behaviourally we don't care where they point to, but we might care that they potentially overide the defaults (if that is possible). If it always needs to be there then we will derive different insights and indeed it's more of an analytical decision.
c) if this is a limitation we have to work with so be it. Analytics will take care of dealing with it. The thing we need to be really good at in that case, is well prefixed and consistent names (as you have in your example) so that at Transformation stage we can parse things in a simple way.
d) I think it makes sense for me too.
e) I would say probably yes. I'll make sure to ping @maartenmasschelein and @mathissedestrooper to help us decide on the business value. Neither I nor @vijaykiran were able to confidently decide on this. I'll for sure get back to you on this as soon as I know but I wouldn't let it block us. We can always have a follow up PR for those which should be trivial to add since our framework will be there and a good test regarding its ease of extensibility.
Looking at the trace you sent I think this format is really workable. The only thing that jumps out now is that we are not yet using the final names but we'll get that sorted in the PR I would say.
Goal
In order to understand the usage, adoption, and performance of our tool better we should collect usage and performance statistics. What we want to achieve with those are the following goals:
Those statistics must be strictly anonymous, will not pertain to any private information, and will be limited to tracking events that emanate from the tool itself.
What do we consider private and will never track?
Target Release:
v2.2.0
Requirements
config.yaml
)Proposed Solution
Event Collection Framework
Two generally accepted frameworks come to mind:
Open Telemetry: built on open standards, easy to configure, suitable for tracing of performance (traces and spans) with high granularity.
Snowplow: open source option, used by a lot of organizations and reliable, little heavier setup, only made for behavioral events so tracing around performance would have to be shoehorned in their data models.
Event Schema & Attributes
Names and implementations are just ideas. Let's challenge those in this issue and we can update this once we're set.
user_cookie_id
: we could generate a UUID and store it in~/.soda/config.yaml
or in a different file in that config folder.command(_ran)
: we could get it as a trace name or just add it as an attribute in the attributes key, value object. We should be able to capture this fromclick
or API method name.command_options
: these should be flags used if we're talking about a CLI invocation. Can be captured fromclick
but we might want to find a more internal way of capturing it from method signatures of class attributes so that we are ready for API-driven usage in the future.version
: which version of soda_sql is this coming from: https://github.com/sodadata/soda-sql/blob/main/core/sodasql/__version__.py or whatever else makes sense.datasource_type
: other names could bedialect_id
but it lends itself less tosoda-spark
or other data source types so maybe datasource_type is fine.datasource_id
: given the importance of knowing how many projects and teams are using the tool, we should add adataserouce_id
which should be a hash of the connection parameter that best uniquely identifies a group of people working on the same project.~-
run_status
: the alternative name could beinvocation_status
and would allow us to know if ascan
,analyze
etc. executed successfully or not (exception types should be logged as well).~ ~-metrics_count
: number of metrics collected/derived in a scan.~ ~-metrics_invoked
: which metrics are people using. NOTE: we should not log custom metrics derivation code (i.e. the SQL query of a custom metric or any formula as this is considered proprietary information). When uses usecustom_sql
we should just know that they used a custom metric.~ ~-tests_count
: count of tests evaluated.~ ~-tests_invoked
: same reasoning as metrics list, if the tests are built-in and do not reveal any proprietary logic we should have a list, if not let's leave it out.~sql_metrics_count
: see comment aboutmetrics_count
. SQL metrics contain proprietary business logic so we absolutely do not want to track the SQL query itself but just the fact that a custom metric was used.❓historic_metrics_count
❓architecture
: we can use theplatform
library ofos
library to get at which architecture we're running onoperating_system
: see above.python_version
python_implementation
invocation_start
: span start/end in the OT frameworkinvocation_end
: same as above.