joshmoore opened 3 months ago
We do have an HTTP total (cumulative) latency metric: "/tensorstore/http/total_time_ms". I could add a few more to capture the latency of various kvstore drivers.
Several other drivers also include read_latency_ms / write_latency_ms, which are histograms and should give a good summary of read/write latency. I can make some of these more consistent, but not all kvstores will support all metrics.
You probably want `ts.experimental_collect_matching_metrics('/tensorstore/')`. It may not make a difference now, but we may add common Prometheus metrics at some point.
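For reference, a minimal sketch of how that call can be used to dump everything under the prefix. The exact shape of each returned entry (assumed here to be a dict with `name` and `values` keys) may vary between versions, so treat the field names as assumptions:

```python
import tensorstore as ts

# Collect every metric registered under the /tensorstore/ prefix.
# Counters carry a plain value; latency metrics carry histogram data
# (count, mean, buckets). Exact schema may differ by version.
for metric in ts.experimental_collect_matching_metrics('/tensorstore/'):
    print(metric['name'], metric['values'])
```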
> I could add a few more to capture latency of various kvstore drivers.
:100: on having the total wherever possible.
> but not all kvstores will support all metrics.
Understood, but a vote for total_time_ms to be as widespread as possible.
> You probably want `ts.experimental_collect_matching_metrics('/tensorstore/')`
:+1:
With https://github.com/google/tensorstore/commit/cc9fb6641b2b661d3569e1b16004af1c024251ad I made the kvstore metrics a bit more consistent.
I think that the latency metrics are pretty close to what you want:
/tensorstore/kvstore/driver/read_latency_ms
/tensorstore/kvstore/driver/write_latency_ms
They will report buckets as well as aggregate latency values.
Thanks, @laramiel! I've updated and now see:
/tensorstore/kvstore/file/read_latency_ms [{'0': 0, '1': 1, '2': 0, '3': 0, '4': 1, 'count': 2, 'mean': 2.5, 'sum_of_squared_deviation': 12.5}]
I might be misinterpreting, but what I would like to achieve is something of the form:
read: ~152.78ms @ 3.60GB/s
I can get the delta in read size from the total (by tracking the previous value) but with the aggregates for time, I can't, right?
The total time is count * mean. Here it should be 5. Then you need to look at the /tensorstore/kvstore/driver/bytes_read metric.
So, for example, if I have this:
/tensorstore/kvstore/file/batch_read=71804
/tensorstore/kvstore/file/bytes_read=301163610386
/tensorstore/kvstore/file/open_read=150897
/tensorstore/kvstore/file/read=150897
/tensorstore/kvstore/file/read_latency_ms={count=71804 mean=7.40619 buckets=[0, 50, 1967, 13997, 30356, 21320, 3724, 214, 82, 43, 42, 9]}
Then my total read time is 71804 * 7.40619 / 1000 = 531.79 seconds
And my average throughput is 301163610386 / (1024 * 1024) / 531.79 ≈ 540.09 MB/s
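In Python, that bookkeeping might look roughly like the sketch below. It is an illustration only: the metric names match the ones above, but the helper name and the assumption that counter entries expose a 'value' key while histogram entries expose 'count' and 'mean' keys are mine, not something guaranteed by the API.

```python
import tensorstore as ts

def file_kvstore_read_stats(prefix='/tensorstore/kvstore/file'):
    """Estimate total read time and average throughput from cumulative metrics."""
    # Index collected metrics by name; take the first value entry of each.
    collected = {m['name']: m['values'][0]
                 for m in ts.experimental_collect_matching_metrics(prefix)
                 if m.get('values')}

    latency = collected[prefix + '/read_latency_ms']        # histogram: count, mean, buckets
    bytes_read = collected[prefix + '/bytes_read']['value']  # cumulative byte counter (assumed key)

    total_seconds = latency['count'] * latency['mean'] / 1000.0
    if total_seconds == 0:
        return 0.0, 0.0
    throughput_mb_s = bytes_read / (1024 * 1024) / total_seconds
    return total_seconds, throughput_mb_s

# With the numbers quoted above this would give roughly (531.79, 540.09).
seconds, mb_s = file_kvstore_read_stats()
print(f'read: ~{seconds:.2f}s @ {mb_s:.2f} MB/s')
```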
But also, the latency value may not be quite what we want; the start time is recorded somewhat before the OS `::read` call, so it includes some queue latency and some other delays.
> The total time is count * mean. Here it should be 5.
:+1: Understood and thanks.
> But also, the latency value may not be quite what we want; the start time is recorded somewhat before the OS `::read` call, so it includes some queue latency and some other delays.
Ah, this is good to know. I'm not overly concerned, but if there's ever a different property that I need to test, just let me know. :+1:
In working on converting OME-Zarr filesets from Zarr v2 to Zarr v3, I've been using zarr-python for the group and metadata migration and tensorstore for the large array conversion. The metrics API has been a welcome tool. Kudos, and :+1: for it becoming non-experimental.
I have a few suggestions, though, ignoring how practical they may be to implement 🙂

- a `Store` implementation that returns the `bytes_{read,written}` etc. metrics with a generic prefix

You can find some of these ideas loosely implemented in the simple `TSMetrics` class below.

Example usage: