xemul opened 1 year ago
What scylladb versions is this applicable to? OSS and Enterprise?
The metrics in question partially mimic those for IO classes:
- scylla_s3_total_read_requests is the same as scylla_io_queue_total_read_ops
- scylla_s3_total_write_requests is the same as scylla_io_queue_total_write_ops
- scylla_s3_total_read_bytes is the same as scylla_io_queue_total_read_bytes
- scylla_s3_total_write_bytes is the same as scylla_io_queue_total_write_bytes
Partially extends them
- scylla_s3_total_read_latency_sec is the same as scylla_io_queue_total_delay_sec, but only for reads
- scylla_s3_total_write_latency_sec is the same as scylla_io_queue_total_delay_sec, but only for writes
(the above latencies should be divided by the corresponding *_requests to show per-request latency, as in #1714 https://github.com/scylladb/scylla-monitoring/issues/1714)
And partially introduces its own
- scylla_s3_nr_connections gauge shows the total number of established connections
- scylla_s3_nr_active_connections gauge shows the total number of established connections that are currently serving an HTTP request. Respectively, the nr_connections - nr_active_connections value shows the number of idle connections in the pool
- scylla_s3_total_new_connections counter shows the number of newly established connections. The faster it grows, the worse: it means that old connections are being dropped for some reason and the client opens new ones, which is costly
All metrics are per-{scheduling-class, target-endpoint} pair. In most cases there will be just one "endpoint" label value.
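To illustrate, the derived quantities mentioned above could be expressed in PromQL along these lines (a sketch using the metric names above; the 5m window and the absence of label selectors are arbitrary choices, not from this thread):

```promql
# Average per-request read latency over the last 5 minutes
rate(scylla_s3_total_read_latency_sec[5m]) / rate(scylla_s3_total_read_requests[5m])

# Same for writes
rate(scylla_s3_total_write_latency_sec[5m]) / rate(scylla_s3_total_write_requests[5m])

# Idle connections currently sitting in the pool
scylla_s3_nr_connections - scylla_s3_nr_active_connections
```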
It's going to be in 5.4 (under --experimental option)
And what enterprise version?
I think none, even 5.4 is not released yet
@xemul it must be part of a future enterprise release
Yes, then it's likely going to be the 2024.1 version?
Yes.
@xemul how can I test this?
@amnonh , here's the doc on how to set up Scylla for that: https://github.com/scylladb/scylladb/blob/master/docs/dev/object_storage.md
You'll also need to either create a bucket on AWS S3, or start a minio server
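For the minio route, a throwaway local server can be brought up roughly like this (the credentials, port, data directory, and bucket name are placeholders of my choosing, not taken from the Scylla docs):

```shell
# Start a local minio server storing data under ./minio-data
export MINIO_ROOT_USER=minioadmin
export MINIO_ROOT_PASSWORD=minioadmin
minio server ./minio-data --address :9000 &

# Create a bucket for Scylla's sstables, pointing the AWS CLI at minio
aws --endpoint-url http://127.0.0.1:9000 s3 mb s3://scylla-sstables
```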
@xemul I tried to test it with ScyllaDB 5.4.0-rc1 running in Docker, with minio running in a container.
No metrics were created, so I tried to run nodetool flush
and got:
INFO 2023-11-08 11:03:27,301 [shard 0:main] commitlog_replayer - Replaying /var/lib/scylla/commitlog/CommitLog-2-153046609.log, /var/lib/scylla/commitlog/CommitLog-2-153046608.log, /var/lib/scylla/commitlog/CommitLog-2-153041438.log, /var/lib/scylla/commitlog/Recycled-CommitLog-2-152480867.log, /var/lib/scylla/commitlog/CommitLog-2-153051779.log, /var/lib/scylla/commitlog/CommitLog-2-152480869.log, /var/lib/scylla/commitlog/CommitLog-2-152480863.log, /var/lib/scylla/commitlog/Recycled-CommitLog-2-152480868.log, /var/lib/scylla/commitlog/Recycled-CommitLog-2-152480866.log, /var/lib/scylla/commitlog/CommitLog-2-153051780.log, /var/lib/scylla/commitlog/CommitLog-2-152480864.log, /var/lib/scylla/commitlog/CommitLog-2-153041439.log
INFO 2023-11-08 11:03:27,356 [shard 0:main] commitlog_replayer - Log replay complete, 23778 replayed mutations (0 invalid, 0 skipped)
INFO 2023-11-08 11:03:27,356 [shard 0:main] init - replaying commit log - flushing memtables
ERROR 2023-11-08 11:03:27,357 [shard 0:main] table - failed to write sstable /var/lib/scylla/data/testing/prepared-30dfb1007e2511eeab867459b658559d/me-3gaw_0upr_24awx2dwafnblnqrel-big-Data.db: seastar::httpd::unexpected_status_error (Unexpected reply status)
ERROR 2023-11-08 11:03:27,357 [shard 0:main] table - Memtable flush failed due to: seastar::httpd::unexpected_status_error (Unexpected reply status). Aborting, at 0x5f987ae 0x5f98d70 0x5f99048 0x1c10c16 0x138d51a 0x5a9bb0f 0x5a9cde7 0x5a9c159 0x5a3ea37 0x5a3dbec 0x1310e4e 0x13128b0 0x130f3bc /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x130cde4
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0>(replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0>(replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0>(replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0&&)::{lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::handle_exception<replica::dirty_memory_manager::flush_one(replica::memtable_list&, replica::flush_permit&&)::$_0>(auto:1&&)::{lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<replica::memtable_list::flush()::$_0::operator()<replica::flush_permit>(replica::flush_permit) const::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<replica::memtable_list::flush()::$_0::operator()<replica::flush_permit>(replica::flush_permit) const::{lambda()#1}, false> >(seastar::future<void>::finally_body<replica::memtable_list::flush()::$_0::operator()<replica::flush_permit>(replica::flush_permit) const::{lambda()#1}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<replica::memtable_list::flush()::$_0::operator()<replica::flush_permit>(auto:1) const::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<void>&&)#1}, seastar::future<void>::then_wrapped_nrvo<void, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<void>&&)#1}>(seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<void>&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<void>&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
Aborting on shard 0.
Backtrace:
0x5a8a258
0x5ac0242
/opt/scylladb/libreloc/libc.so.6+0x3dbaf
/opt/scylladb/libreloc/libc.so.6+0x8e883
/opt/scylladb/libreloc/libc.so.6+0x3dafd
/opt/scylladb/libreloc/libc.so.6+0x2687e
0x1c10c49
0x138d51a
0x5a9bb0f
0x5a9cde7
0x5a9c159
0x5a3ea37
0x5a3dbec
0x1310e4e
0x13128b0
0x130f3bc
/opt/scylladb/libreloc/libc.so.6+0x27b89
/opt/scylladb/libreloc/libc.so.6+0x27c4a
0x130cde4
2023-11-08 11:03:30,441 INFO exited: scylla (terminated by SIGABRT (core dumped); not expected)
INFO 2023-11-08 11:03:27,301 [shard 0:main] commitlog_replayer - Replaying ...
It's a message from early boot, AFAIK, not from the flush. Do you have the full log?
Other than that, this typically happens when minio or object_storage.yaml is misconfigured. You can get a more detailed error code in the logs with the --logger-log-level http=debug option (until scylladb/seastar#1931)
I'm postponing it until we have a stable version with step-by-step instructions for testing it, or even better, until it is covered by QA tests
@xemul I would be happy to bring it back for monitoring 4.6, but there should be a better way to test it
@xemul ping
@amnonh , the issue you stepped on here is not something we fixed explicitly. Could you provide more details on what the problem was?
Also
... when it will be covered by QA tests
there's a unit test that starts Scylla and populates it with S3-backed data (pytest test/object_store/test_basic.py); would that work for you?
@xemul I just need a way to see those metrics in action; if you can add step-by-step instructions on how to get those metrics, that's good enough
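For eyeballing the metrics once a node is up, here is a minimal sketch (the helper names are mine, and port 9180 is assumed to be the node's default Prometheus endpoint) that filters the scylla_s3_* samples out of a node's /metrics output:

```python
# Hypothetical helper: fetch a Scylla node's Prometheus endpoint and keep
# only the scylla_s3_* samples. Names here are illustrative, not from Scylla.
import urllib.request

def s3_metric_lines(exposition_text):
    """Return only the scylla_s3_* sample lines from Prometheus text output."""
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith("scylla_s3_")
    ]

def dump_s3_metrics(url="http://localhost:9180/metrics"):
    """Scrape a node's metrics endpoint (9180 assumed) and filter the samples."""
    with urllib.request.urlopen(url) as resp:
        return s3_metric_lines(resp.read().decode())

# Example against a live node (not run here):
#   for line in dump_s3_metrics():
#       print(line)
```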
@amnonh , https://github.com/scylladb/scylladb/blob/master/docs/dev/object_storage.md is the instructions. Me, @tchaikov, and the object_store test all follow them :) I know that you tried them too and it ended up with an unexpected reply, and I'm ready to help debug it. The --logger-log-level http=debug option and the full logs (as I wrote, the commitlog replaying message is not from the flush) are the place to start
@xemul I'm referring to this: terminated by SIGABRT (core dumped)
when I'm trying to write to minio
I can try using an AWS bucket; if there is a test QA is already using and I can get metric samples from there, that's fine.
If possible, I prefer not to do the testing myself, but to use working code
That's still about misconfiguration. In recent versions of Scylla, instead of SIGABRT you'll get a message about the inability to flush and a scheduled attempt to write later (scylladb/scylladb#13745), but I don't think that will help much
If possible, I prefer not to do the testing myself, but to use working code
It should work when properly configured, and there are tests that prove it does. If you can show me a cluster that's not working, I'd be happy to check it and fix the configuration for you
@xemul is there a QA test we can run?
@xemul I would like it to be part of 4.8. Is there a QA test I can use to run and get some metrics examples?
@amnonh , nothing has changed since last time. There's a unit test in the tree, but nothing more than that
@xemul @mykaul, those are 11 new metrics; who will use them, and how? The dashboards are already too crowded. Do we need all those metrics? Is there an actionable outcome from looking at them? Is it meaningful for the user? If so, how?
I have no doubt they will be immensely useful when S3 storage is used. If it's a per-keyspace/table property, perhaps it should be there... ? It could explain slowness, for example.
@mykaul All 11 panels? If it's helpful, how? what should the user look for, and what should they do given the findings?
In the initial release(s), I don't know which metrics users will find useful.
That's my point: how will the user know what to do with them?
There are IO-queue metrics on the Advanced dashboard that we use to see how IO behaves. The S3 metrics have exactly the same meaning: they show details about IO, but this time against a remote S3 bucket rather than local disks
I have tried to use S3 storage again, this time with a bucket and fresh credentials I created. Using the latest Scylla running inside a container, I get the following:
ERROR 2024-07-03 15:03:15,895 [shard 0:main] table - failed to write sstable /var/lib/scylla/data/ks/monkeyspecies-3ac7c390394d11ef9e39463f9ae5bc86/md-3ghi_15tf_2sy9c2emczln8c65c6-big-Data.db: storage_io_error (S3 error (seastar::tls::verification_error (The certificate is NOT trusted. The certificate issuer is unknown. (Issuer=[C=US,O=Amazon,CN=Amazon RSA 2048 M01], Subject=[CN=*.s3.us-east-2.amazonaws.com]))))
ERROR 2024-07-03 15:03:15,895 [shard 0:strm] table - Memtable flush failed due to: storage_io_error (S3 error (seastar::tls::verification_error (The certificate is NOT trusted. The certificate issuer is unknown. (Issuer=[C=US,O=Amazon,CN=Amazon RSA 2048 M01], Subject=[CN=*.s3.us-east-2.amazonaws.com])))). Aborting, at 0x647476e 0x6474d80 0x6475068 0x1d08201 0x144faea 0x5f6d41f 0x5f6e707 0x5f6da68 0x5efba17 0x5efabdc 0x13dfca8 0x13e16f0 0x13de279 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13db664
--
I suggest we do the following: run any kind of test that works for you (I just want the metrics), copy the Prometheus directory, and send it to me.
@xemul please see if you can help
It looks like you've stepped on https://github.com/scylladb/scylladb/issues/13904
@xemul can you make a simple test (any test is good) work? just tar the Prometheus data and I'll take it from there
By "simple test" you likely mean some test with the scylla-monitoring container running alongside?
@xemul I mean any test that would use the S3 metrics; I don't care what the test is doing, I just need the metrics. Just collect the Prometheus directory and send it to me
@xemul I'm going to branch 4.8 this week. I'd rather this issue be part of it, but no metrics, no issue, and it will wait for a future release when I have the metrics.
I don't quite understand which "Prometheus directory" you are referring to. You asked for "any test that would use the S3 metrics", and that is what the object_store unit test in the Scylla tree does, from my perspective: it starts a minio server and puts sstables there, so Scylla unavoidably generates S3 metrics if asked to
I gave up on running it myself, so I'm passing the running of the test to you. If you start the monitoring stack with the -d command line flag, it stores the Prometheus data in an external directory; once done, tar/gz the result and send it to me.
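A sketch of that flow (the paths are placeholders; -d is the scylla-monitoring start-all.sh flag mentioned above, so check its usage output before relying on the exact form):

```shell
# Start the monitoring stack, keeping Prometheus data in an external directory
./start-all.sh -d /path/to/prometheus_data

# After the test run, archive the data directory to share it
tar czf prometheus_data.tar.gz -C /path/to prometheus_data
```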
@kreuzerkrieg, it is now yours.