palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

Better ability to identify causes of immutableTs failing to increase. #2925

Open dxiao opened 6 years ago

dxiao commented 6 years ago

We've had internal difficulties with the immutableTs stalling and not increasing for long periods of time (see PDS-62324 for the latest incarnation of this).

It would be incredibly helpful if there were active tools we could use to identify the request/transaction, or even just the client, that is keeping the immutableTs from increasing, and how long it has kept it there.

A metric for how long it's been since the immutableTs changed is nice, but it doesn't help with the next step of figuring out what caused the stall. Bouncing clients is also a band-aid solution that doesn't help find and address the original cause.
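
To make the ask concrete, here is a rough sketch of the kind of bookkeeping that would answer "who is holding the immutableTs, and for how long". It is not AtlasDB's API: `ImmutableTsTracker`, `Holder`, `register`, `release`, and `oldest_holder` are hypothetical names, and it makes the simplifying assumption that the immutableTs is pinned by the lowest start timestamp among in-flight transactions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Holder:
    """Who is pinning the immutable timestamp, and for how long (hypothetical type)."""
    client_id: str
    start_ts: int
    held_seconds: float


class ImmutableTsTracker:
    """Hypothetical tracker that remembers which client opened each in-flight transaction."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # start timestamp -> (client id, clock reading when the transaction was registered)
        self._open: Dict[int, Tuple[str, float]] = {}

    def register(self, client_id: str, start_ts: int) -> None:
        """Record that client_id started a transaction at start_ts."""
        self._open[start_ts] = (client_id, self._clock())

    def release(self, start_ts: int) -> None:
        """Record that the transaction at start_ts committed or aborted."""
        self._open.pop(start_ts, None)

    def oldest_holder(self) -> Optional[Holder]:
        """Return the in-flight transaction with the lowest start timestamp, if any."""
        if not self._open:
            return None
        start_ts = min(self._open)
        client_id, registered_at = self._open[start_ts]
        return Holder(client_id, start_ts, self._clock() - registered_at)
```

With something like this exposed through an endpoint or a log line, a stalled immutableTs could be traced to a specific client and start timestamp instead of bouncing clients until it moves.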

hsaraogi commented 6 years ago

We will add an alert on the metric for immutable timestamp tracking. We should have appropriate logging to debug the issue.
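
A minimal sketch of what the alert condition could look like, assuming the metric is sampled periodically. `StallDetector` and its methods are hypothetical names for illustration, not the actual AtlasDB metric or alerting code, and the threshold is left to the caller.

```python
import time
from typing import Callable, Optional


class StallDetector:
    """Hypothetical check: flags when the immutable timestamp has not increased recently."""

    def __init__(self, threshold_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = threshold_seconds
        self._clock = clock
        self._last_ts: Optional[int] = None
        self._last_increase_at: Optional[float] = None

    def observe(self, immutable_ts: int) -> None:
        """Feed in a sampled immutable timestamp; only a strict increase resets the timer."""
        if self._last_ts is None or immutable_ts > self._last_ts:
            self._last_ts = immutable_ts
            self._last_increase_at = self._clock()

    def seconds_since_increase(self) -> float:
        """Time since the last observed increase (0.0 before the first sample)."""
        if self._last_increase_at is None:
            return 0.0
        return self._clock() - self._last_increase_at

    def is_stalled(self) -> bool:
        """True once the timestamp has gone threshold_seconds without increasing."""
        return self.seconds_since_increase() >= self._threshold
```

Logging the stalled timestamp value whenever `is_stalled()` first turns true would give a starting point for searching the logs of whichever client holds it.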

gmaretic commented 6 years ago

Seeing this again, causing targeted sweep to fall behind.