palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

Tombstone Tracing #7125

Closed jeremyk-91 closed 1 month ago

jeremyk-91 commented 1 month ago

General

Before this PR: We don't really have telemetry for when data is deleted.

After this PR:

==COMMIT_MSG== On tables the user finds interesting, we can log when we attempt to apply deletions. ==COMMIT_MSG==
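As a rough illustration of the behaviour described in the commit message, here is a minimal sketch, assuming a user-configured set of table names and a plain list standing in for the logger. The class `TombstoneTracer`, the method `onDeleteAttempt`, and the log line format are all hypothetical; they are not taken from this PR's diff or from the AtlasDB codebase.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: names and log format are illustrative assumptions,
// not the implementation in this PR.
final class TombstoneTracer {
    private final Set<String> tracedTables; // tables the user has configured as interesting
    private final List<String> sink;        // stand-in for a real logger

    TombstoneTracer(Set<String> tracedTables, List<String> sink) {
        this.tracedTables = Set.copyOf(tracedTables);
        this.sink = sink;
    }

    /**
     * Records one line per attempted delete, but only for traced tables.
     * Returns true if a line was recorded.
     */
    boolean onDeleteAttempt(String table, int cellCount, long timestamp) {
        if (!tracedTables.contains(table)) {
            return false; // untraced tables pay only the cost of a set lookup
        }
        sink.add("Attempting delete: table=" + table
                + " cells=" + cellCount
                + " timestamp=" + timestamp);
        return true;
    }
}
```

Gating on the configured set before building the message keeps the overhead on untraced tables small, which is in line with the note below about restricting this to a small selection of tables.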

Priority: High P2; relevant to some other high-priority investigations.

Concerns / possible downsides (what feedback would you like?):

Is documentation needed?: No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?: No

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?: No

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.): Yes

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?: No

Does this PR need a schema migration? No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: Configs are safe. I think that's an assumption we always hold.

What was existing testing like? What have you done to improve it?: I haven't 🤮

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: N/A

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: N/A

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): Logs are produced correctly

Has the safety of all log arguments been decided correctly?: I believe so but please verify

Will this change significantly affect our spending on metrics or logs?: Probably not, it's restricted to a small selection of user-configured tables

How would I tell that this PR does not work in production? (monitors, etc.): Logs aren't produced correctly.

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?: Rollback. If something turned out to be leaked, follow the usual process.

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.: No

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?: No

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?: Just don't put this on tables with high throughput

Development Process

Where should we start reviewing?: Small

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju