palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0
44 stars 7 forks

Oracle Shrink Timeouts on medium sized tables. #2461

Open hsaraogi opened 6 years ago

hsaraogi commented 6 years ago

As per PDS-56588, the biggest issue right now is that a SHRINK taking longer than 5 minutes is causing a socket timeout. This makes sweep slower and does not recover DB space.

We can consider the following options:

  1. Make the shrink and compact asynchronous and retry when it fails. (Retrying only works if the operation is not rolled back.)
  2. Increase the timeouts for SHRINK operations. This might use a lot of undo space.
  3. Atlas maintains a queue of tables requiring compaction and runs SHRINKs in parallel to free up space faster. This might also use a lot of undo space, so it is probably not an option at all.
hsaraogi commented 6 years ago

I was testing this and found the following:

Oracle SHRINK times out on the client side but keeps running on the server.

ALTER TABLE A_NS__PT_KVS_TEST_0000 SHRINK SPACE COMPACT;
=> socket read timeout on the client side
=> operation still running on the db side
Error code: ORA-10634
Description: Segment is already being shrunk

This risks running multiple SHRINKs at the same time, which can require a huge amount of undo space and stall any hot paths that also need UNDO space.

Plan:

Have a single-threaded executor that does the following:

run Oracle shrink on the just-swept table, say sweptTable
while (true) {
    try {
        run Oracle shrink on sweptTable
    } catch (ORA-10634) {
        // an earlier shrink is still running on the server, so try again
        continue;
    }
    break;
}
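
For illustration, here is a rough Java sketch of the plan above. It is not AtlasDB code: `ShrinkRetrySketch`, `ShrinkStatement`, and `retryDelayMillis` are hypothetical names, and the pause between retries is an addition the pseudocode does not have. It assumes the JDBC driver reports ORA-10634 through `SQLException.getErrorCode()`. Anything other than ORA-10634, including the initial socket timeout, is simply propagated here and would need its own handling.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShrinkRetrySketch {
    /** ORA-10634: "Segment is already being shrunk". */
    static final int ORA_SEGMENT_ALREADY_BEING_SHRUNK = 10634;

    /** Hypothetical hook that issues ALTER TABLE ... SHRINK SPACE for one table. */
    interface ShrinkStatement {
        void run(String tableName) throws SQLException;
    }

    // A single thread, so this client never issues two SHRINKs at once.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final ShrinkStatement shrink;
    private final long retryDelayMillis;

    ShrinkRetrySketch(ShrinkStatement shrink, long retryDelayMillis) {
        this.shrink = shrink;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** Queues a shrink; the future yields the number of attempts it took. */
    Future<Integer> submit(String sweptTable) {
        Callable<Integer> task = () -> shrinkUntilNotBusy(sweptTable);
        return executor.submit(task);
    }

    private int shrinkUntilNotBusy(String sweptTable)
            throws SQLException, InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                shrink.run(sweptTable);
                return attempts;
            } catch (SQLException e) {
                if (e.getErrorCode() != ORA_SEGMENT_ALREADY_BEING_SHRUNK) {
                    throw e; // other failures are not retried in this sketch
                }
                // An earlier shrink is still running server-side; wait, then retry.
                Thread.sleep(retryDelayMillis);
            }
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Because every shrink goes through the one executor thread, tables queue up behind each other rather than competing for undo space.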