revelc / pyaccumulo

Python Client Library for Apache Accumulo
Apache License 2.0

Expose an interface to efficiently delete rows in a table #5

Closed bauer1j closed 11 years ago

bauer1j commented 11 years ago

I was told that there are three ways to delete a row that work out of the box, plus a custom option:

  1. Delete every entry in the row, one at a time. This requires scanning the row, turning each result into a tombstone, and writing those tombstones back to Accumulo. Cleanup happens in the compaction phase.
  2. Use the RowDeletingIterator. This supports using a single tombstone for the whole row, so you don't have to scan the entries first. Cleanup still happens in the compaction phase.
  3. Use range deletion to drop the row. This option works well if you are effectively going to drop everything in an underlying RFile, but can be costly if you're trying to use it as a scalpel. It involves splitting tablets, compacting, and merging tablets.
  4. Write your own iterator that figures out whether parts of a row or whole rows are no longer valid, probably using a technique similar to option 2. This can be good if deleting a row is the side effect of some row-local computation (abstractly speaking).
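
To make the difference between options 1 and 2 concrete, here is a minimal pure-Python sketch. It does not talk to Accumulo or use pyaccumulo; `Entry`, `per_entry_tombstones`, `row_tombstone`, and `visible_entries` are hypothetical names made up for illustration. The only detail taken from Accumulo itself is the marker convention of the RowDeletingIterator (empty column family, empty column qualifier, value `DEL_ROW`). Timestamps and column visibility are ignored to keep the model small.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for an Accumulo key/value pair.
Entry = namedtuple("Entry", "row cf cq value")

# Marker value the RowDeletingIterator looks for in an entry with
# empty column family and empty column qualifier.
DEL_ROW = "DEL_ROW"


def per_entry_tombstones(scanned_entries):
    """Option 1: one delete marker per entry found by scanning the row."""
    return [(e.row, e.cf, e.cq) for e in scanned_entries]


def row_tombstone(row):
    """Option 2: a single marker for the whole row, no scan needed."""
    return Entry(row, "", "", DEL_ROW)


def is_row_marker(entry):
    return entry.cf == "" and entry.cq == "" and entry.value == DEL_ROW


def visible_entries(entries):
    """Simplified model of a scan with a row-deleting filter in place:
    every row that carries a marker is hidden, marker included."""
    deleted_rows = {e.row for e in entries if is_row_marker(e)}
    return [e for e in entries if e.row not in deleted_rows]


table = [
    Entry("r1", "f", "a", "1"),
    Entry("r1", "f", "b", "2"),
    Entry("r2", "f", "a", "3"),
]

# Option 1 needs a scan of r1 first, then one tombstone per entry.
option1_writes = per_entry_tombstones([e for e in table if e.row == "r1"])

# Option 2 writes exactly one marker, regardless of row size.
table.append(row_tombstone("r1"))
remaining = visible_entries(table)
```

In this model, option 1 produces as many writes as the row has entries, while option 2 always produces one. In both cases the underlying data would still be physically removed only at compaction time.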

I was hoping to use option 2 above with pyaccumulo, as my current use case involves deleting entire rows.

johnrfrank commented 11 years ago

Resolved by https://github.com/accumulo/pyaccumulo/commit/64525939080ee8eeea62e03609e4a971ece9e0c6