scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com

RFE - Offline SSTable Filtering #13076

Open fee-mendes opened 1 year ago

fee-mendes commented 1 year ago

HEAD: https://github.com/scylladb/scylladb/commit/4b71f8759464b1ac4ad9da589153f0909d1ece9a

As recently seen in the field, we encountered a situation where a cluster had a notably huge partition (>1TB) which started causing problems during background activities such as repair, cleanup, and regular compaction. Such huge partitions are highly discouraged and an anti-pattern. However, Scylla currently has no protection (read: guard-rails) mechanism to prevent this situation from arising. The database depends heavily on the user getting their data modeling correct in order to avoid imbalances within a single shard, and failure to do so may end up affecting the performance of the entire cluster and putting its availability at risk.

The correct and definitive course of action for evicting a large partition is to simply delete it with CL=ALL and then run a major compaction in order to replace it with a tombstone. However, there may be situations where one cannot readily evict the data (for example, there may be a need to save it in order to review the data modeling and re-ingest it later), or a scenario where the natural endpoints for the partition in question have changed after a topology change. For the latter case, as @raphaelsc pointed out, one could simply copy the aforementioned tombstone to the previous natural endpoints and proceed with a major compaction. However, at this point we have to agree that the workaround gets overly complicated from an end-user perspective.
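For the straightforward case, the deletion-plus-major route can be scripted with stock tools today. A minimal sketch, with `ks`, `tbl`, and `'giant-key'` as placeholder names:

```bash
# Delete the oversized partition at CL=ALL so every replica records the
# tombstone (CONSISTENCY is a cqlsh shell command, not CQL).
cqlsh -e "CONSISTENCY ALL; DELETE FROM ks.tbl WHERE pk = 'giant-key';"

# Major-compact the table so the shadowed data is rewritten away and
# only the tombstone remains.
nodetool compact ks tbl
```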

Further, as seen in this specific situation in the field, the partition grew so huge that there was an immediate need to scale out the cluster in order to avoid a major outage due to nodes running out of disk space. As a result, one may find oneself in an emergency where one cannot scale out and clean up as fast as the current ingestion grows, or where running a major compaction (with its known space amplification) may simply be too risky.

This RFE covers the need for an offline tool capable of acting as a "hammer" when one hits a dead end (or is racing against time) and there is a need to filter out specific partitions from specific SSTables. The process would be similar to what was described by @asias (sketched in shell after the list):

  1. Disable compaction temporarily
  2. Copy all sstables containing the giant partition in question to another node
  3. Use an offline tool to manipulate the copied sstables so that all partitions except the giant partition are written into new sstables
  4. Use load and stream to inject the new sstables from step 3 back into the cluster
  5. Delete the original sstables (those copied in step 2) from the nodes in the cluster
  6. Enable compaction
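A rough shell sketch of those steps. The `nodetool` commands exist today; the `scylla sstable filter` invocation and its flags are hypothetical, standing in for the offline tool this RFE asks for:

```bash
# 1. Temporarily disable automatic compaction on the affected table.
nodetool disableautocompaction ks tbl

# 2. Copy the sstables containing the giant partition to a spare node
#    (paths illustrative; repeat on every node owning the partition).
scp /var/lib/scylla/data/ks/tbl-*/me-*-big-* spare:/tmp/filter-in/

# 3. HYPOTHETICAL: the offline tool writes all partitions except the
#    giant one into new sstables under a separate output directory.
scylla sstable filter --deny-list-file giant.txt \
    --output-dir /tmp/filter-out/ /tmp/filter-in/me-*-big-Data.db

# 4. On the spare node, place the filtered sstables in the table's
#    upload directory and inject them with load-and-stream.
cp /tmp/filter-out/* /var/lib/scylla/data/ks/tbl-*/upload/
nodetool refresh --load-and-stream ks tbl

# 5. Delete the original sstables from the cluster nodes they were
#    copied off in step 2, then:
# 6. Re-enable automatic compaction.
nodetool enableautocompaction ks tbl
```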

cc @denesb

denesb commented 1 year ago

I can do something from the very simple all the way to the very advanced:

  1. Filter based on an allow/deny list; partitions are supplied either on the command line or via a file.
  2. Static split: sort partitions into buckets, according to a supplied JSON/YAML partition -> bucket mapping.
  3. Dynamic split: sort partitions into buckets, according to the output of a Lua script method which is called for each partition and returns a bucket id (a simple integer).
  4. Dynamic split on the mutation fragment (row) level, something like the TWCS and scrub interposers do, also based on a Lua script, as above.
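From the user's side, option (1) might look like the following sketch; the `filter` operation, its flags, and the key encoding are all hypothetical at this point:

```bash
# One partition key per line; hex-encoded keys are assumed here so that
# composite partition keys can be expressed unambiguously.
cat > partitions.txt <<'EOF'
0x00046b657931
0x00046b657932
EOF

# Deny-list mode: write every partition NOT listed in the file into new
# sstables under ./filtered/, leaving the input sstable untouched.
scylla sstable filter --deny-list-file partitions.txt \
    --output-dir ./filtered/ me-1-big-Data.db
```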

I think (2) should be enough, but possibly even just (1) would be good for a start. What do you think?

fee-mendes commented 1 year ago

> I think (2) should be enough, but possibly even just (1) would be good for a start. What do you think?

If the question is directed to me: (1) would be more than enough already.

denesb commented 1 year ago

> > I think (2) should be enough, but possibly even just (1) would be good for a start. What do you think?

> If the question is directed to me: (1) would be more than enough already.

That should be really simple to do.

fee-mendes commented 1 year ago

It just occurred to me (although this may be kind of obvious): when I initially opened this issue I wanted a way to filter out a given partition from an SSTable set. However, we may also want to filter in only a specific partition from a set. Consider a different scenario: we need to request access to the data of one specific partition only, we are not interested in the remaining partitions, and we want to avoid the concern of users sharing additional data we don't necessarily need as part of an investigation.

raphaelsc commented 1 year ago

If we proceed with this tool, it would be important not to replace the input SSTables, but rather to write the output SSTables to a designated location, so it is safe by default. The user can then decide to back up the original files in case anything goes wrong, such as accidentally discarding more partitions than needed.

denesb commented 1 year ago

> It just occurred to me (although this may be kind of obvious): when I initially opened this issue I wanted a way to filter out a given partition from an SSTable set. However, we may also want to filter in only a specific partition from a set. Consider a different scenario: we need to request access to the data of one specific partition only, we are not interested in the remaining partitions, and we want to avoid the concern of users sharing additional data we don't necessarily need as part of an investigation.

This was part of my proposal; notice I was talking about allow/deny lists even in option (1). An allow list would include only the specified partitions, while a deny list would do the reverse: include only those not on the list.
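So the two list modes of the hypothetical tool would be mirror images of each other, covering both the giant-partition eviction and the single-partition extraction scenarios (flags illustrative):

```bash
# Deny list: drop the giant partition, keep everything else.
scylla sstable filter --deny-list-file giant.txt \
    --output-dir ./out/ me-1-big-Data.db

# Allow list: keep only the partition under investigation, drop the rest.
scylla sstable filter --allow-list-file suspect.txt \
    --output-dir ./out/ me-1-big-Data.db
```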

fee-mendes commented 1 year ago

> If we proceed with this tool, it would be important not to replace the input SSTables, but rather to write the output SSTables to a designated location, so it is safe by default. The user can then decide to back up the original files in case anything goes wrong, such as accidentally discarding more partitions than needed.

Yes, that's the general idea. Scrub takes a snapshot by default before touching the data; the concept here should be much the same.
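For reference, the snapshot-before-modify pattern that scrub relies on can also be requested by hand with the Cassandra-compatible nodetool syntax; the filtering tool would get equivalent safety for free by only ever writing into its output directory:

```bash
# Hard-link the table's current sstables under .../snapshots/pre-filter/
# before doing anything destructive; snapshots are cheap to create.
nodetool snapshot -t pre-filter -cf tbl ks
```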

denesb commented 1 year ago

The sstable tool never modifies its input sstables.

raphaelsc commented 1 year ago

this issue is full of heart and like emojis, I both love and like it.

denesb commented 1 year ago

the sstable tool brings people together