josephg / node-foundationdb

Modern Node.js FoundationDB bindings
Other
115 stars 17 forks source link

Querying subspace end #78

Open keijokapp opened 3 months ago

keijokapp commented 3 months ago

There doesn't seem to be a mechanism to do range operations from an arbitrary key to the end of the subspace. This has been bothering me for years but I've managed to work around or avoid it so far.

I've gone down the rabbit hole trying to understand how subspace ends should be handled in the context of range operations. Maybe it should be a topic for the FDB forum but the issue is at least partially specific to NodeJS bindings.

Problems

  1. Suppose I need keyspace for arbitrary keys (byte strings). Perhaps binary compressed IDs for the users' collection. Any section of the keyspace should be enumerable - from the very beginning to any arbitrary key to the very end. It's currently not possible to query from an arbitrary key to the end of the namespace. At best, I could query until '\xff'.

  2. When using a key encoding, it might not be possible to enumerate even until '\xff'. For example, tuple encoding has no equivalent to '\xff'. At best, I could hope that no actual keys start with a version stamp (0x33) or a UUID (0x30) and use those as the end key.

Workarounds

  1. Use Subspace and .at to remove the configured key encoder and do all packing/unpacking manually (or not configure a key encoder in the first place). That way, the developer can apply strInc to the end key to get the desired behavior.

  2. Use a custom key encoder, similar to a prefix or tuple encoder but with a special value (eg. ES6 symbol) to indicate "end of keyspace". If the encoder encounters that special value, it returns '\xff', otherwise the original key with eg \x00 (or any non-\xff) prefix.

Other bindings

Other bindings avoid the problem with a combination of the following:

  1. Using tuples for everything. Subspace prefixes are always tuples. (Python, Go) Keys are always tuples. (Python, Go). That removes the question about whether keys >=\xff are part of the subspace or not because tuples don't start with \xff.

  2. Expecting users to pack their keys explicitly, enabling them to do strInc or whatever as needed. (Python, Go)

  3. Optional start and end parameters, defaulting to '' and '\xff' respectively. (Python) Though it may or may not be relevant in the context of subspaces.

NodeJS bindings are unique in the sense that they are not specialized for tuple encoding and allow "hiding" an arbitrary key encoder in the subspace, which in turn is "hidden" in the transaction object. That way, the developer can't directly use raw keys.

Keys >='\xff'

There's a forum post explaining that the keyspace foo should not contain entries in the range ['foo\xff', 'fop'). That's not a big problem with the official bindings because keys are expected to be tuples. The official bindings also have inconsistencies (eg. Subspace.contains returning True for a key that wouldn't actually be part of the range). I don't think it's a reasonable limitation. The root \xff keyspace is an abstraction leak which prevents users from using the keyspace for arbitrary keys. The leak makes some sense for the root keyspace but artificially creating it for subspaces does not. In my opinion, the benefits are negligible but the implications are significant. If this is the case, then getRange('') and clearRange('') should apply only to the entries until '\xff' instead of the next key after the keyspace as they currently do. Of course it wouldn't make a difference if there are no keys >='\xff' but it's still an inconsistency.

I think that keys after >='\xff' should be considered first-class citizens in the subspace and have to be fully enumerable just like all other keys.

Solutions

Ideally, the end parameter for getRange and clearRange methods would default to the next key after the subspace instead of after the prefix end. The users could use existing *StartsWith methods to get the current behavior. That's similar to Python bindings. (Also similarly, start could be optional, defaulting to '' independent of the encoder.) Changing this is dangerous though. db.at('foo').clearRange('bar') would have previously cleared only ['foobar', 'foobas') but with the new behavior, it clears ['foobar', 'fop'). A careless developer might not notice a change before deploying to production.

An alternative would be to introduce a special value (in addition to null and undefined which are currently used to indicate "end of prefix") that would be used in a place of end. An ES6 symbol would be fine.

josephg commented 2 months ago

So, the underlying FDB API calls includes a "or equals" flag on each of the ends of the range bounds. So you can differentiate between a range including or excluding the upper bound. Using that, as far as I'm concerned the correct solution to this problem is to clear the range up to the next key - but not including it. I think this gives you correct behaviour in all cases.

db.at('foo').clearRange('bar') would have previously cleared only ['foobar', 'foobas') but with the new behavior, it clears ['foobar', 'fop'). A careless developer might not notice a change before deploying to production.

If the upper bound is specified as "strictly less than fop", then it won't match "fop" and (unless I'm missing something), it should be fine.

I think thats what the existing code does in this case - but I haven't looked at it in a couple years, so I could be mistaken. Thats certainly what it should do.

josephg commented 2 months ago

I just checked, and thats what it does. If you don't pass an end key toclearRange, it generates a range from the subspace prefix by default, using strInc(prefix) to generate the upper bound.

That upper bound is passed to fdb_transaction_clear_range, which is specified like this:

Modify the database snapshot represented by transaction to remove all keys (if any) which are lexicographically greater than or equal to the given begin key and lexicographically less than the given end_key.

(emphasis mine). Apparently fdb_transaction_clear_range doesn't support differentiating between "less than" and "less than or equal to". It always clears keys in the range of [lower, upper) - and thats the exact behaviour we want here.

I don't have a strong opinion about keys starting with \xff. Thats probably something to raise on the FDB forum.

Let me know if there's anything else you want to talk about with this, otherwise I'll close the issue.

keijokapp commented 2 months ago

I just checked, and thats what it does. If you don't pass an end key to clearRange, it generates a range from the subspace prefix by default, using strInc(prefix) to generate the upper bound.

Yes. That's the issue. I need a way to leave end key unspecified as in "end of the subspace" instead of "end of the prefix".

Let's say I have a subspace foo.

I can do tn.at('foo').clearRange('') to clear the whole subspace. But what if I want to clear the whole subspace starting from a specific key, eg \x56? The intended range to clear would be ['foo\x56', 'fop').

  1. Doing tn.at('foo').clearRange('\x56') would be equivalent to clearing ['foo\x56', 'foo\x57'). It does not address keys in the range ['foo\x57', 'fop').
  2. Doing tn.at('foo').clearRange('\x56', '\xff') does not address range ['foo\xff', 'fop'). Of course, I could keep adding \xff-s hoping that there wouldn't be any keys after that but that's often not reasonable.

Other bindings mitigate this problem by, first, putting everything into tuples, so that the developer knows that there are no keys >='\xff'. They could do (pseudocode) fooSubspace.clearRange(tuple.pack(['\x56']), '\xff') and not worry about those entries. Shoving everything into tuples is questionable in my mind on its own but even if I tried using tuples here, I'm hit with another problem:

  1. tn.at('foo', fdb.encoders.tuple).clearRange(['\x56']); would, again, clear just the prefix
  2. There's no tuple equivalent of \xff to be passed as an end key, ie tn.at('foo', fdb.encoders.tuple).clearRange(['\x56'], ???);

I could, of course, do encoding/decoding manually (ie tn.at('foo').clearRange(tuple.pack(['\x56']), '\xff');) but that defeats the abstraction.

Ideally:

  1. tn.at('foo').clearRangeStartsWith('\x56') would clear range ['foo\x56', 'foo\x57'); and
  2. tn.at('foo').clearRange('\x56') would clear range ['foo\x56', 'fop').

That change would be dangerous because, currently, these methods have behavior (1) which acts on a much smaller range than with behavior (2). That would certainly be a major version change.

josephg commented 2 months ago

Ah. That makes sense. Yeah there’s no way I’m changing the behaviour of clearRange like that even in a major version change. The chance of people not reading the release notes and suffering data loss is too high.

But we could add another method for this use case quite easily - something like txn.clearRangeFrom(startKey) which would clear the range from the start key until the end of the subspace.

keijokapp commented 2 months ago

Understandable. I'd propose two other options:

  1. An ES6 symbol as the end key, eg. endOfKeyspaceSymbol. That would look a bit hacky and it's hard to document why this is necessary. But I think it's better than adding new methods. The new methods would need to be added for all methods that work on ranges and there are already a lot of them.
  2. An option on a subspace. The subspace would have an option that defines how the undefined end key behaves. The option propagates to all subspaces created from that subspace. The option could be enabled with a Subspace method similar to how snapshot mode is enabled on Transaction.

I haven't experimented with the second option yet but it seems to be the nicest to me. It's less hacky/bloated. It's seems to be best of both worlds - no backwards compatibility issues and the API could still be clean and intuitive (except the initial method call to enable this mode).

I'd happily implement this. Is that okay?

josephg commented 2 months ago

Hm, those are good ideas. I think I like option 1 the most.

My reasoning is this: Range functions (like clearRange) currently have 2 behaviours depending on the number of parameters:

  1. When you call them with a (start, end) parameter, the range is [start, end).
  2. When you call them with only 1 parameter, the range is all keys which start with the specified prefix.

What you're proposing is adding a 3rd behaviour to the function:

(3.) Call a range function with 1 parameter, specifying the range from [key, end_of_subspace)

(While we're at it, it might make sense to add a function for [start_of_subspace, key) for symmetry - though this is sort of covered by passing ('', key). But thats hacky - since the empty string (in this case) will be interpreted through the transformer function.)

Option 1 proposes that we define a sentinal value which means "the start (or end) of the subspace" in all cases, in order to differentiate between behaviour 2 and 3.

Specifying this behaviour globally (via an option on the subspace) seems weird to me. What if an application wants to sometimes use behaviour 2 and sometimes use behaviour 3? It feels like a decision the user would want to make at the point at which they call clearRange and friends. Deciding this when they configure their subspace locks you in unnecessarily. And how would you know which behaviour you'll want throughout your application?

To me, semantically, it doesn't seem like a property of the subspace. It seems like a property of the explicit range call you're making.

A third option would be to create a separate range type - which would be a more explicit API than using a sentinel value:

txn.clearRange(range.withPrefix("xxx"))
txn.clearRange(range.fromKey("startKey")) // implicitly, until the end of the subspace
txn.clearRange(range.untilKey("endKey")) // Clear from the start of the subspace to "endkey"

The one downside is that this would introduce another temporary object allocation for each call. Probably not a big deal though.

keijokapp commented 2 months ago

What if an application wants to sometimes use behaviour 2 and sometimes use behaviour 3?

They shouldn't. There's no reason to keep using clearRange (and friends) to get behavior 2 when there are StartsWith variants of all those methods. These behaviors seem distinct enough to me that they shouldn't be in the same method. That's what feels natural to me anyway. I've started to use clearRange lately for behavior 2 out of laziness and because I know the underlying semantics but it doesn't feel natural.

The subspace flag wouldn't be global. If the application is big and don't want to migrate to StartsWith variants in one go, the option can be specified on a per operation basis (with some verbosity and overhead) or for specific subspaces that need that behavior. This would be a bit messy but not too terrible IMO.

I'm prioritizing day-to-day developer ergonomics and robustness. (I've always appreciated how this library almost manages to abstract away just the right amount of complexity from the developer and do the right thing. The official bindings are quite basic and opinionated (to tuples) in that sense.) Adding the range API would hurt the ergonomics a bit, even if it's just an overload. The symbols are also awkward IMO. There's also additional complexity with key selectiors involving "start of keyspace" and "end of keyspace". The subspace flag option would improve ergonomics even compared to current state because I could do:

todoTxn.getRange()

to get all "todos" as opposed to todoTxn.getRange([]) that I'd currently need to do; or

function getTodoPage(from?: TodoKey, to?: TodoKey) {
  return txn.at(todoSubspace).getRange(from, to)
}

and get the expected behavior without additional bloat. Though adding ?? startOfKeyspace and ?? endOfKeyspace wouldn't be that hard either.

It might sound silly but if you are not convinced, could we add both features? I think that the sum of complexity and maintenance burden of both options wouldn't be much bigger than with individual options. All the implementation complexity would be encapsulated into Subspace#packRange. There's the burden of additional typings and documentation though. I've made a PR in an anticipation that even if we go with another option, the bulk of code and structure is the same. If that's not an option, I'd like to be able to substitute Subspace class with a custom implementation. I haven't thought it through yet.