Indexes followup - Githubissues

aboodman commented 3 years ago

P1: blocks clean, correct indexing; in rough order of priority

[x] only index strings
[x] write.clear() needs to update indexes
[x] scan start and prefix need to be properly serialized
[x] add more unit (eg, scan after write, !scan after drop), embed tests
[x] Rebuild index during sync and add sync indexing tests
[x] address batching bug, maybe by removing batching or by fixing it, hard to say at the moment
[x] index arrays, maybe just arrays of strings
[x] prevent index change mutations from being sent to the server: https://github.com/rocicorp/repc/issues/257
[x] figure out how to return primary key for index scans
- aaron says: We must return the secondary key if we want to implement the current (batched) scan rpc. Of course it's always possible to change it. Changes to what is returned through scan would necessitate probably changes to the JS API. Not a reason to avoid, just pointing it out.
- fritz note 10/30: i know how to do this, will get to it
- aaron note 10/31: arv removed the batching from scan so that part of my comment above is now irrelevant fyi
[x] be able to scan index by primary as well: https://github.com/rocicorp/repc/issues/259
[ ] figure out what to do about the fact that you can put() non-json values: https://github.com/rocicorp/repc/issues/251
[x] additional work: some of the above has to be reflected into js sdk

P2: nice to have; will revisit which of these to do once P1s are complete

[x] make ScanOptionsInternal nicer, see TODO added in https://github.com/rocicorp/repc/pull/250
[x] maybe move decoding of index keys to db::Read? https://github.com/rocicorp/repc/blob/73a2dfc9bcdf131424cad45cc3acae14be8b1f1f/src/embed/connection.rs#L481
[x] get rid of unwrap()s in scan: https://github.com/rocicorp/repc/blob/73a2dfc9bcdf131424cad45cc3acae14be8b1f1f/src/embed/connection.rs#L496
[ ] parallelize writes where easy
[ ] delay loading main map (aka "primary index" in db terms) too -- if we protect it by a mutex in the same way we can share code btwn the primary and secondary indexes. also for many use cases, users won't even need the main map, we're loading it for no reason.
[ ] evaluate json pointer more efficiently
[ ] use iterator for indexes
[x] make exclusive required
[x] scanbound.start.id should be called key
[ ] fix the method that is worked around
[x] startKey needs to accept both secondary and primary key

TODOs to remove:

[ ] turn this could be DeRef into a comment, not TODO

aboodman commented 3 years ago

I have a draft patch for 'clear' here: https://github.com/rocicorp/repc/pull/229/commits/b7719e6f0ac48f05ff5a69799ec9d5f1ec42b48f. Not going to land it @phritz you can take it as a starting point or not.

phritz commented 3 years ago

Edit: content moved to top-level comment.

~~OK trying to formulate the master list of indexing followups. aaron has a good start above but I'd like to round it out with TODOs from the code and issues filed recently. @aboodman @arv please:~~

say your opinion on the following open question: do we reduce scope to only index strings? I say YES. I think we should keep using bytekey in order to facilitate change or expansion of key types.
Look over the list below and tell me what if anything is missing or should be shifted around. Once we have agreement I will replace the top comment with the list and get started.

P1: blocks clean, correct indexing; in rough order of priority

only index strings if we decide that's the route to go. suggest doing it first because it reduces amount of indexing code.
(exists above) write.clear() needs to update indexes
scan start and prefix need to be properly serialized
(exists above) add more unit (eg, scan after write, !scan after drop), embed tests
(exists above) Rebuild index during sync and add sync indexing tests
address batching bug, maybe by removing batching or by fixing it, hard to say at the moment
index arrays, maybe just arrays of strings
(exists above) prevent index change mutations from being sent to the server
figure out how to return primary key for index scans
- aaron says: We must return the secondary key if we want to implement the current (batched) scan rpc. Of course it's always possible to change it. Changes to what is returned through scan would necessitate probably changes to the JS API. Not a reason to avoid, just pointing it out.
additional work: some of the above has to be reflected into js sdk

P2: nice to have; will revisit which of these to do once P1s are complete

(exists above) parallelize writes where easy
(exists above) delay loading main map (aka "primary index" in db terms) too -- if we protect it by a mutex in the same way we can share code btwn the primary and secondary indexes. also for many use cases, users won't even need the main map, we're loading it for no reason.
(exists above) evaluate json pointer more efficiently
(exists above) use iterator for indexes
make exclusive required
scanbound.start.id should be called key
fix the method that is worked around

TODOs to remove:

turn this could be DeRef into a comment, not TODO

aboodman commented 3 years ago

say your opinion on the following open question: do we reduce scope to only index strings? I say YES. I think we should keep using bytekey in order to facilitate change or expansion of key types.

yes.

Look over the list below and tell me what if anything is missing or should be shifted around. Once we have agreement I will replace the top comment with the list and get started.

lgtm, just some comments below:

address batching bug, maybe by removing batching or by fixing it, hard to say at the moment

Erik and I talked about this a bit today at the playground. Not sure what the right path is.

fix the method that does not work

The code currently works around this method not working. I left it in hoping to make it work later, as it would be more consistent with the same method in MapReadGuard.

figure out how to return primary key for index scans (I think we should def do this)

We must return the secondary key if we want to implement the current (batched) scan rpc. Of course it's always possible to change it. Changes to what is returned through scan would necessitate probably changes to the JS API. Not a reason to avoid, just pointing it out.

phritz commented 3 years ago

Thanks. Moved list to top level comment and will fold in anything arv says there.

phritz commented 3 years ago

figure out how to return primary key for index scans (I think we should def do this)

We must return the secondary key if we want to implement the current (batched) scan rpc. Of course it's always possible to change it. Changes to what is returned through scan would necessitate probably changes to the JS API. Not a reason to avoid, just pointing it out.

Seems like we have a few options:

Since we are only indexing strings for now we could roll our own index key serialization that enabled us to recover the primary and/or secondary keys from the index key. I think we should NOT do this because it is brittle -- we end up having to re-invent bytekey is we want to add additional types.
Be able to recover the primary and/or secondary keys from the prolly map entry in the index map. That is, store a struct as the val of an index map entry and have this struct include primary and/or secondary index values. I think we should NOT do this because it violates DRY (the data are in the key).
Recover the primary and/or secondary keys from the index map by deserializing them from they key. The key is the bytekey-encoded IndexKey. We can deserialize the IndexKey struct https://docs.rs/bytekey-fix/0.5.0/bytekey_fix/de/index.html as long as we know the types, which we do because for now the secondary key is always a string. In the future if we index more than strings we can know the type using the index definition. I think we SHOULD do this, keeping bytekey and our value wrapper enums in case we want to index additional types in the future.

So question 1: should we do 3 above?

Also: I would like to propose that we make the scan item work more consistently across a regular scan and an indexed scan. Right now we send the receiver a prolly::Entry which is a key and val of [u8] https://github.com/rocicorp/repc/blob/fd413d04f5dc9d30ec4f4061c7411cc995beebbb/src/prolly/mod.rs#L11. However:

'key' is overloaded: it will the primary key in the normal scan case but the secondary key in the index scan case. Confusing.
there is no distinction between primary and secondary index key values in Entry
seems like it would be a good thing to separate the structure of the thing we store in the prolly tree from the thing we give to js as a result of scan. Those two probably shouldn't have the same type.

So question 2: I'm wondering if we should introduce a new type to pass to js instead of prolly::Entry that makes the two cases more consistent:

// ScanResult is a single result from a scan. Typically a scan will yield many
// ScanResults.
pub struct ScanResult {
  // key is the primary key of the result (ie, key in the value map). key will always be set.
  pub key: [u8],

  // secondary_key is set if an index was used for the scan and contains the value of the
  // secondary key that was indexed. At present, this is the UTF-8 bytes of the string value
  // that was indexed (repc strips any wrapping it applied internally).
  pub secondary_key: [u8],

  // val is the full value at key above.
  pub val: [u8],
}

aboodman commented 3 years ago

On Thu, Oct 22, 2020 at 11:57 AM Phritz notifications@github.com wrote:

Since we are only indexing strings for now we could roll our own index key serialization that enabled us to recover the primary and/or secondary keys from the index key. I think we should NOT do this because it is brittle -- we end up having to re-invent bytekey is we want to add additional types.

Be able to recover the primary and/or secondary keys from the prolly map entry in the index map. That is, store a struct as the val of an index map entry and have this struct include primary and/or secondary index values. I think we should NOT do this because it violates DRY (the data are in the key).

Recover the primary and/or secondary keys from the index map by deserializing them from they key. The key is the bytekey-encoded IndexKey. We can deserialize the IndexKey struct https://docs.rs/bytekey-fix/0.5.0/bytekey_fix/de/index.html as long as we know the types, which we do because for now the secondary key is always a string. In the future if we index more than strings we can know the type using the index definition. I think we SHOULD do this, keeping bytekey and our value wrapper enums in case we want to index additional types in the future.

I vote (3) but I thought that since IndexKey.IndexValue is an enum, we don't need to do anything else to deserialize it correctly. That is, we don't need to do:

we can know the type using the index definition.

in case we want to index additional types in the future

agree

So question 1: should we do 3 above?

Yes

Also: I would like to propose that we make the scan item work more consistently across a regular scan and an indexed scan. Right now we send the receiver a prolly::Entry which is a key and val of [u8] https://github.com/rocicorp/repc/blob/fd413d04f5dc9d30ec4f4061c7411cc995beebbb/src/prolly/mod.rs#L11. However:

'key' is overloaded: it will the primary key in the normal scan case but the secondary key in the index scan case. Confusing.

Agree.

there is no distinction between primary and secondary index key values in Entry

Think I agree.

seems like it would be a good thing to separate the structure of the thing we store in the prolly tree from the thing we give to js as a result of scan. Those two probably shouldn't have the same type.

They are separate currently: https://github.com/rocicorp/repc/blob/master/src/embed/connection.rs#L481

They are passed as bare params to JS not the prolly::Entry struct.

So question 2: I'm wondering if we should introduce a new type to pass to js instead of prolly::Entry that makes the two cases more consistent:

// ScanResult is a single result from a scan. Typically a scan will yield many // ScanResults. pub struct ScanResult { // key is the primary key of the result (ie, key in the value map). key will always be set. pub key: [u8],

// secondary_key is set if an index was used for the scan and contains the value of the // secondary key that was indexed. At present, this is the UTF-8 bytes of the string value // that was indexed (repc strips any wrapping it applied internally). pub secondary_key: [u8],

// val is the full value at key above. pub val: [u8], }

If I'm understanding you correctly you are saying that if I put 42, {name: "Bob"} and index by name then when I scan the name index, I'd get: {key: 42, secondary_key: "Bob", val: {name: "Bob"}.

Is that right? So secondary_key would not also contain the primary key concat'd in there?

A few thoughts:

We don't use the term 'secondary' anywhere in the API so maybe we should call that field 'index_key' ?
For the scan() API we have things like start.key that when scanning an index need the secondary key to be passed. Is it weird to call it 'key' in the API input but 'secondary_key' or 'index_key' in the output?

No strong opinions here, just questions. We can just choose something and improve it over time.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rocicorp/repc/issues/218#issuecomment-714785321, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAATUBABZ6B6AOHWWJPKLGDSMCTF5ANCNFSM4ST5D4RQ .

phritz commented 3 years ago

Regarding adding more tests...

Three cases are mentioned explicitly. Adding testing for sync is clear to me. Adding test for 'scan after write' and '!scan after drop' I think refers to the TODOs in test_index in the wasm tests, correct?

In addition to the above, seems like we should have some unit testing for scan in repc. (I understand we have js tests, but for things like changing indexing we really want some tests in repc itself.) Dispatch tests seem like a good way to go given the complexity of testing it at a lower level. We have these tests which are commented out which seems the obvious place to start. Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?

Are there any other tests that we think we should have?

aboodman commented 3 years ago

I feel like from a high level, these are the dimensions you'd want to test:

scan after creating an index from scratch
scan after sync applies a patch to an index
scan after incrementally updating an index (put, clear, del, drop)
both of above both before and after commit (the index should give correct answer even within a transaction, before commit happens)
for last three of above, we want to verify that index contents match exactly index that would be built from scratch

Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?

I can't remember if those tests are run in the browser or not. If they are not then that's your problem right there. In that case maybe the answer is to make them run in browser or maybe answer is to introduce a lightweight abstraction that enables Rust to play the role of JS callback.

If they are run in the browser then probably the issue was just the time to learn how to use the correct wasm_bindgen APIs to create a function to receive the results.

Sorry I don't remember any more.

On Thu, Oct 22, 2020 at 1:47 PM Phritz notifications@github.com wrote:

Regarding adding more tests...

Three cases are mentioned explicitly. Adding testing for sync is clear to me. Adding test for 'scan after write' and '!scan after drop' I think refers to the TODOs in test_index https://github.com/rocicorp/repc/blob/d56fc1cc0baa20c4d786355b6c7a4e192ee3c5b9/tests/wasm.rs#L403 in the wasm tests, correct?

In addition to the above, seems like we should have some unit testing for scan in repc. (I understand we have js tests, but for things like changing indexing we really want some tests in repc itself.) Dispatch tests seem like a good way to go given the complexity of testing it at a lower level. We have these tests which are commented out https://github.com/rocicorp/repc/blob/d56fc1cc0baa20c4d786355b6c7a4e192ee3c5b9/tests/wasm.rs#L444 which seems the obvious place to start. Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?

Are there any other tests that we think we should have?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rocicorp/repc/issues/218#issuecomment-714821837, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAATUBDOVMZMOUY3C2Q34MLSMC763ANCNFSM4ST5D4RQ .

phritz commented 3 years ago

Thanks, this is enough for me to get going on both fronts. Will surface if I need more input.

phritz commented 3 years ago

hitting pause and removing from milestone 2

aboodman commented 3 years ago

It might be worth breaking out the remaining issues into their own bugs where not that way already and closing this?

My main thing is that I want to ensure that we have measured and are happy with current performance.

phritz commented 3 years ago

https://github.com/rocicorp/repc/issues/264

phritz commented 3 years ago

@arv I think the only thing left before we can consider this work done for milestone 2 is that @aboodman said:

My main thing is that I want to ensure that we have measured and are happy with current performance.

Unclear to me what if anything there is left to do on this front, or if it is on your or my plate. Do you have a handle on what's left performance-wise, if anything?

arv commented 3 years ago

I'm also not clear what action to take there. Maybe add these perf tests?

[ ] scan over index
[ ] scan that uses a bit deeper JSON path
[ ] populate with deeper JSON path

aboodman commented 3 years ago

I don't think we need the second two. The ones I'd like to see are:

scan over index
pull with {0, 1, 2} indexes present. It's not clear to me how to test this since our current perf infra is in JS and I think we want to keep that for consistency. We need some way to simulate a pull without connecting to server.

aboodman commented 3 years ago

It seems like this work needs to go to the js sdk side.

arv commented 3 years ago

We need some way to simulate a pull without connecting to server.

There is a mock fetch in the tests. We can use the same approach for the perf tests. We would have to figure out what data to use for the pull to make it somewhat realistic.

arv commented 3 years ago

I feel like we might want to change the index tests to not use root as the JSON path. It is not realistic.

aboodman commented 3 years ago

On Thu, Nov 19, 2020 at 8:24 AM Erik Arvidsson notifications@github.com wrote:

We need some way to simulate a pull without connecting to server.

There is a mock fetch in there already. We would have to figure out what data to use for the pull to make it somewhat realistic.

It's fine for it to not be realistic. The scan/put tests just use random 1k chunks or whatever. We can do the same thing. The reason it has to be random is so that idb doesn't compress it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rocicorp/repc/issues/218#issuecomment-730554566, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAATUBEDW6QXUSPCFZLZRKLSQVPEHANCNFSM4ST5D4RQ .

aboodman commented 3 years ago

OK agreed. I don't know how much of an effect having more complex json pointers is, but worth answering.

On Thu, Nov 19, 2020 at 8:26 AM Erik Arvidsson notifications@github.com wrote:

I feel like we might want to change the index tests to not use root as the path. It is not realistic.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rocicorp/repc/issues/218#issuecomment-730555390, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAATUBGBWMFRPTLGBIOXLTTSQVPKJANCNFSM4ST5D4RQ .

aboodman commented 3 years ago

So this was sort of lost in the rush to complete, but our current benchmarks show that adding 1 index halves population performance.

Compare the "indexes: 1" benchmark with the "indexes: 0" benchmark here: https://replicache-sdk-js-perf-git-perf-data.rocicorp.vercel.app/.

I don't remember this being so bad when I first prototyped things. I think we should investigate before calling this closed. Sorry @phritz :(.

aboodman commented 3 years ago

(Or split off onto separate bug, whatever)

aboodman commented 3 years ago

A big chunk of that dip seemed to come in at https://github.com/rocicorp/repc/releases/tag/v0.13.0 (https://rocicorp.slack.com/archives/CMQQ9EDML/p1605815364167300?thread_ts=1605811133.163100&cid=CMQQ9EDML).

rocicorp / repc

Indexes followup #218