Open aboodman opened 4 years ago
I have a draft patch for 'clear' here: https://github.com/rocicorp/repc/pull/229/commits/b7719e6f0ac48f05ff5a69799ec9d5f1ec42b48f. Not going to land it; @phritz, you can take it as a starting point or not.
Edit: content moved to top-level comment.
OK, trying to formulate the master list of indexing follow-ups. Aaron has a good start above, but I'd like to round it out with TODOs from the code and recently filed issues. @aboodman @arv please:
P1: blocks clean, correct indexing; in rough order of priority
P2: nice to have; will revisit which of these to do once P1s are complete
TODOs to remove:
Weigh in on the following open question: do we reduce scope to indexing only strings? I say YES. I think we should keep using bytekey in order to facilitate changing or expanding the key types.
yes.
Look over the list below and tell me what if anything is missing or should be shifted around. Once we have agreement I will replace the top comment with the list and get started.
lgtm, just some comments below:
address the batching bug, maybe by removing batching or by fixing it; hard to say at the moment
Erik and I talked about this a bit today at the playground. Not sure what the right path is.
fix the method that does not work
The code currently works around this method not working. I left it in hoping to make it work later, as it would be more consistent with the same method in MapReadGuard.
figure out how to return primary key for index scans (I think we should def do this)
We must return the secondary key if we want to implement the current (batched) scan RPC. Of course, it's always possible to change it. Changes to what is returned through scan would probably necessitate changes to the JS API. Not a reason to avoid it, just pointing it out.
Thanks. Moved list to top level comment and will fold in anything arv says there.
figure out how to return primary key for index scans (I think we should def do this)
We must return the secondary key if we want to implement the current (batched) scan RPC. Of course, it's always possible to change it. Changes to what is returned through scan would probably necessitate changes to the JS API. Not a reason to avoid it, just pointing it out.
Seems like we have a few options:
So question 1: should we do 3 above?
Also: I would like to propose that we make the scan item work more consistently across a regular scan and an indexed scan. Right now we send the receiver a prolly::Entry which is a key and val of [u8] https://github.com/rocicorp/repc/blob/fd413d04f5dc9d30ec4f4061c7411cc995beebbb/src/prolly/mod.rs#L11. However:
So question 2: I'm wondering if we should introduce a new type to pass to js instead of prolly::Entry that makes the two cases more consistent:
// ScanResult is a single result from a scan. Typically a scan will yield many
// ScanResults.
pub struct ScanResult {
// key is the primary key of the result (ie, key in the value map). key will always be set.
// key is the primary key of the result (ie, key in the value map). key will always be set.
pub key: Vec<u8>,
// secondary_key is set if an index was used for the scan and contains the value of the
// secondary key that was indexed. At present, this is the UTF-8 bytes of the string value
// that was indexed (repc strips any wrapping it applied internally).
pub secondary_key: Vec<u8>,
// val is the full value at key above.
pub val: Vec<u8>,
}
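To make the proposal concrete, here's a hedged sketch of how the two scan cases could populate a `ScanResult`. This is illustrative only, not repc's actual code: the constructor names are made up, and the fields use owned `Vec<u8>` so the struct is sized.

```rust
// Illustrative sketch of the proposed ScanResult; not repc's actual code.
#[derive(Debug, PartialEq)]
pub struct ScanResult {
    pub key: Vec<u8>,           // primary key; always set
    pub secondary_key: Vec<u8>, // set only for index scans, else empty
    pub val: Vec<u8>,           // full value at `key`
}

// A regular scan yields entries keyed by the primary key only.
fn from_regular_scan(key: &[u8], val: &[u8]) -> ScanResult {
    ScanResult {
        key: key.to_vec(),
        secondary_key: Vec::new(),
        val: val.to_vec(),
    }
}

// An index scan additionally carries the secondary (indexed) key.
fn from_index_scan(primary: &[u8], secondary: &[u8], val: &[u8]) -> ScanResult {
    ScanResult {
        key: primary.to_vec(),
        secondary_key: secondary.to_vec(),
        val: val.to_vec(),
    }
}

fn main() {
    let r = from_index_scan(b"42", b"Bob", br#"{"name":"Bob"}"#);
    assert_eq!(r.key, b"42".to_vec());
    assert_eq!(r.secondary_key, b"Bob".to_vec());
    assert!(from_regular_scan(b"k", b"v").secondary_key.is_empty());
}
```

Either way, JS receives the same shape for both scan flavors; the empty `secondary_key` distinguishes a regular scan.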
On Thu, Oct 22, 2020 at 11:57 AM Phritz notifications@github.com wrote:
- Since we are only indexing strings for now we could roll our own index key serialization that enabled us to recover the primary and/or secondary keys from the index key. I think we should NOT do this because it is brittle -- we end up having to re-invent bytekey if we want to add additional types.
- Be able to recover the primary and/or secondary keys from the prolly map entry in the index map. That is, store a struct as the val of an index map entry and have this struct include primary and/or secondary index values. I think we should NOT do this because it violates DRY (the data are in the key).
- Recover the primary and/or secondary keys from the index map by deserializing them from the key. The key is the bytekey-encoded IndexKey. We can deserialize the IndexKey struct https://docs.rs/bytekey-fix/0.5.0/bytekey_fix/de/index.html as long as we know the types, which we do because for now the secondary key is always a string. In the future if we index more than strings we can know the type using the index definition. I think we SHOULD do this, keeping bytekey and our value wrapper enums in case we want to index additional types in the future.
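The core idea of option 3 can be sketched as follows. Note this is a simplified stand-in: repc actually uses bytekey's order-preserving encoding, whereas this uses a plain length prefix (which does NOT sort correctly and so wouldn't work for real index scans); the point is only to show that both keys are recoverable from the index map key alone, given that we know the secondary key's type.

```rust
// Simplified stand-in for option 3: encode (secondary, primary) into one
// index key and recover both by deserializing. repc uses bytekey for the
// real encoding; this length-prefixed format is illustrative only and is
// not order-preserving.
fn encode_index_key(secondary: &str, primary: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(secondary.len() as u32).to_be_bytes());
    out.extend_from_slice(secondary.as_bytes());
    out.extend_from_slice(primary);
    out
}

fn decode_index_key(key: &[u8]) -> (String, Vec<u8>) {
    // We can decode because we know the secondary key is a string; with more
    // indexed types, the index definition would tell us which type to expect.
    let len = u32::from_be_bytes(key[..4].try_into().unwrap()) as usize;
    let secondary = String::from_utf8(key[4..4 + len].to_vec()).unwrap();
    let primary = key[4 + len..].to_vec();
    (secondary, primary)
}

fn main() {
    let k = encode_index_key("Bob", b"42");
    let (sec, pri) = decode_index_key(&k);
    assert_eq!(sec, "Bob");
    assert_eq!(pri, b"42".to_vec());
}
```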
I vote (3), but I thought that since IndexKey.IndexValue is an enum, we don't need to do anything else to deserialize it correctly. That is, we don't need to do:
we can know the type using the index definition.
in case we want to index additional types in the future
agree
So question 1: should we do 3 above?
Yes
Also: I would like to propose that we make the scan item work more consistently across a regular scan and an indexed scan. Right now we send the receiver a prolly::Entry which is a key and val of [u8] https://github.com/rocicorp/repc/blob/fd413d04f5dc9d30ec4f4061c7411cc995beebbb/src/prolly/mod.rs#L11. However:
- 'key' is overloaded: it will be the primary key in the normal scan case but the secondary key in the index scan case. Confusing.
Agree.
- there is no distinction between primary and secondary index key values in Entry
Think I agree.
- seems like it would be a good thing to separate the structure of the thing we store in the prolly tree from the thing we give to js as a result of scan. Those two probably shouldn't have the same type.
They are separate currently: https://github.com/rocicorp/repc/blob/master/src/embed/connection.rs#L481
They are passed as bare params to JS not the prolly::Entry struct.
So question 2: I'm wondering if we should introduce a new type to pass to js instead of prolly::Entry that makes the two cases more consistent:
// ScanResult is a single result from a scan. Typically a scan will yield many
// ScanResults.
pub struct ScanResult {
// key is the primary key of the result (ie, key in the value map). key will always be set.
pub key: Vec<u8>,
// secondary_key is set if an index was used for the scan and contains the value of the
// secondary key that was indexed. At present, this is the UTF-8 bytes of the string value
// that was indexed (repc strips any wrapping it applied internally).
pub secondary_key: Vec<u8>,
// val is the full value at key above.
pub val: Vec<u8>,
}
If I'm understanding you correctly you are saying that if I put
42, {name: "Bob"}
and index byname
then when I scan the name index, I'd get: {key: 42, secondary_key: "Bob", val: {name: "Bob"}}.
Is that right? So secondary_key would not also contain the primary key concat'd in there?
A few thoughts:
No strong opinions here, just questions. We can just choose something and improve it over time.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rocicorp/repc/issues/218#issuecomment-714785321, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAATUBABZ6B6AOHWWJPKLGDSMCTF5ANCNFSM4ST5D4RQ .
Regarding adding more tests...
Three cases are mentioned explicitly. Adding testing for sync is clear to me. Adding test for 'scan after write' and '!scan after drop' I think refers to the TODOs in test_index in the wasm tests, correct?
In addition to the above, seems like we should have some unit testing for scan in repc. (I understand we have js tests, but for things like changing indexing we really want some tests in repc itself.) Dispatch tests seem like a good way to go given the complexity of testing it at a lower level. We have these tests which are commented out which seems the obvious place to start. Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?
Are there any other tests that we think we should have?
I feel like from a high level, these are the dimensions you'd want to test:
Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?
I can't remember if those tests are run in the browser or not. If they are not, then that's your problem right there. In that case maybe the answer is to make them run in the browser, or maybe the answer is to introduce a lightweight abstraction that enables Rust to play the role of the JS callback.
If they are run in the browser then probably the issue was just the time to learn how to use the correct wasm_bindgen APIs to create a function to receive the results.
Sorry I don't remember any more.
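The "lightweight abstraction" idea could look something like this. A hedged sketch only: the trait and type names (`ScanReceiver`, `CollectingReceiver`, `run_scan`) are invented for illustration and don't exist in repc.

```rust
// Hypothetical sketch: abstract the scan receiver behind a trait so Rust
// tests can play the role of the JS callback. Names are illustrative.
trait ScanReceiver {
    fn on_entry(&mut self, key: &[u8], val: &[u8]);
}

// Test-side implementation that just collects results in memory.
#[derive(Default)]
struct CollectingReceiver {
    entries: Vec<(Vec<u8>, Vec<u8>)>,
}

impl ScanReceiver for CollectingReceiver {
    fn on_entry(&mut self, key: &[u8], val: &[u8]) {
        self.entries.push((key.to_vec(), val.to_vec()));
    }
}

// Scan code takes `&mut dyn ScanReceiver`; in wasm builds a second impl
// would forward each entry to the JS function via wasm_bindgen.
fn run_scan(receiver: &mut dyn ScanReceiver) {
    // Stand-in for an actual prolly-map scan.
    receiver.on_entry(b"a", b"1");
    receiver.on_entry(b"b", b"2");
}

fn main() {
    let mut rx = CollectingReceiver::default();
    run_scan(&mut rx);
    assert_eq!(rx.entries.len(), 2);
    assert_eq!(rx.entries[0].0, b"a".to_vec());
}
```

With something like this, the commented-out dispatch tests could assert on `CollectingReceiver` contents instead of wiring up a real JS function.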
On Thu, Oct 22, 2020 at 1:47 PM Phritz notifications@github.com wrote:
Regarding adding more tests...
Three cases are mentioned explicitly. Adding testing for sync is clear to me. Adding test for 'scan after write' and '!scan after drop' I think refers to the TODOs in test_index https://github.com/rocicorp/repc/blob/d56fc1cc0baa20c4d786355b6c7a4e192ee3c5b9/tests/wasm.rs#L403 in the wasm tests, correct?
In addition to the above, seems like we should have some unit testing for scan in repc. (I understand we have js tests, but for things like changing indexing we really want some tests in repc itself.) Dispatch tests seem like a good way to go given the complexity of testing it at a lower level. We have these tests which are commented out https://github.com/rocicorp/repc/blob/d56fc1cc0baa20c4d786355b6c7a4e192ee3c5b9/tests/wasm.rs#L444 which seems the obvious place to start. Can you tell me more about why we commented these tests out? Seems like it had to do with sending the results directly to a js function receiver, which may be tricky in the context of this test?
Are there any other tests that we think we should have?
Thanks, this is enough for me to get going on both fronts. Will surface if I need more input.
hitting pause and removing from milestone 2
It might be worth breaking out the remaining issues into their own bugs (where they aren't already) and closing this?
My main thing is that I want to ensure that we have measured and are happy with current performance.
@arv I think the only thing left before we can consider this work done for milestone 2 is that @aboodman said:
My main thing is that I want to ensure that we have measured and are happy with current performance.
Unclear to me what, if anything, is left to do on this front, or whether it's on your plate or mine. Do you have a handle on what's left performance-wise, if anything?
I'm also not clear what action to take there. Maybe add these perf tests?
I don't think we need the second two. The ones I'd like to see are:
It seems like this work needs to happen on the js sdk side.
We need some way to simulate a pull without connecting to server.
There is a mock fetch in the tests. We can use the same approach for the perf tests. We would have to figure out what data to use for the pull to make it somewhat realistic.
I feel like we might want to change the index tests to not use root as the JSON path. It is not realistic.
On Thu, Nov 19, 2020 at 8:24 AM Erik Arvidsson notifications@github.com wrote:
We need some way to simulate a pull without connecting to server.
There is a mock fetch in there already. We would have to figure out what data to use for the pull to make it somewhat realistic.
It's fine for it to not be realistic. The scan/put tests just use random 1k chunks or whatever. We can do the same thing. The reason it has to be random is so that idb doesn't compress it.
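The "random 1k chunks" approach could be sketched like this. A stdlib-only illustration: a real benchmark would likely use the `rand` crate; the xorshift PRNG here just avoids an external dependency while keeping the data incompressible and deterministic per seed.

```rust
// Sketch: generate incompressible ~1 KiB values for perf-test data. The data
// must be random so that IndexedDB doesn't compress it away and skew results.
fn random_chunk(seed: u64, len: usize) -> Vec<u8> {
    let mut state = seed.max(1); // xorshift state must be nonzero
    let mut out = Vec::with_capacity(len + 8);
    while out.len() < len {
        // xorshift64 step
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        out.extend_from_slice(&state.to_le_bytes());
    }
    out.truncate(len);
    out
}

fn main() {
    let chunk = random_chunk(0xDEAD_BEEF, 1024);
    assert_eq!(chunk.len(), 1024);
    // Random bytes shouldn't be a single repeated value.
    assert!(chunk.iter().any(|&b| b != chunk[0]));
    // Deterministic per seed, so benchmark runs are comparable.
    assert_eq!(random_chunk(0xDEAD_BEEF, 1024), chunk);
}
```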
OK, agreed. I don't know how much of an effect more complex JSON pointers have, but it's worth answering.
On Thu, Nov 19, 2020 at 8:26 AM Erik Arvidsson notifications@github.com wrote:
I feel like we might want to change the index tests to not use root as the path. It is not realistic.
So this was sort of lost in the rush to complete, but our current benchmarks show that adding 1 index halves population performance.
Compare the "indexes: 1" benchmark with the "indexes: 0" benchmark here: https://replicache-sdk-js-perf-git-perf-data.rocicorp.vercel.app/.
I don't remember this being so bad when I first prototyped things. I think we should investigate before calling this closed. Sorry @phritz :(.
(Or split off onto separate bug, whatever)
A big chunk of that dip seemed to come in at https://github.com/rocicorp/repc/releases/tag/v0.13.0 (https://rocicorp.slack.com/archives/CMQQ9EDML/p1605815364167300?thread_ts=1605811133.163100&cid=CMQQ9EDML).