we can put() values that are not json

phritz commented 3 years ago

Discovered when getting indexing working: replicache will happily let you put() values that are not json. It doesn't matter until we go to add an index, in which case the value doesn't parse. We should make a call what to do about this:

ensuring we can parse the value being put() as json before storing it seems to be the correct thing
another option is to just let them put() whatever they want and then silently skip indexing values that are not json, maybe with a log info?

aboodman commented 3 years ago

I think it's a feature that repc supports non-json data (think of blobs in the future).

This issue of indexing non-json data is a special case of a more general question: what happens when indexing fails? This can happen even with json data if you specify a path that isn't found in some of the data being indexed.

So really the options are:

Propagate indexing errors (including could-not-parse-as-json, could-not-find-indexed-path, indexed-path-is-wrong-type, etc) up to put() synchronously (e.g., the put fails if it can't be indexed by one of the matching indexes)
Silently ignore indexing errors
Allow puts that cause an indexing error to proceed, but print something to the console

I don't have a good feel for which of these is best. (1) is most conservative so I like that, but I could imagine it's a feature you can put() data that lacks a field being indexed. Also (1) is an easy way to kill sync. OTOH, maybe we just tell developers to normalize data such that this error doesn't happen.

What is your opinion?

arv commented 3 years ago

Something else we talked about (in person I believe) is to allow indexing over other data formats. The index definition could provide different types of index support for different data types such as Flatbuffer, BSON, Protobufs etc

phritz commented 3 years ago

I think it's a feature that repc supports non-json data (think of blobs in the future).

Sure I agree about the future, but that's not how replicache works right now. The putrequest takes a string and the database api the customer sees is defined to be from strings to raw json bytes. I think we agree that it's nice to work with a system that has guarantees you can bank on and that it's harder to work in a system that is described one way but doesn't actually work that way. My feeling is that if we say the database is from strings to json bytes that's what it should be. If that's true then we can go write features under that assumption. If that's not true then we are going to have to keep an underspecified set of ways replicache could work in our heads at all times, even though there is no present value to them. (Said another way: there should be a solid line describing the db and it should fit reality; there should not be a tentative dotted line that is subject to interpretation.)

There are so many unexpected ways this might manifest that it's hard to pick good examples but consider: if you can write not-json into the database locally then A) you can write values that we cannot push up to the data layer in args (bc args are json) and B) you can write values that the server cannot return to you in a pull (bc pull returns json). The json assumption is baked in pretty deep and I'm not sure the value of undermining it.

what happens when indexing fails?

I like the idea of 1 but worry it is too easy to blow things up. Maybe we could default to 3 but have a "strict" index bit that customers could set that treats errors as errors?

aboodman commented 3 years ago

The putrequest takes a string and the database api the customer sees is defined to be from strings to raw json bytes.

Can we be specific with the parties? I think that the customer (code calling the SDK) current provides native JS values (objects, arrays, and so-forth). Or put another way - decoded JSON. That is also what we send the customer back in the batch endpoint (right?) and ask them to provide in the client view endpoint (except in that case they encode it as a string).

I think we agree that it's nice to work with a system that has guarantees you can bank on and that it's harder to work in a system that is described one way but doesn't actually work that way.

Agreed. Just trying to figure out what those guarantees are.

My feeling is that if we say the database is from strings to json bytes that's what it should be. If that's true then we can go write features under that assumption. If that's not true then we are going to have to keep an underspecified set of ways replicache could work in our heads at all times, even though there is no present value to them. (Said another way: there should be a solid line describing the db and it should fit reality; there should not be a tentative dotted line that is subject to interpretation.)

Agree.

There are so many unexpected ways this might manifest that it's hard to pick good examples but consider: if you can write not-json into the database locally then A) you can write values that we cannot push up to the data layer in args (bc args are json)

OK

and B) you can write values that the server cannot return to you in a pull (bc pull returns json).

I think it returns a string currently (that is handled opaquely).

The json assumption is baked in pretty deep and I'm not sure the value of undermining it.

OK. Suppose I agree. What affect does that have on us now? I'm loathe to add a validation step to put()/sync() just to ensure data is valid json.

phritz commented 3 years ago

ok yes good to be specific

Can we be specific with the parties? I think that the customer (code calling the SDK) current provides native JS values (objects, arrays, and so-forth). Or put another way - decoded JSON. That is also what we send the customer back in the batch endpoint (right?) and ask them to provide in the client view endpoint (except in that case they encode it as a string).

i can't say about what the bindings expose but willing to believe it is native js objects. but i would not say that the other pieces use native js objects or decoded json.

upstream:
- mutations contain args as encoded json
downstream:
- the client view returns a map from string keys to encoded json
- the diffserver returns a patch whose ops contain stringified encoded json.

so seems true that if you write non-json values into replicache you can neither push them nor pull them.

The json assumption is baked in pretty deep and I'm not sure the value of undermining it. OK. Suppose I agree. What affect does that have on us now? I'm loathe to add a validation step to put()/sync() just to ensure data is valid json.

yeah i guess validation would be a pretty big performance hit. i dont know if there is any action required other than for the three of us to affirm:

replicache stores string key to (encoded) json values. we should not say that it does otherwise, though it could.
if a value in replicache is not valid json that is an error (in the sense that it is unexpected and someone should go look at what happened)
the bindings keep non-json from going into replicache

rocicorp / repc

we can put() values that are not json #251