googleapis / nodejs-datastore

Node.js client for Google Cloud Datastore: a highly-scalable NoSQL database for your web and mobile applications.
https://cloud.google.com/datastore/
Apache License 2.0
214 stars 106 forks source link

[Bug] Query with filter on an integer column. (JS has number not int and float) #754

Open vicb opened 3 years ago

vicb commented 3 years ago

1) Is this a client library issue or a product issue?

I guess a bug in the client library ("@google-cloud/datastore": "^6.3.0")

Environment details

Steps to reproduce

1) Create an entity in the cloud console with an integer column, named integer_column and set its value to 200.

2) Executes a query with a filter on that column:

    const query = datastore
      .createQuery(tableName)
      .filter('integer_column', '>', value);

    const [results] = await datastore.runQuery(query);

Expected behavior:

The correct results are returned if value is a valid numerical value, i.e. 123 or 123.456

Actual behavior

The correct results are only returned if value is 123. A float value (123.456) would return an empty result.

Workaround

Update the query to add Math.round().

    const query = datastore
      .createQuery(tableName)
      .filter('integer_column', '>', Math.round(value));

    const [results] = await datastore.runQuery(query);

Thanks!

stephenplusplus commented 3 years ago

Thanks for opening the issue! What would be an ideal solution on our side? Running Math.floor() on the float? Throw an error, specifying only an integer is expected?

vicb commented 3 years ago

The ideal solution is that .filter('integer_column', '>', value) works whatever the (numerical) value.

I mean 10 > 5.123 == true is something everybody expects to work, whatever the language/API.

Edit: The same way, I also expect .filter('float_column', '>', 10) to work and I don't have to type .filter('float_column', '>', 10.0)

stephenplusplus commented 3 years ago

The issue I'm seeing is that the API expects an integer value to compare against integers. It cannot compare floats against integers. So, if we're given a float of 123.456, we would have to lose precision to reduce it to 123. Of course, the results could never include a result where integer_column contains 123.456, so a result with 124 will come through in both cases, as that's the next possible value. What I'd be concerned about, is that we'd be adding code that could be misleading-- having the user assume it's possible to have float precision where they cannot. It may be best to continue sanitizing non-integer input in the application layer, to prepare it for the integer operation.

vicb commented 3 years ago

1) Again everybody expects 10 > 5.123 == true. The opposite is very misleading.

2) Thoughts:

So, if we're given a float of 123.456, we would have to lose precision to reduce it to 123

You have to be careful about the rounding here: value > 123.001 is equivalent to value > 123 but value < 123.001 is equivalent to value < 124

While it makes sense to round for >, >=, <= and < it is reasonable to say that 123 = 123.456 == false (i.e. do not apply rounding for equality).

An other question I would have is at what level the type is stored ? From the cloud console UI it seems like the type is store at entity level. i.e. 2 entity might have different types for the same column. type(entityA.some_col) == integer while type(entityB.some_col) == float Is this true ? If it is the API should offer a way to pass a float value otherwise you would have to do 2 queries (one with int and one with float) to do something like some_col > 123.001.

I honestly never asked myself the question as I never set the type of the column explicitly. I just saved entity using the API. I only noticed this strange behavior when doing some tests and creating entity via the console.

edit: fixed rounding for > and <

crwilcox commented 3 years ago

Note: Other language clients have similar behavior.

from google.cloud import datastore

client = datastore.Client()

kind = "test"
id = "test_id"
key = client.key(kind, id)
entity = datastore.entity.Entity(key)
entity['field'] = 'field_value'
entity['integer_column'] = 200
client.put(entity)

query = client.query(kind=kind).add_filter('integer_column', '>', 123.3)
results = list(query.fetch())
print(f"Query comparing an integer and float (none): {results}")

query = client.query(kind=kind).add_filter('integer_column', '>', 123)
results = list(query.fetch())
print(f"Query comparing an integer and integer (one result): {results}")
Query comparing an integer and float (none): []
Query comparing an integer and integer (one result): [<Entity('test', 'test_id') {'field': 'field_value', 'integer_column': 200}>]
BenWhitehead commented 3 years ago

Similar behavior in Java:

    @Test
    void numberSaving() {
      final Datastore ds = datastoreOptions.getService();
      final String kind = "valueTypeTesting";

      final String long_property = "long_property";
      final String double_property = "double_property";
      final String number_property = "number_property";

      KeyFactory txTesting = ds.newKeyFactory().setKind(kind);
      final Entity a = Entity.newBuilder(txTesting.newKey("A"))
          .set(long_property, 20)
          .set(double_property, 20.0)
          .set(number_property, 30.0)
          .build();
      final Entity b = Entity.newBuilder(txTesting.newKey("B"))
          .set(long_property, 10)
          .set(double_property, 15.0)
          .set(number_property, 18)
          .build();
      ds.put(a);
      ds.put(b);

      final EntityQuery qInt = Query.newEntityQueryBuilder()
          .setKind(kind)
          .setFilter(PropertyFilter.gt(number_property, 5))
          .build();
      final EntityQuery qDouble = Query.newEntityQueryBuilder()
          .setKind(kind)
          .setFilter(PropertyFilter.gt(number_property, 5.0))
          .build();

      final QueryResults<Entity> qIntRun = ds.run(qInt);
      LOGGER.info("qIntProto = {}", protoToString(getQueryProto(qInt)));
      while (qIntRun.hasNext()) {
        Entity next = qIntRun.next();
        LOGGER.debug("entityProto = {}", protoToString(entityProto(next)));
        if ("A".equals(next.getKey().getName())) {
          assertEquals(30.0, next.getDouble(number_property));
        } else if ("B".equals(next.getKey().getName())) {
          assertEquals(18, next.getLong(number_property));
        }
      }

      final QueryResults<Entity> qDoubleRun = ds.run(qDouble);
      LOGGER.info("qDoubleProto = {}", protoToString(getQueryProto(qDouble)));
      while (qDoubleRun.hasNext()) {
        Entity next = qDoubleRun.next();
        LOGGER.debug("entityProto = {}", protoToString(entityProto(next)));
        if ("A".equals(next.getKey().getName())) {
          assertEquals(30.0, next.getDouble(number_property));
        } else if ("B".equals(next.getKey().getName())) {
          assertEquals(18, next.getLong(number_property));
        }
      }
    }
qIntProto = { query { kind { name: "valueTypeTesting" } filter { property_filter { property { name: "number_property" } op: GREATER_THAN value { integer_value: 5 } } } } }
entityProto = { key { partition_id { project_id: "default-223119" } path { kind: "valueTypeTesting" name: "B" } } properties { key: "double_property" value { double_value: 15.0 } } properties { key: "long_property" value { integer_value: 10 } } properties { key: "number_property" value { integer_value: 18 } } }
entityProto = { key { partition_id { project_id: "default-223119" } path { kind: "valueTypeTesting" name: "A" } } properties { key: "double_property" value { double_value: 20.0 } } properties { key: "long_property" value { integer_value: 20 } } properties { key: "number_property" value { double_value: 30.0 } } }
qDoubleProto = { query { kind { name: "valueTypeTesting" } filter { property_filter { property { name: "number_property" } op: GREATER_THAN value { double_value: 5.0 } } } } }
entityProto = { key { partition_id { project_id: "default-223119" } path { kind: "valueTypeTesting" name: "A" } } properties { key: "double_property" value { double_value: 20.0 } } properties { key: "long_property" value { integer_value: 20 } } properties { key: "number_property" value { double_value: 30.0 } } }
crwilcox commented 3 years ago

I think the root of the problem is, at query filter creation time, the client has no knowledge of the type of the field being filtered. the type needs to be figured out from the type provided. So, as a user you can round. I don't think the client can do that without side effects.

vicb commented 3 years ago

Thanks for checking other clients.

I don't think the client can do that without side effects.

May be it should be done server side then ?

In the meantime it would probably be worth documenting this behavior. I think we can agree to say that this behavior is really unexpected ?

vicb commented 3 years ago

The concern is expressed in my former message is real.

Let's init a table with [0...9] / 2:

  for (let value = 0; value < 10; value++) {
    const key = datastore.key('TEST');
    await datastore.save({
      key,
      data: {
        value: value / 2,
      }
    })
  }  

You can go to the console and you will see that 0,1,2, 3, and 4 are integer 0.5, 1.5, 2.5, 3.5, and 4.5 are floats.

Sanity check:

  const [allEnt] = await datastore.createQuery('TEST').order('value').run();
  console.log(`allEnt -> ${allEnt.map(e => e.value)}`);

  // allEnt -> 0,1,2,3,4,0.5,1.5,2.5,3.5,4.5

Query with an integer value

  const [intEnt] = await datastore.createQuery('TEST').filter('value', '>', 3).order('value').run();
  console.log(`intEnt > 3 -> ${intEnt.map(e => e.value)}`);

 // intEnt > 3 -> 4,0.5,1.5,2.5,3.5,4.5

Ouch !

Query with float value:

  const [floatEnt] = await datastore.createQuery('TEST').filter('value', '>', 2.5).order('value').run();
  console.log(`floatEnt > 2.5 -> ${floatEnt.map(e => e.value)}`);  

  // floatEnt > 2.5 -> 3.5,4.5

Ouch again !

Maybe the server API is fine for typed language (I haven't checked the client) but it is definitely not for JS (and probably Python).

@stephenplusplus @crwilcox @BenWhitehead - I think we can agree to say that is behavior is bug prone and should be fixed ?

crwilcox commented 3 years ago

I am not sure I would consider this a bug, but more a result of us trying to make things ergonomic in libraries. Filters must specify, via proto, the value type that the filter is evaluating.

So, when you set a filter, it is converted to a proto, and the type is picked at that point. You can see this in @BenWhitehead response if you look at the qProto.

{
  query {
    kind {
      name: "valueTypeTesting"
    }
    filter {
      property_filter {
        property { name: "number_property" }
        op: GREATER_THAN
        value { double_value: 5.0 } 
      }
    }
  }
}

With that said, I understand how as a user it seems that, perhaps, the backend could recognize that an int_value: 10 is greater than double_value: 5.0

vicb commented 3 years ago

@crwilcox how is this not a bug.

JS has numbers not int or float. If you look at my example I save some numbers and because some happen to be ints then there is no way to get back a correct result without having to make 2 queries (1 for the integer values and 1 for the floating points).

So basically if you have a product entities with a price, some prices will be integer ($10) and some will be float ($9.99). You would need two queries + some post-processing (for orders and limits) to get a correct result if you want a list of products with price < $15.

So, when you set a filter, it is converted to a proto, and the type is picked at that point. You can see this in @BenWhitehead response if you look at the qProto.

This seems to indicate that the bug should be fixed server side. Again if you look at my examples the same code could result in entities having either an integer or a floating point field. So you do not know the type at query time - or more exactly you known that there could be 2 different types.

crwilcox commented 3 years ago

@vicb I think what might be missing is that we have to account for folks using other languages, accessing the database not just from node. For instance, since JS has no separationg of int, float, we can do one of a couple things.

As best I can figure, we have three options on the client side (see later for backend discussion):

1) The Node.js Client always converts number to int in datastore

2) The Node.js Client always converts number to float in datastore

3) Current Behavior: Through inference, decide if the given number is best reflected by int, or float value, within the database.

As for backend, I agree it seems this may be addressable. A discussion could be started around allowing the backend to filter across disparate types, but I am not certain this would be straightforward, but there is no reason not to have that conversation. Though as that is no longer about the Node.js client, I would ask we continue that discussion on the appropriate issue tracker with the backend folks if possible.

If you believe I have missed an option for the client I would be happy to discuss it further, it is quite possible I am not thinking outside the box enough 😄

vicb commented 3 years ago

Hey @crwilcox,

I would add the following the client side options (1-3 below refer to the options in your message).

1) I am not exactly what you mean here. If you mean remove any support for floating point from the Datastore, I don't think this is a viable option. If you mean converting number to int for queries (as suggested by @stephenplusplus above) then it would still not solve the floating point issue.

2) Treating all number as floating point. It could work with JS because ints are encoding on 53bits only. Not sure about other dynamic langage. But there would still be a few problems on top of interoperability:

3) Current Behavior As you noted ("sticking to int or double for a value. I understand that this isn't a node concept.") and as far as I know there is no way to force a number to be a float in JS (only integer can be forced as far as I know).

4) Maybe one solution would be to change the API to have the user specifying the type. But again it will break legacy DB that have mixed integer/float for the same field.

In the end I don't really see a way to fix that on the client side.

As for backend, I agree it seems this may be addressable. A discussion could be started around allowing the backend to filter across disparate types, but I am not certain this would be straightforward, but there is no reason not to have that conversation. Though as that is no longer about the Node.js client, I would ask we continue that discussion on the appropriate issue tracker with the backend folks if possible.

It sounds like a great idea please include me in the discussion.

Thanks !

vicb commented 3 years ago

Reading the docs, it looks like this behavior is actually documented

TL;DR Integers are sorted before FPs.

I think this makes working with FP values in JS (and probably other languages) 1) hard - you would have to issue at least 2 queries to get all numerical values > or < to a limit (one for int, one for fp), 2) inefficient - querying for fp_colum < value will dump all the integer values. Similarly for int_colum > value dumping all the fps.

crwilcox commented 3 years ago

Adding here after chatting a bit further. Because of the inference of the client, it is hard to only insert doubles. If you want to enter values 1,2,3.99, there isn't a way, through this method, to make sure these are all doubles.

At the same time, I think users can send the value as a custom type: https://github.com/googleapis/nodejs-datastore/blob/master/src/entity.ts#L62

This would ensure the typing of the column. I wonder if there shouldn't be a warning on Number type autoconversion? In the case that you only ever use ints, it is probably fine, but it seems difficult to avoid doubles with .00 fractional.

crwilcox commented 3 years ago

@vicb would it be fair to characterize the status of this as the following:

As a user you would like to see the untyped entries to not be the preferred way via node as it is difficult (not feasible) to keep type consistency, for example the field values (1.99, 3.99, 4.0 (which will be seen as an int)).

By not preferred I think there are a few solutions

  1. Updates to the product docs to draw attention to this.
  2. Add a client side merge of these, or an explicit client side "or". This could be a bit heavy, and is double the queries.
  3. Add a warning to be emitted by the methods (puts where data includes the auto conversion, queries where we have auto conversion).
  4. deprecate the auto conversion.

The final option is the most extreme to be sure and would mean a major version revision.

What do you think is necessary here as a user? Also, seeing as you seem to have done a good deal of homework on this, are there areas you can point us to in the docs where you think improvements should be made?

vicb commented 3 years ago

Yes so the whole problem is that in JS Number.IsInteger(4.0) === true.

This basically makes the main API of datastore unusable for float numbers as you can not rule out some will be considered ints.

// The behavior of this is counter intuitive as soon as you query the value field
datastore.save(
  key,
  data: {
    value: 1.23
  }
};

// The behavior of this is counter intuitive when you deal with a float field
// For the same reason limit and order have counter intuitive behaviors.
datastore.createQuery('Product').filter('price', '<', 1.23);

To answer your points:

1) IMO updating the docs is not enough, because:

That being said, if this should be documented, there should be some examples to make the counter intuitive behavior clear.

2) The only warning I can imagine would be emitted every time a user saves a numerical value. It might be useful when there is a clear path to remove the current limitation but I am not sure if it would be helpful before that.

3) If you decide to fix this issue on the client side (also applicable to other dynamically typed languages) then I think deprecating and removing the auto-conversion is a required step. It means that every single client application must be updated (as long as they are dealing with numbers). Note that it would still not solve the problem with values created before the major version - all DB would need to be migrated if users are impacted by the limitation.

Updating the query code would be quite straightforward:

// current unsafe API
const query = datastore.createQuery('Product').filter('price', '<', 4.0);
// next safe API
const query = datastore.createQuery('Product').filter('price', '<', double(4.0));

However updating your entities would be much more of a pain:

{
  price: 4.0
  qty: 10,
  id: 10,
  sale: 0.50,
}

You basically have to wrap every single numeric field in int() or double().

A few comments/ideas:

1) Instead of having to manually convert all entities in order to wrap numeric field, it might be good to specify a schema:

{
  price: double
  qty: int,
  id: int,
  sale: double,
}

The schema could be easily passed in datastore.save(), same way the current API supports excludeFromIndexes. It might feel a little awkward to pass the schema to datastore.createQuery(...). So maybe a good solution would be to create a Schema registry that you would init once ? Maybe as an option to new Datastore() ?

2) I still think a server side fix would be much easier on the users.

I am sorry not to have a good turnkey solution to propose right now. It would probably be good to investigate more deeply different solutions, their impact on the client code and their cost/feasibility on the API or server side.

crwilcox commented 3 years ago

I think the way forward is to start with exposing, in docs, the double() and int() representations, maybe not highlighting inference for numbers. I think this would be needed on save and query docs.

  // current unsafe API
  const query = datastore.createQuery('Product').filter('price', '<', 4.0);
  // next safe API 
  const query = datastore.createQuery('Product').filter('price', '<', double(4.0));

Regarding

I still think a server side fix would be much easier on the users.

From a server standpoint the sort order is defined already. There is potential to add an 'OR' query that would essentially be a merge query. That said, even with that, the way inference is happening that wouldn't solve this interface problem on it's own, since that is really a double() or int()

meredithslota commented 2 years ago

We're considering whether an "OR" query can be added so I'm converting this to a feature request as one of the suggested outcomes of this.