Query validation error - gt and lte create erroneous filter expression

rcoundon commented 1 year ago

Describe the bug When using gt or lte ElectroDB adds an erroneous filter expression which can cause the DynamoDB client to throw a validation error because filter expressions cannot contain SK attributes.

ElectroDB Version 2.5.1

ElectroDB Playground Link ElectroDB Playground

Entity/Service Definitions All in Playground

Expected behavior No filter expression to be added

Errors

Filter Expression can only contain non-primary key attributes

rcoundon commented 1 year ago

Note - the lt and gte methods don't do this.

tywalch commented 1 year ago

Hey Ross!

I was able to recreate this issue when I mapped by index fields to attributes (like in this playground here). In the instance of your error, is your model doing similar mappings -- the index name is also the attribute name? If so, could you share your model and/or one that maps attributes and keys in a similar manor?

Expected behavior No filter expression to be added As to the expected behavior, do any of these capture why you don't expect a filter expression:

Your did not explicit call for a filter so you don't expect for there to be one (more control over when filters are used)

A filter in this case throws, so ElectroDB should be smart enough not to add one (smarter software, less hassle)

Assuming this is because of the attribute names mirroring the index field names, I definitely need to figure out a good plan for this case. Bottom line is that Electro shouldn't make invalid parameters like this 👍

rcoundon commented 1 year ago

Hi Ty

This is what the Entity looks like:

export const NOTIFICATION_SCHEMA = new Entity(
  {
    model: {
      entity: 'Notification',
      version: '2.0.0',
      service: 'aa',
    },
    attributes: {
      datasetId: { type: 'string', required: true },
      activityId: { type: 'string', required: true },
      leadentId: { type: 'string', required: true },
      apptDate: { type: CustomAttributeType<DtgIsoDate>('string'), required: true },
      createTime: { type: CustomAttributeType<DtgIsoFull>('string'), required: true },
      ruleName: { type: 'string', required: true },
      ruleType: { type: CustomAttributeType<RuleType>('string'), required: true },
      targetAddress: { type: 'string' },
      targetTime: { type: CustomAttributeType<DtgIsoFull>('string'), required: true },
      requestTime: { type: CustomAttributeType<DtgIsoFull>('string') },
      requestId: { type: 'string' },
      requestAttempts: { type: 'number', default: 0, required: true },
      deliveryStatus: { type: CustomAttributeType<DeliveryStatus>('string'), required: true },
      deliveryMethod: { type: CustomAttributeType<DeliveryMethod>('string'), required: true },
      deliveryTime: { type: CustomAttributeType<DtgIsoFull>('string') },
      failureReason: { type: 'string' },
      messageText: { type: 'string', required: true },
      messageParts: { type: 'number' },
      siblingKey: {
        type: 'map',
        properties: {
          ruleActivityReference: { type: 'string', required: true }, // HASH key
          ruleInstance: { type: 'string', required: true }, // RANGE key
        },
      },
      pendingActivityReference: { type: 'string' },
      pendingMethod: { type: 'string' },
      ttl: { type: 'number', required: true },
    },
    indexes: {
      primary: {
        pk: {
          field: 'ruleActivityReference',
          composite: ['datasetId', 'activityId'],
          template: ACTIVITY_REFERENCE_TEMPLATE,
          casing: 'none',
        },
        sk: {
          field: 'ruleInstance',
          composite: ['ruleName', 'createTime', 'deliveryMethod'],
          template: '${ruleName}#${createTime}#${deliveryMethod}',
          casing: 'none',
        },
      },
      byPendingMethod: {
        index: 'GSI1_pending',
        pk: {
          field: 'pendingMethod',
          composite: ['pendingMethod'],
          casing: 'none',
        },
        sk: {
          field: 'targetTime',
          composite: ['targetTime'],
          casing: 'none',
        },
      },
      byPendingActivityReference: {
        index: 'GSI2',
        pk: {
          field: 'pendingActivityReference',
          composite: ['pendingActivityReference'],
          casing: 'none',
        },
        sk: {
          field: 'ruleName',
          composite: ['ruleName'],
          casing: 'none',
        },
      },
    },
  },
  {
    table: process.env.DYNAMODB_TABLE_NOTIFICATIONS,
  },
);

and we're calling

model.query
      .byPendingMethod({
        pendingMethod: method,
      })
      .lte({
        targetTime: now,
      })
      .go({ limit, ignoreOwnership: true });

So, yeah, if I've understood you correctly, the index field does match the attribute name.

tywalch commented 1 year ago

Damn, this makes it a bit more difficult because the current design leans into dynamo to filter to limit post processing in app code. I'll need to look at this a bit deeper on the backside to see what options make sense 👍

PaulJNewell77 commented 1 year ago

Hi Ty, is this fix released and is it in the version that ElectroDB Playground uses? I notice that the Playground link still exhibits the behaviour that is described in this issue - unless I'm missing something?

tywalch commented 1 year ago

Hi Ty, is this fix released and is it in the version that ElectroDB Playground uses? I notice that the Playground link still exhibits the behaviour that is described in this issue - unless I'm missing something?

I will check this today, it is possible there was an issue and the playground is lagging behind latest 👍

tywalch commented 1 year ago

@PaulJNewell77 I plugged in the entity and query @rcoundon shared into the playground (no changes) and it looks correct to me: here. The fix takes into account whether or not the sortkey composite attribute is also the field name of the index. In the first playground link shared in this issue, that wasn't the case, so I would expect the parameters generated for that example to remain the same.

Let me know if you see something different or there's still more that needs to change for your use case 👍

tywalch commented 1 year ago

I reopened this issue until things are confirmed to work for you 👍

PaulJNewell77 commented 1 year ago

Hi, the fix does work for us in our use case although I there may still be a related issue. Here is a Playground link which is a tweek of the one in the original description. There are four queries in there: gte, gt, lt, lte. The gt and lte queries both have filter expressions whereas gte and lt do not. I don't think the filter expressions are needed - although I think they may still be valid.

PaulJNewell77 commented 1 year ago

Further to this, here is a Playground link with another slight tweak, adding 'team' to the sk of the index we are querying. Now the queries do need filter expressions, but the gte and lt queries still do not have them, meaning they don't reference the queried date '2021-01' at all, so I assume they won't work.

As always .... I could be missing something.

tywalch commented 1 year ago

Hi, the fix does work for us in our use case although I there may still be a related issue. Here is a Playground link which is a tweek of the one in the original description. There are four queries in there: gte, gt, lt, lte. The gt and lte queries both have filter expressions whereas gte and lt do not. I don't think the filter expressions are needed - although I think they may still be valid.

Yes, this is expected and deliberate 👍 These filters help trim the edges of what comes back so that it doesn't have to be done in code. I could go into more detail, but I'll put more focus on your other comment.

Further to this, here is a Playground link with another slight tweak, adding 'team' to the sk of the index we are querying. Now the queries do need filter expressions, but the gte and lt queries still do not have them, meaning they don't reference the queried date '2021-01' at all, so I assume they won't work.

As always .... I could be missing something.

This won't be satisfying because candidly I'm trying to remember the exactly why, but I long while ago I removed the constraint to throw when a composite was provided out of order. The filters added to gte and lt we're done more to assist with partial composites (precisely like 2021-01) but it's generally been the case I've tried to avoid adding any gratuitous filters, leaving that to users.

Adding a filter may make sense, since the user's intent seems clear, but it could also result in unexpected behavior. Also it does add some weird cases around collections, but we can put those aside of the moment.

I will say that gte gt lt and lte with >1 composite sort key and partial values can be a bit of a can of worms. I'd love to get your thoughts on the following scenarios:

Using this example, a user says

tasks.query
  .backlog({ project })
  .gt({
    team: "blue",
    closed: "2021-01",
  })
  .go();

Which of the following is their intent?

Items closed in February for the blue team?
Items closed in February for teams that lexicographically sort greater than (not including) the blue team?
Items closed in Feburary and for the blue team but then any dates for any teams that lexicographically after (not including) blue team.
Items closed lexicographically after "2021-01" (which will include all January dates) for the blue team?
Items closed lexicographically after "2021-01" (which will include all January dates) and for teams that lexicographically after (not including) the blue team?
Items closed lexicographically after "2021-01" (which will include all January dates) and for the blue team but then any dates for any teams that lexicographically after (not including) blue team.
Items that lexicographically sort after (not including) the key that can be made using just { team: "blue", closed: "2021-01" }, which will return all records for the blue team and all January dates?

When providing this:

tasks.query
  .backlog({ project })
  .gt({
    team: "data",
  })
  .go();

Which of the following is their intent?

Items that have teams that lexicographically sort after (not including) the data team?
Items that have teams that lexicographically sort after (not including) the key that can be made using just { team: "data" }, which will return all data team records?

What if we actually have four teams: data, datascience, database, platform:

tasks.query
  .backlog({ project })
  .gt({
    team: "data",
  })
  .go();

Which of the following is their intent?

Items that have teams that lexicographically sort after (not including) the data team, returning then datascience, database, and platform team items
Items that have teams that lexicographically sort after (not including) any team that starts with data, returning then only platform team items
Items that have teams that lexicographically sort after (not including) the key that can be made using just { team: "data" }, which will return all "data" team items?

In the example where closed is provided where team is not, resulting in an empty sort key:

tasks.query
  .backlog({ project })
  .gt({
    closed: "2021-01",
  })
  .go();

If a filter is added in the above case, something like closed > "2021-01", would it be surprising to you that the items returned would include January dates?

Let me know your thoughts on the above scenarios. How folks use the sort key operators is a bit of mystery to me, so I'd love to get your take and initial intuitions. Ideally this thread will also serve as a way others can weigh in as well.

EDIT: I've edited this comment a few times now for clarity, hopefully done now.

PaulJNewell77 commented 1 year ago

Apologies for the delay. My mental note to return to this finally resurfaced and I'm glad it did as there's lots of important things to unpack here. I think I'll tackle the cases in the order of easiest to hardest (from my point of view).

So firstly, I think most developers think about lexicographical sorting in the same way. And so, if they queried for strings greater-than data from the set data, datascience, database, platform, they would expected to be returned all bar the first (and if they queried using greater-than-or-equal-to, they would get all of them). If you find yourself needing a query like 'give me everying after values that begin with some string' then maybe you've designed your data wrong. (I feel there needs to be a liberal sprinking of IMHOs throughout all of this 😊)

Following on from that to talk about sorting of dates. I totally feel that pure lexicographical sorting is the most expected outcome. As such, greater-than 2021-01 would include all the dates in Jan. Similarly, for date/times, I expect greater-than 2021-01-01 to include all the date/times on the 1st Jan and thereafter. I must admit I noticed the 'incrementing' approach used by ElectroDb before but didn't realise the implications until I saw this example. That is, when using .gt with say 2021-01, ElectroDb creates a key condition of greater-than-or-equal-to 2021-02 rather than greater-than 2021-01. Is the reason for this based on assumed user intention, rather than technical? For me, it wouldn't give the result I expected.

Finally, coming to compound sort keys, I must admit I've never had cause to use one. I would say that if you've designed a GSI with a sort key of the form ${team}#${closed}, then you know you can only sort by closed date within a specific team, and that if you wanted to do something different then you need a different GSI. Having said that, I don't think there is any obvious common interpretation of your query:

tasks.query
  .backlog({ project })
  .gt({
    team: "blue",
    closed: "2021-01",
  })
  .go();

Something like this might be more intuitive:

tasks.query
  .backlog({ project })
  .eq({
    team: "blue",  
  })
  .gt({
    closed: "2021-01",
  })
  .go();

I would expect this to have the intent of your number 4 option:

Items closed lexicographically after "2021-01" (which will include all January dates) for the blue team?

I hope that makes some sense and is of some use. Be interested to hear your further thoughts.

PaulJNewell77 commented 1 year ago

Further to this I've been playing a bit more and have to return to my last question, but with a different example. This query:

tasks.query
  .projects({ team: 'blue' })
  .gt({
    project: 'p1'
  })
  .go();

Results in this:

{
    "TableName": "your_table_name",
    "ExpressionAttributeNames": {
        "#project": "project",
        "#pk": "pk",
        "#sk1": "sk"
    },
    "ExpressionAttributeValues": {
        ":project0": "p1",
        ":pk": "$taskapp#team_blue",
        ":sk1": "$tasks_1#project_p2"
    },
    "KeyConditionExpression": "#pk = :pk and #sk1 >= :sk1",
    "FilterExpression": "#project > :project0"
}

So, I've asked for projects > 'p1' but I get projects >= 'p2'. These are not equivalent. Any string beginning 'p1' (such as 'p1a' will get returned by the former but not the latter. I'm not sure why it's assumed that it's the latter that's intended. (The equivalent assumption is made for lte).

rcoundon commented 11 months ago

Sorry to bump @tywalch - just in case you hadn't seen this - the behaviour of this might trip some people up

tywalch commented 11 months ago

Thanks @rcoundon, I have been mulling this over but I wanted to avoid making a rash change; the removal of this behavior outright is [obviously] a breaking so it's no small change. It has been a very busy time on my end, and it has been hard to give this the time needed to think about it thoroughly. Any detailed thoughts you have on the philosophies you'd expect would be a huge help. I will try to put aside some time soon to write on this topic in this thread.

rcoundon commented 11 months ago

Thanks @tywalch. Totally understand, and I didn't mean to put pressure on you.

I'll have a chat with Paul tomorrow and we'll gather some thoughts to help out

tywalch commented 11 months ago

Thanks @tywalch. Totally understand, and I didn't mean to put pressure on you.

No, no! I don't feel pressured, I just mention it because I want to be transparent. I really appreciate how much you both have helped push the library to where it is today. It is so awesome you guys take the time to communicate your use cases and actually want to see things improved!

andrestone commented 3 months ago

IMHO, FilterExpressions should be handled by the user whenever possible. I also think users should use the query APIs aware of the fact that the composite SK's need to be sequentially defined and follow lexicographical sorting. I believe any abstraction that deviates from this comes with a high cost.

However, it really looks like this is as breaking as it gets. So I think these are some options:

a) Release a major version with these changes. b) Add other operations (or options to the operations) that support all key condition expressions as they are. c) Add a escape hatch / lower level API for queries that allows leveraging the schema info in order to construct all queries possible.

tywalch commented 3 weeks ago

Howdy @rcoundon and @PaulJNewell77 👋

I just published 3.0.0, with the significant breaking change being that I removed the character shifting and filtering logic by default.

rcoundon commented 3 weeks ago

Howdy @rcoundon and @PaulJNewell77 👋

I just published 3.0.0, with the significant breaking change being that I removed the character shifting and filtering logic by default.

Thanks for the heads up. I'd noticed the discussion. So filtering needs to be done in userland now?

tywalch commented 3 weeks ago

Yes, by default attribute filters won't be added here. There is a compare: "attributes" option that will add filters (but won't perform any character shift). You can see the impact on parameters here: https://electrodb.fun/?#code/PQKgBAsg9gJgpgGzARwK5wE4Es4GcA0YuccYGeqCALgUQBYCG5YA7llXWAGbZwB2MXGBDAAUKKwBbAA5QMVMAG8wAUT5V2ATzABfbhiiSwAIkRwAxlQMwARsYDc481D64FVBjYSkAvCc1QqBgA+h5ecMF8DJJwDk4ubqoyCFCaJGB+fHAsqupaABSiYEpFxWCSsIgAXCVlZfwaVJo1xirJqSTG+KV1xBgAbljmcC0Ayh5cXACC0tJdPWX9mLhYLi0AjMYLOt11DFRWWDaoVHg1igvFcgDmDHxYAF77q3wAkjDnl2VN0iMmbtg+NctnVijsvnB2mk4O9PqDvppfi0AVggSDQeD4QZvHD4WAfn8ANrGeD9Lr-firDDk4xuBiTGnSQHmLDSBgIYwAXTADCEzlcVF2GKFdToWHIMCmVFx8IJNQA5CigfKRWUdNtVaj4AAPM61UF0yYy0HSADWxvhzhkUBWpxqxJud0ezxc7y5qvhXBwCA+JjN6PhmLxuHN+rxYCtsltROMYolUppkOkKWhbs5HtBXsQvtppoDGK+6rqRd0QuUYW8ulEAEpHKJ+YlHfcnhpXTAMmB5Td1vLHA2FHG4JKFH55QAmAAMk4AtBP1rOe3W2smOnAAHRoTCaNeGrj5ZRN52tt7tnTVtfXKj7sCD4e6c-XKDXyNMP7y-pj+X3xzLlMkDfoBg267teh4ti87z3heV7KLeUpQY+z6GGy5AKvshzHKcuBfmeP5Qv+m5ATuEx7geGC3M2LonghME3uKQ7wWeF5PsoL6oZ2ppwJo2HfqIv6rgBW7EfSpFgGBVGQUxl7XnBChSU+tZAA

tywalch / electrodb

Query validation error - gt and lte create erroneous filter expression #228