hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0
31.18k stars 2.77k forks source link

[v2.0.7-cloud.2] order_by not working #7453

Closed marcel-happyfloat closed 3 years ago

marcel-happyfloat commented 3 years ago

Just yesterday I fixed a bug, where my order of participants changed randomly while having a subscription running.

My issue back then was the exact same "created_at" date for some of the participants (initial participants). So I fixed my issue by adding another order_by column, e.g. order_by: {created_at: asc, user_id: asc}.

But today the query started to randomly update and change the order of my participants again... I saw that my cloud version got upgraded this night from 2.0.5 to 2.0.7! I also tested to just oder by user_id now and this also does result in some randomness, which must not happen.

Maybe important information: I'm ordering on-top of a custom query function, but for some standard relationship.

rikinsk commented 3 years ago

This is due to a behaviour change introduced in 2.0. See the Breaking changes section in the release notes here for details.

order_by expects an array of entries and not an object. You would need to update your query to use order_by: [{created_at: asc}, {user_id: asc}] instead to get the expected behaviour.

Do let us know if this was indeed the issue

edit: the actual issue was unrelated and indeed a bug in v2.0.7-cloud.2

marcel-happyfloat commented 3 years ago

Ah, I read about that. But unfortunately it doesn't solve that problem :(. Also it was working in 2.0.5-cloud.x before.

marcel-happyfloat commented 3 years ago

⚠️ Also my chat messages are jumping randomly now... something is completely broken in the new cloud version! ⚠️ They have been working fine forever, just order_by created_at.

rikinsk commented 3 years ago

@happyfloat just to confirm, you tried the query with the order_by passed as an array and that also resulted in inconsistent results order? just to reiterate, there is no guarantee that the query you initially mentioned will work. If it did work in 2.0.5 it could have been a random coincindence.

rikinsk commented 3 years ago

I'm ordering on-top of a custom query function, but for some standard relationship.

can you share a bit more details on this. a few more details on what your exact query and your schema is might help us figure this out quicker

marcel-happyfloat commented 3 years ago

I'm quite busy today with another project, but I try to investigate / give examples later today.

Yes, I tested it with the array and it didn't work, but it also doesn't work for a single order column now too... Three different subscriptions and all started to jump around last night (presumably after the cloud update to 2.0.7). Still happening now. I'm glad I did postpone my production release by some weeks :D.

Will investigate!

lukeomalley commented 3 years ago

We are experiencing the same issue on v2.0.7-cloud.2.When passing either an array of objects or a plain object the results are not sorted.

@rikinsk Here is our query that is not respecting the order_by clause:

subscription SUBSCRIBE_TO_COMMENTS_BY_IDENTITY_ID(
    $identityId: Int!
    $limit: Int!
    $logTypes: [String!]!
  ) {
    Comments(
      where: {
        identity_id: { _eq: $identityId }
        author_id: { _is_null: false }
        ActivityLogType: { type: { _in: $logTypes } }
      }
      order_by: [{ created_at: desc }]
      limit: $limit
    ) {
      id
      text
      timestamp
      created_at
      updated_at
    }
  }
24601 commented 3 years ago

I work on the same team as @lukeomalley and another question we have around this is can we disable/disallow/opt-in/opt-out of auto "upgrades"?

For production environments, version locking is critical for reliability. Bugs happen, even the best tested software, it is not a very desirable function to silently change your tenants' version even if you think it's an "upgrade" or a "fix". Usually it works and isn't noticed but this issue highlights the problem with that approach.

I think the only even remotely reasonable exception for changing anything about the code of our runtime would be a very, very severe, massive security or data integrity zero-day that had data loss or theft or security escalation implications, and even then, we should be notified of that upgrade immediately, and auto-upgrade should NOT be the default by any means.

24601 commented 3 years ago

I'm ordering on-top of a custom query function, but for some standard relationship.

can you share a bit more details on this. a few more details on what your exact query and your schema is might help us figure this out quicker

@rikinsk We have found that this problem only manifests itself when there is a nested data type in the selection set, and the resultant SQL involves a join.

I am not able to share the GraphQL queries that show this nuance or the compiled SQL from Analyze publicly, but am happy to add them to our support ticket to ensure proper handling of our data (request 647 is the open ticket).

*Not all selection sets with nesting cause problems, only specific ones.

@happyfloat - do you have a nested selection set and if so, does removing it cause the problem to go away?

24601 commented 3 years ago

@rikinsk - This is not a coincidence. Suggesting it is such is, at best, deflecting responsibility for a change that's broken your paying production customers' systems.

While I appreciate where you are coming and I do believe your intentions are good & helpful on the below (the sentiment that "we corrected a random occurrence so this isn't a problem/bug") that you guys "corrected" it post 2.0.5, the core issue remains and is not a "random coincidence" at all, nothing in production software should be a coincidence, it should be deterministic, replicable, and predictable. If it is not, that is a defect without question or exception, and it is a semantic debate that to suggest otherwise : when your production customers create software on your platform/against your API/platform/DSL and then, even if it is "faulty" but your release handles it properly, and a change then breaks that result, it's still a breaking change. The fact that prior to 2.0.5 something worked that maybe "shouldn't have" that was then changed post 2.0.5 is not the customer's problem per se, because we verified our software was functioning properly against a given version (whether that was Hasura's intent or not doesn't make it a coincidence, it makes it a poorly planned change).

Yes, this is not particularly making me happy, which is of no consequence, what IS of consequence is that are you seriously suggesting this is Hasura's approach to change control? That is not acceptable in production/enterprise systems, at least none I have worked on and not acceptable answers for production systems that I will recommend we continue to use.

Our production environment is heavily committed to Hasura, so we can't and don't want to leave, and I believe Hasura is easily the best GraphQL engine out there (for PG or otherwise), but if Hasura's support line on this is "this is a coincidence, it's your problem" rather than immediately remedying this with a downgrade path and an immediate option to opt-out of auto "upgrades" (the quotes are because it isn't an upgrade when it creates production problems for us, it's a massive downgrade, actually), we have a serious division on how we see the idea of "reliability" and change control that we need to understand, and a divergent definition of paid support.

Can you shed light on Hasura's official stance on version upgrades, version locking, and production system reliability commitments? We are paying customers in both production and development, and I can provide troubleshooting data that shows that the issue is, indeed a defect in Hasura code from 2.0.5 and that our queries are built to-spec compliant (arrays in the order_by, etc) and they pass validation, too, yet do not provide properly sorted results. I have "working" compiled SQL and "broken" compiled SQL that shows the difference and should be helpful to your engineering team.

I am happy to help resolve this, but please don't try to write this off as a "random coincidence", let's call it a spade a spade, this is a bug Hasura introduced into our (and seemingly others') production instance(s) without our opt-in to the new version, and we need to ensure this will never happen again, OK?

Not that a bug won't happen again, this stuff happens in complex software systems (hell, we have plenty of bugs! We are far from perfect!), we get it, but that we are never auto-upgraded ever again at the very least.

@lukeomalley's above query does not have the order_by in an array (which, yes, is out of spec apparently) but even in our updated code that uses an array of sort objects as specified: the issue remains, the array vs property map distinction is a red herring here that does not have any bearing on if this error occurs, although would certainly make sorting non-deterministic so I fully support the change to an array for sorting, but that brings up another point, why isn't query validation on query/subscription submit erroring out on order_by clauses that aren't compliant to spec with the "order_by must be an array" change? Seems to me that would solve that issue quite quickly, but, again, that's a separate tangent, as this issue occurs even when order_by is properly formed to the new 2.X spec. A user-friendly implementation of this logic would be perhaps something like this:

Is order_by data an array? --YES: We're good, continue --NO: -----Is it an SQL-style string? ---------YES: Cool, let's just parse it as a convenience ---------NO: Error telling user life is different in 2.x, check the docs! -----Is it a property map with precisely ONE property with a string value of DESC or ASC? ---------YES: Cool, let's just make wrap it in an array for them and return with a warning message kind of like:

{
    "data": {
        "XXXXX": XXXXXX
    },
    "extensions": {
        "warnings": [
            {
                "message":"Your order_by value was not an array, in 2.X all order_by clauses must be arrays (https://xxxxxx/docs). We've wrapped your value in an array as a courtesy. This behavior may break soon."
            }
        ]
    }
}

---------NO: Throw an error.

@happyfloat just to confirm, you tried the query with the order_by passed as an array and that also resulted in inconsistent results order? just to reiterate, there is no guarantee that the query you initially mentioned will work. If it did work in 2.0.5 it could have been a random coincindence.

rikinsk commented 3 years ago

@24601 I guess that was a poor choice of words from my end based on my understanding of the issue. I should have been more clear. Apologies.

What I actually meant by the "random coincidence" was that as the GraphQL spec mentions that entries in an object are unordered, when you pass the following argument order_by: {created_at: asc, user_id: asc} the choice of which field gets picked first can be random. In some instances the created_at will be used to sort first but it is possible in the next iteration user_id might be used to sort first. According to the spec there is no guarantee of consistency here. As this change was introduced since v2.0.0, I meant that possibly while using v2.0.5 created_at was probably getting picked first as a random coincidence. Hope this clarifies what I meant.

24601 commented 3 years ago

@rikinsk - understood, I make the same kludgy word choices and sure I will do the same in the future sometime, too! All good.

I really want to get this issue fixed, and this isn't about an array vs property map structure issue, that is a red herring. This is a bug, it has nothing to do with arrays, I can replicate this issue in the console using order_by as an array.

How do we resolve this?

@24601 I guess that was a poor choice of words from my end based on my understanding of the issue. I should have been more clear. Apologies.

What I actually meant by the "random coincidence" was that as the GraphQL spec mentions that entries in an object are unordered, when you pass the following argument order_by: {created_at: asc, user_id: asc} the choice of which field gets picked first can be random. In some instances the created_at will be used to sort first but it is possible in the next iteration user_id might be used to sort first. According to the spec there is no guarantee of consistency here. As this change was introduced since v2.0.0, I meant that possibly while using v2.0.5 created_at was probably getting picked first as a random coincidence. Hope this clarifies what I meant.

rikinsk commented 3 years ago

This indeed seems to be a bug. We'll investigate and get back to you soon.

24601 commented 3 years ago

@rikinsk - also, on the issue of the Array vs. Property Map order_by value, the query validator should throw an error or at least a warning (using the graphql extensions object as I highlighted above) to users who submit anything but an array.

Taking malformed input and providing back non-deterministic return data is another bug, IMO. Separate issue, should I open another one for this?

@lukeomalley's above query does not have the order_by in an array (which, yes, is out of spec apparently) but even in our updated code that uses an array of sort objects as specified: the issue remains, the array vs property map distinction is a red herring here that does not have any bearing on if this error occurs, although would certainly make sorting non-deterministic so I fully support the change to an array for sorting, but that brings up another point, why isn't query validation on query/subscription submit erroring out on order_by clauses that aren't compliant to spec with the "order_by must be an array" change? Seems to me that would solve that issue quite quickly, but, again, that's a separate tangent, as this issue occurs even when order_by is properly formed to the new 2.X spec. A user-friendly implementation of this logic would be perhaps something like this:

Is order_by data an array?
--YES: We're good, continue
--NO:
-----Is it an SQL-style string?
---------YES: Cool, let's just parse it as a convenience
---------NO: Error telling user life is different in 2.x, check the docs!
-----Is it a property map with precisely ONE property with a string value of DESC or ASC?
---------YES: Cool, let's just make wrap it in an array for them and return with a warning message kind of like:

{
    "data": {
        "XXXXX": XXXXXX
    },
    "extensions": {
        "warnings": [
            {
                "message":"Your order_by value was not an array, in 2.X all order_by clauses must be arrays (https://xxxxxx/docs). We've wrapped your value in an array as a courtesy. This behavior may break soon."
            }
        ]
    }
}
---------NO: Throw an error.

@24601 I guess that was a poor choice of words from my end based on my understanding of the issue. I should have been more clear. Apologies.

What I actually meant by the "random coincidence" was that as the GraphQL spec mentions that entries in an object are unordered, when you pass the following argument order_by: {created_at: asc, user_id: asc} the choice of which field gets picked first can be random. In some instances the created_at will be used to sort first but it is possible in the next iteration user_id might be used to sort first. According to the spec there is no guarantee of consistency here. As this change was introduced since v2.0.0, I meant that possibly while using v2.0.5 created_at was probably getting picked first as a random coincidence. Hope this clarifies what I meant.

rikinsk commented 3 years ago

on the issue of the Array vs. Property Map order_by value, the query validator should throw an error or at least a warning (using the graphql extensions object as I highlighted above) to users who submit anything but an array.

Our decision to not throw an error for such queries was because such queries were supported in v1 and we didnt want to start throwing errors for queries that used to work earlier. Results with bad ordering seemed more acceptable than outright errors in case someone had missed updating their existing queries before updating to v2.0.

But your point is valid. Maybe a Warning in some form seems like the best option here. We'll discuss this as well to improve the experience around this

rikinsk commented 3 years ago

Update: We have identified the root cause of this issue and we'll be making a release on cloud with the fix very soon

24601 commented 3 years ago

@rikinsk - thank you for thinking about it and hearing the idea! I appreciate it.

I can definitely appreciate and agree with the "let's keep it backwards compatible". I think a warning would be good in the extensions key facility of the GraphQL response, and perhaps, for Cloud customers, since you know our technical contact emails, a once-or-twice only email about "Hey, we noticed some of your queries are a bit out of date" with the operation name/query name (very useful in large projects/code bases), or even just a message on our console the next time someone logs in could be helpful since it is likely the actual extensions key response is not read by a human most of the time unless there's an out-right error.

Thanks again for your guys' responsiveness on this.

on the issue of the Array vs. Property Map order_by value, the query validator should throw an error or at least a warning (using the graphql extensions object as I highlighted above) to users who submit anything but an array.

Our decision to not throw an error for such queries was because such queries were supported in v1 and we didnt want to start throwing errors for queries that used to work earlier. Results with bad ordering seemed more acceptable than outright errors in case someone had missed updating their existing queries before updating to v2.0.

But your point is valid. Maybe a Warning in some form seems like the best option here. We'll discuss this as well to improve the experience around this

24601 commented 3 years ago

Update: We have identified the root cause of this issue and we'll be making a release on cloud with the fix very soon

@rikinsk - many thanks, you guys are jumping on this and we see and appreciate that deeply. Many thanks.

rikinsk commented 3 years ago

for Cloud customers, since you know our technical contact emails, a once-or-twice only email about "Hey, we noticed some of your queries are a bit out of date" with the operation name/query name (very useful in large projects/code bases), or even just a message on our console the next time someone logs in could be helpful since it is likely the actual extensions key response is not read by a human most of the time unless there's an out-right error.

@24601 Thats a great idea! We'll definitely do this. As a matter of fact, due to this behaviour change we had actually made the upgrade from v1.0 to v2.0 on Cloud as an opt-in as described here. We had also sent out emails to existing users that would have been impacted by this behaviour change so that they could make the relevant changes to their queries before updating. The idea to notify users who might still be using the older query format is something we'll definitely do.

On a more general note, we at Hasura strive to not introduce any breaking changes and go out of our way sometimes to ensure backwards compatibility at all times. In this particular instance though, on our path to v2.0, we had a major refactor in the query parsing logic in an attempt to adhere to the GraphQL spec as closely as possible which made us lose the ability to preserve backwards compatibility for this particular behaviour. It was indeed unfortunate.

rikinsk commented 3 years ago

Update: We have identified the root cause of this issue and we'll be making a release on cloud with the fix very soon

The release is now complete

rikinsk commented 3 years ago

Closing this as the fix was released in v2.0.7-cloud.3

24601 commented 3 years ago

for Cloud customers, since you know our technical contact emails, a once-or-twice only email about "Hey, we noticed some of your queries are a bit out of date" with the operation name/query name (very useful in large projects/code bases), or even just a message on our console the next time someone logs in could be helpful since it is likely the actual extensions key response is not read by a human most of the time unless there's an out-right error.

@24601 Thats a great idea! We'll definitely do this. As a matter of fact, due to this behaviour change we had actually made the upgrade from v1.0 to v2.0 on Cloud as an opt-in as described here. We had also sent out emails to existing users that would have been impacted by this behaviour change so that they could make the relevant changes to their queries before updating. The idea to notify users who might still be using the older query format is something we'll definitely do.

On a more general note, we at Hasura strive to not introduce any breaking changes and go out of our way sometimes to ensure backwards compatibility at all times. In this particular instance though, on our path to v2.0, we had a major refactor in the query parsing logic in an attempt to adhere to the GraphQL spec as closely as possible which made us lose the ability to preserve backwards compatibility for this particular behaviour. It was indeed unfortunate.

When can we expect this, we woke up to ANOTHER auto-upgrade that did the SAME thing.

It is beyond critical that Hasura STOP TOUCHING PRODUCTION INSTANCES.

Warn us about needing to upgrade, even give us deadlines on critical security fixes, but you CANNOT run a paid production service where you silently, overnight break the customer's app and worse, not give them a way to revert it at all, and then take HOURS or a fully day to fix it (or more!). Now twice in one month!

jflambert commented 2 years ago

This is due to a behaviour change introduced in 2.0. See the Breaking changes section in the release notes here for details.

order_by expects an array of entries and not an object. You would need to update your query to use order_by: [{created_at: asc}, {user_id: asc}] instead to get the expected behaviour.

Wow. This bit me SO hard today. I found this thread via google for "hasura broken order_by" or something like that... I'm disliking graphql more and more!!!

24601 commented 2 years ago

@jflambert This isn't a GraphQL problem as much as it is how Hasura implemented it then decided to make a breaking change in a bug fix release without documenting it, telling anyone about it, and auto-upgrading their cloud customers to it.

That isn't a GraphQL problem, that's a really poorly thought out software development, deployment, and customer care process.

REST, GraphQL, SOAP, carrier pigeon or smoke signals, doesn't matter if the humans making the decisions make really poorly thought out or rushed decisions or violate the basic tenets of rational thought & action, even the best technology will be junk.

This was 100% a human & process failure that was preventable by Hasura. Without a doubt. And it wasn't the first time they did something like this, either.

Bugs happen, I get that, but this wasn't even a bug, it was negligent planning & design with multiple failures along the way.

Never. Ever. Ever. Make. Breaking. Interface. Changes. In. A. Bug. Fix. Release. (Even if it fixes a previously released "bug"!)

Document your changes.

Deprecate things before you change behavior.

Issue warnings.

Version lock your cloud customers and don't auto upgrade them.

They failed on every single possible protection here. Mere weeks after they did exactly the same thing on a completely different issue.

On Fri, Dec 10, 2021 at 6:23 PM jflambert @.***> wrote:

This is due to a behaviour change introduced in 2.0. See the Breaking changes section in the release notes here https://github.com/hasura/graphql-engine/releases/tag/v2.0.0 for details.

order_by expects an array of entries and not an object. You would need to update your query to use order_by: [{created_at: asc}, {user_id: asc}] instead to get the expected behaviour.

Wow. This bit me SO hard today. I found this thread via google for "hasura broken order_by" or something like that... I'm disliking graphql more and more!!!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hasura/graphql-engine/issues/7453#issuecomment-991402437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI2QV34B6I2HM53KJQYSSLUQKRZRANCNFSM5CW7NTRQ .