hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0

Production Hasura Cloud Should Not Auto-Upgrade #7568

Open 24601 opened 3 years ago

24601 commented 3 years ago

Version Information

Server Version: Any

CLI Version (for CLI related issue): Any

Environment

Cloud

What is the expected behaviour?

Hasura Cloud instances should not auto-upgrade themselves without opt-in or, in the worst case of a fix for an extreme-severity zero-day exploit, without 24 hours' notice so that customers/tenants can test.

What is the current behaviour?

Currently, Hasura automatically upgrades instances. This is NOT acceptable and runs contrary to every production system best practice.

In just a few short months as Hasura Cloud production customers, we have already had TWO instances where HASURA introduced bugs that took down our application, directly impacting our customers, and we are HELPLESS to fix the issue because we can't downgrade and we didn't even KNOW Hasura had upgraded our instance until it broke!

Then, we have to wait a day for Hasura to fix it! This is patently unacceptable.

Any possible solutions?

STOP AUTO UPGRADING YOUR CUSTOMERS. In the worst case of a fix for an extreme-severity zero-day exploit, give customers/tenants 24 hours' notice for testing.

For reference, the two bugs IN ONE MONTH ALONE that have taken down our production product running on Hasura with two paid instances are:

https://github.com/hasura/graphql-engine/issues/7557

https://github.com/hasura/graphql-engine/issues/7453

It is beyond critical that Hasura STOP TOUCHING PRODUCTION INSTANCES. No amount of testing and verification Hasura can do can replace customer acceptance testing against our specific workload, use cases, configuration, and data model. There are no exceptions to this principle in production.

Warn us about needing to upgrade, even give us deadlines on critical security fixes, but you CANNOT run a paid production service where you silently, overnight, break the customer's app and, worse, give them no way to revert it at all, and then take HOURS or a full day (or more!) to fix it. Now twice in one month!

@coco98 - I raised this as a problem the last time this happened. Your team fixed the bug itself, but they did not respond to this specific part of the issue or advise a fix for it. They did not fix the root cause, which is Hasura messing with our production instances, and of course, less than 30 days later, we're having the same problem. While I appreciate their response on the bug they introduced, they did not pay due attention to the root cause. They are not taking it seriously. Please know that this will cause you to lose production customers, and it is an incredibly basic, easy thing to fix.

It is a win for you guys, because you don't get irate, panicked customers who are losing business themselves because you broke the app they trusted you with, and a win for us, because we don't have to spend time writing nasty GitHub issues for things that were preventable!

I keep pointing to the actual root cause: "don't touch production systems without notice or opt-in to an upgrade the customer has had a chance to test in their environment and workload, no matter how small." And Hasura keeps breaking our systems. What is this? Sadism? I feel like I'm taking crazy pills.

fitzerman commented 11 months ago

I second this completely. My production server has been taken down multiple times by breaking changes to the generated names of certain types in the GraphQL schema...