urql-graphql / urql

The highly customizable and versatile GraphQL client with which you add on features like normalized caching as you grow.
https://urql.dev/goto/docs
MIT License

Subscription cache updates triggered during a mutation can cause cache tearing #1525

Closed Kingdutch closed 3 years ago

Kingdutch commented 3 years ago

I may have found a race condition in Graphcache updates for cache updates that are triggered by both a mutation and a subscription. Both are needed for a chat where messages are sent that need to be synced across multiple devices.

In particular I believe it's caused by multiple rapid mutations causing the subscription updater to be called before the updates for a successful mutation, which in turn breaks reconciliation.

I've been able to track it down to a specific cache manipulation function (added at the end of this post for completeness). The function appends the new message to a message list in a conversation, if the message does not already exist in the list.

This function is called with a message from the result of a mutation and from the subscription that receives new messages (including ones sent by the client since the server can't link a subscription to the origin of a mutation).

This works fine when mutations are sent slowly, as can be seen in the log output that I added in the cache code.

timestamp location __typename id cacheResolveResult filename:linenumber:column
19:17:10.452 mutation ChatMessage optimistic-a5e38364-adaf-4fe5-8884-a1e118bac338 null cache.js:115:18
19:17:11.122 mutation ChatMessage 29c86790-f904-46c4-af98-68d70fef4a8e null cache.js:115:18
19:17:11.219 subscription ChatMessage 29c86790-f904-46c4-af98-68d70fef4a8e null cache.js:185:18

When two mutations are sent in rapid succession, however, the execution is no longer linear. As a result the ChatConversation that's updated in updateConversationMessageLists seems to turn into a null result in a component that depends on it, causing the component to have no data where it expects data.

This is the log output in that scenario

timestamp location __typename id cacheResolveResult filename:linenumber:column
19:17:17.851 mutation ChatMessage optimistic-d569bd44-0ef5-41ec-95c2-868b5b6dee87 null cache.js:115:18
19:17:18.189 mutation ChatMessage optimistic-c94ddd7e-91d0-4f7b-9810-54e3ff8cfd99 null cache.js:115:18
19:17:18.477 subscription ChatMessage 160ea03f-70a7-4621-92e6-e2255bc7ef11 null cache.js:185:18
19:17:18.838 subscription ChatMessage cada3d6e-e8e9-497d-89f1-27b8bd512e0f null cache.js:185:18
19:17:18.942 mutation ChatMessage 160ea03f-70a7-4621-92e6-e2255bc7ef11 null cache.js:115:18
19:17:18.945 mutation ChatMessage cada3d6e-e8e9-497d-89f1-27b8bd512e0f null cache.js:115:18

urql version & exchanges:

├─ @urql/core@2.0.0
├─ @urql/devtools@2.0.3
└─ @urql/exchange-graphcache@4.0.0

Steps to reproduce

  1. Create a cache update function that reads and writes from the cache
  2. Call the function in a mutation and a subscription updater
  3. Trigger a new subscription message on the server while processing the mutation with the same data
  4. On the client trigger two mutations in rapid succession

Expected behavior
All cache updates happen correctly and the component that has queried the updated data renders as usual.

Actual behavior
The component that has queried the data receives null, causing it to fail its data dependencies and leaving it unable to show the user a meaningful screen.


function updateConversationMessageLists(cache, message) {
  const cacheKey = {
    __typename: message.conversation.__typename,
    id: message.conversation.id
  };
  const allFields = cache.inspectFields(cacheKey);

  // Find all message lists that were looking at the end of the conversation
  // where new messages would appear.
  const messageQueries = allFields.filter(field => field.fieldName === "messages" && field.arguments.last && !field.arguments.before);

  const fragment = gql`
    fragment _ on ChatConversation {
      __typename
      id
      messages(last: $last) {
        edges {
          node {
            id
          }
        }
      }
    }
  `;

  // Update all messages field instances for the affected conversation.
  messageQueries.forEach(({ arguments: fieldArgs }) => {
    const fragmentVariables = { last: fieldArgs.last };
    const data = cache.readFragment(
      fragment,
      cacheKey,
      fragmentVariables
    );
    if (data) {

      if (data.messages.edges === null) {
        data.messages.edges = [];
      }

      // We just ignore existing messages. We don't expect changes here.
      if (data.messages.edges.findIndex(({ node }) => node.id === message.id) !== -1) {
        return;
      }

      data.messages.edges.push({
        __typename: "ChatConversationChatMessageEdge",
        node: message
      });

      cache.writeFragment(
        fragment,
        data,
        fragmentVariables
      );
    }
  });
}
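
For completeness, this is roughly how the helper is wired into the cache exchange (a simplified sketch of the real updaters, which are shown in full further down):

import { cacheExchange } from "@urql/exchange-graphcache";

const cache = cacheExchange({
  updates: {
    Mutation: {
      viewerSendUserChatMessage: (result, args, cache, info) => {
        // Sketch only: the real updater also handles missing data and errors.
        const message = result.viewerSendUserChatMessage.message;
        if (message) {
          updateConversationMessageLists(cache, message);
        }
      },
    },
    Subscription: {
      chatMessageReceived: (result, args, cache, info) => {
        if (result.chatMessageReceived) {
          updateConversationMessageLists(cache, result.chatMessageReceived);
        }
      },
    },
  },
});
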
kitten commented 3 years ago

Without a reproduction I'll just have to rely on your description, but can you point out exactly what turns null and where, and whether it's multiple things?

Kingdutch commented 3 years ago

Ah sorry I forgot to add a reference to the original discussion that contained a bit more information: https://github.com/FormidableLabs/urql/discussions/1492

Originally I thought this was just a case of partial data, although it's not mapped to ReasonUrql's PartialData type but to the Data type. Additionally it doesn't just happen when the stale flag is set to true (I was able to handle that case separately by falling back to a loading spinner).

The shortened version of the query (as posted in the discussion) is

query ConversationMessages($id: ID!, $messagesLast: Int!, $messagesBefore: Cursor) {
  chatConversation(id: $id) {
    ...Conversation_View @arguments(messagesLast: $messagesLast, messagesBefore: $messagesBefore)
  }
  viewer { id }
}

Here chatConversation is set to null while viewer is populated, even though as far as I can tell everything is in the cache and the server sends what would be expected (and what worked when handling a single mutation at a time with the same code path).

I've added the full fragments in the following Gist as well as the state of the cache (using Urql devtools' explorer) that I still had open: https://gist.github.com/Kingdutch/b875483eb5673475f883b3b0827d171c

kitten commented 3 years ago

Ok, but let's skip me reading raw cache inputs; those aren't that important. The question is whether your updater is seeing the correct data, writing the correct links, reading / handling null values correctly, and whether your mutation and listing fragments overlap sufficiently.

Kingdutch commented 3 years ago

This is the minimum mutation that was able to reproduce the error (only the function referenced above is left). The commented out parts are things I was able to eliminate.

        viewerSendUserChatMessage: (result, args, cache, info) => {
          // Our optimistic update doesn't include a clientMutationId but our result does,
          // so we use it to remove the optimistically added message.
          if (typeof result.clientMutationId !== "undefined") {
            // cache.invalidate({__typename: 'ChatMessage', id: 'optimistic-' + result.clientMutationId});
          }

          const message = result.viewerSendUserChatMessage.message;
          if (message === null) {
            ErrorHandling.report(new Error("Successful chat message send mutation did not contain a message."));
            return null;
          }

          const cachedMessage = cache.resolve({ __typename: message.__typename, id: message.id});
          console.log("mutation", message.__typename, message.id, cachedMessage);

          // Ensure we have a conversation to update.
          if (!message.conversation || !message.conversation.__typename || !message.conversation.id) {
            ErrorHandling.report(new Error("viewerSendUserChatMessage mutations must request conversation __typename and id to perform updates."));
            return null;
          }

          updateConversationMessageLists(cache, message);
          // updateViewerRecentConversationsList(cache, message.conversation.id);
          // markChatConversationRead(cache, message.conversation.__typename, message.conversation.id, message.sent.timestamp);
          // We have no reliable way of knowing how many messages were marked as read or unread so we can only invalidate the global unread count.
          // invalidateViewerChatReadStatus(cache);
        },

This is the minimum subscription updater, following the same approach. I've not commented out the updateQuery part here because the conversation is present in the cache, so I assumed the === null check would never be true for my test case.

        chatMessageReceived: (result, args, cache, info) => {
          const message = result.chatMessageReceived;

          // We need a message __typename and id to perform cache checks.
          if (!message.id || !message.__typename) {
            ErrorHandling.report(new Error("chatMessageReceived subscriptions must request __typename and id to perform updates."));
            return null;
          }

          const cachedMessage = cache.resolve({ __typename: message.__typename, id: message.id});
          console.log("subscription", message.__typename, message.id, cachedMessage);

          // Ensure we have a conversation to update.
          if (!message.conversation || !message.conversation.__typename || !message.conversation.id) {
            ErrorHandling.report(new Error("chatMessageReceived subscriptions must request conversation __typename and id to perform updates."));
            return null;
          }

          // If this is the first message in a conversation then we must create the conversation.
          if (cache.resolve({ __typename: 'Query' }, 'chatConversation', {id: message.conversation.id }) === null) {
            cache.updateQuery(
              {
                query: `
                  query ($id: ID!) {
                    chatConversation(id: $id) {
                      __typename
                      id
                      name
                      messages(last: 1) {
                        edges {
                          node {
                            id
                            ...Chat_Message
                          }
                        }
                      }
                    }
                  }

                  ${Chat_Message.query}
                `,
                variables: {
                  id: message.conversation.id
                }
              },
              _ => ({
                chatConversation: {
                  ...message.conversation,
                  messages: {
                    __typename: "ChatConversationChatMessageConnection",
                    edges: [{
                      __typename: "ChatConversationChatMessageEdge",
                      node: message
                    }]
                  },
                }
              })
            )
          }

          updateConversationMessageLists(cache, message);
          // updateViewerRecentConversationsList(cache, message.conversation.id);

          // // If this is a new message from another user then we treat it as unread.
          // const viewer = getViewer(cache);
          // if (viewer !== null && viewer.id !== message.sender.id) {
          //   incrementChatConversationUnreadCount(cache, message.conversation.__typename, message.conversation.id);
          //   incrementViewerChatUnreadCount(cache);
          // }
        }

The question is whether your updater is seeing the correct data, writing the correct links, reading / handling null values correctly, and whether your mutation and listing fragments overlap sufficiently

I'm assuming these things are correct because for single messages everything is working as expected.

I was also not able to reproduce it in a scenario where the connection to the subscription was failing (so it was only the mutation which was frequently updating).

These things combined lead me to believe the issue is caused in the interplay of mutation and subscription when multiple mutations (and thus subscription messages) are triggered.

This is the event view for the two faulty mutations (and the subscription).

image

The top Q is a query from a portal that fetches a different query for viewer. When I disabled this query (removing the useQuery code) I was still able to reproduce the issue, so I excluded that as cause.

The S without an event is for another subscription type for which no data is being sent in this scenario. The two mutations are shown, and the query that receives partially nulled data is shown right above them.

Kingdutch commented 3 years ago

I'm not sure if it's related but it seems to be an unexpected error in either case. I was trying to reproduce this with the debugger enabled and asked it to pause on exceptions which shows that there are circumstances in which graphcache can throw a type error.

image

I was able to track this down to a destructuring in inspectFields in data.ts: https://github.com/FormidableLabs/urql/blob/618f791a011bfb55de2a8d8bede3dc4132b4af75/exchanges/graphcache/src/store/data.ts#L556

Since this uses global state this could explain why the race condition occurs? One such place where I see an async clear of global state is at https://github.com/FormidableLabs/urql/blob/618f791a011bfb55de2a8d8bede3dc4132b4af75/exchanges/graphcache/src/store/data.ts#L146-L152

The firing moments of subscriptions are not controlled by urql (I'm assuming, since they originate from websocket events). So it can happen that the planned cleanup of a mutation occurs during the processing of a subscription (or vice versa). This would cause the data for the operation currently being processed to be thrown out unexpectedly, corrupting the cache.

I've not been able to actually expose the error and get it output on the console by adjusting code or monkeypatching inspectFields, so there is some uncertainty here (I'm not sure where it goes/if there's a fallback). However, I was able to make the debugger stop on this type error multiple times in my reproduction.
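
For reference, the kind of wrapper I mean by monkeypatching is roughly this (illustrative only, not my exact instrumentation):

function inspectFieldsOrLog(cache, entityKey) {
  // Illustrative: surface a TypeError thrown by inspectFields instead of
  // letting it disappear into whatever fallback swallows it.
  try {
    return cache.inspectFields(entityKey);
  } catch (error) {
    console.error("inspectFields threw for", entityKey, error);
    return [];
  }
}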

Kingdutch commented 3 years ago

I've been trying to build a reproduction case with the codesandbox links provided (I missed those when reporting, sorry about that! Great addition since my last bug report, making a representative server is now a lot easier! ❤️ )

Client: https://codesandbox.io/s/subscription-on-mutation-client-d7w33?file=/src/cache.js Server: https://codesandbox.io/s/subscription-on-mutation-server-z1k7k?file=/schema.js

I'm not able to get the exact behaviour that I'm experiencing in my own application, but I do see some (for me) unexpected things happening in the cache that may cause the behaviour I saw. Below is a short screen recording.

https://user-images.githubusercontent.com/327697/114166490-154f5c80-992e-11eb-8403-c9a0ad579063.mov

In the cache updates I'm mutating the incoming entity to append to the message text where it arrived (either mutation or subscription). The idea was that since I only write new messages, I could see where each message first arrived while trying to break it.

However, I see that for some reason, outside of my own cache updates, the cache is being changed with the original message from the server. Sometimes this overrides the optimistic update (which has a different ID) but mostly it overrides the cache writes from my caching logic (this happened even if I gave those a different ID).

Edit: If I disable the subscription in InternalApp.js the problem disappears and things work as expected. The cache now does throw away the optimistic update itself (I wasn't aware that happened and in my own code I manually invalidated the optimistic update, so that could cause my specific symptoms). This leads me to believe that the updates from subsequent mutations or from subscriptions are being seen as part of the optimistic update layer and then thrown away when the mutation completes. (If I correctly understand how the layers work)

kitten commented 3 years ago

Ok, that's expected and I'm actually happy that you're seeing this, believe it or not 😆 The reason why you'll never see tearing ("mixed optimistic and non-optimistic results") in a Graphcache app without lists and subscriptions is that we have a mechanism to prevent tearing there. This is explained in our "Optimistic Mutation" part of the docs now:

Note: While optimistic mutations are waiting for results from the API all queries that may alter our optimistic data are paused (or rather queued up) and all optimistic mutations will be reverted at the same time. This means that optimistic results can stack but will never accidentally be confused with "real" data in your configuration.

What this means is that optimistic mutations are queued up until all of them have settled and received a server result. At that point we can safely "flush out" all of the optimistic changes and apply the real ones instead.

This cannot apply to subscriptions, however; subscriptions are more time sensitive. We could queue up all subscriptions that collide with optimistic mutations too and flush them all at once (or flush them again and apply them optimistically), but that's a huge complexity add: we'd apply subscription results on optimistic layers (which we don't even have the means to do, because we'd have to generate new IDs for their results), and then, when all optimistic mutations settle, we'd flush them out and apply them again, in order with the mutations.

Instead, we realised that once subscriptions are in play it's better not to apply this special behaviour at all. This is advantageous because once you have subscriptions, you already have timings you'd like to follow.

There are two problems here you have to address: the optimistic message carries a temporary ID that the eventual mutation result doesn't share, so the real message can't be matched up with the optimistic entry in the list, and the subscription delivers that same message under its server ID, so it can't be matched to the optimistic entry either and can end up as a duplicate.

Both of these problems can be addressed by treating the entire list as unreliable. Instead, you'll need a way to identify the messages yourself. You could attach a small cuid to each message (since those are time-sensitive identifiers), sent along in the mutation input for sending messages. This means that the optimistic and non-optimistic versions of a message will have an identical identifier (so, an additional ID field you can use).

This then means that the optimistic mutation can generate and write to the same object that non-optimistic mutations can, since the ID is known. On the server you'd probably choose to still generate / auto-generate your normal IDs, since those are "safe" and not user-controlled.

The optimistic updater then just, as usual, creates a message with the cuid and is appended. The non-optimistic updater does the same, except it'll not have to append. The subscription then also only has to append, as long as that "append" function checks that the cuid doesn't exist yet.
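
In code, that shared append helper could look roughly like this; a sketch only, where the cuid field is the new identifier you'd add and the edge typename is taken from your snippets:

function appendMessageIfMissing(cache, conversationKey, fragment, variables, message) {
  // The fragment needs to select cuid on the nodes for the check below to work.
  const data = cache.readFragment(fragment, conversationKey, variables);
  if (!data) return;

  const edges = data.messages.edges || [];
  // The same client-generated cuid is present on the optimistic write, the
  // mutation result and the subscription payload, so one check covers all
  // three paths.
  if (edges.some(({ node }) => node.cuid === message.cuid)) return;

  data.messages.edges = [
    ...edges,
    { __typename: "ChatConversationChatMessageEdge", node: message },
  ];
  cache.writeFragment(fragment, data, variables);
}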

So, tl;dr: subscriptions don't get any "magical" tearing protection, because that wouldn't quite give the "visual appeal" you'd expect unless we put a lot of layers of logic in; even then, they'd probably stop feeling real-time if we did anything special.

Kingdutch commented 3 years ago

Okay right! However, I think that's then a limitation of my reproduction, not of the production code that had the original issue. Specifically in my production mutation I would do the following:

  if (typeof result.clientMutationId !== "undefined") {
    cache.invalidate({__typename: 'ChatMessage', id: 'optimistic-' + result.clientMutationId});
  }

(In my post above this is commented out to eliminate it as a cause for my problems). Additionally my updater function will not append a new message if it's already in the cache.

So that should address the two points that you raised (in my production code).

One thing that's unclear to me though is where the lines in my recording without any (optimistic), (mutation) or (subscription) suffix are coming from. If I look at the cache logic in my reproduction example, then all entries should end in either (subscription) or (mutation), unless something that doesn't go through my sendMessage and newMessages functions is modifying my cache.

The subscription then also only has to append, as long as that "append" function checks that the cuid doesn't exist yet.

How would the subscription get the cuid? I wouldn't be too happy forwarding it between clients, because if a malicious user crafts their cuids correctly they could adjust other people's caches.

I've adjusted the example client to invalidate the optimistic update when the mutation completes. However, this only works if there is no competing subscription.

https://user-images.githubusercontent.com/327697/114177909-3ff4e180-993d-11eb-8fe0-ac6f6ae0f87c.mov

As you can see for the first few messages the optimistic update is properly removed. However, when I rapidly trigger multiple mutations then this stops working.

Perhaps I understand what's happening now, but not how I'm supposed to handle it?

Kingdutch commented 3 years ago

Okay, I think I get it.

When subscriptions apply an update, they'll "see" the optimistic items in the list and will permanently apply them. (Which can mean that they may even apply messages that will be erased in the future)

So this means that in my appendMessage function when it's called by the subscription, I have to filter out the optimistic updates so I don't accidentally apply them?

Adding this (where filterOptim is true for subscriptions) seems to work. Sort of.

      if (filterOptim) {
        data.messages = data.messages.filter(
          (m) => !m.id.startsWith("optimistic-")
        );
      }

https://user-images.githubusercontent.com/327697/114178981-b5ad7d00-993e-11eb-8c1a-33c9b7787674.mov

What I'm confused about is that I don't understand where the messages without the (mutation) or (subscription) are coming from.

kitten commented 3 years ago

So this means that in my appendMessage function when it's called by the subscription, I have to filter out the optimistic updates so I don't accidentally apply them?

That doesn't really work. If you filter out all optimistic updates in your subscription, then on any subscription result you'll lose all optimistic messages. In other words, if a different user sends a message and that comes in as a subscription result, then you'd lose all pending messages from the active user.

Also, invalidating the optimistic mutation item may not do what you expect it to do. It'd be a lot more effective if you'd simplify the problem by letting the normalised cache know the ID from optimistic mutation to mutation result ahead of time.

The messages without mutation and subscription? That's easily explained, I'd guess you have a subscription that updates a message of a completed mutation. If the mutation completes first then you'll have message + ' (mutation)'; however, the subscription comes in after, matches the ID, and updates the message automatically, due to the usual normalised cache semantics. Your manual update function won't do anything in that case, since the message is already in the list. That's expected, since you only did this for debugging.

Kingdutch commented 3 years ago

It'd be a lot more effective if you'd simplify the problem by letting the normalised cache know the ID from optimistic mutation to mutation result ahead of time.

I'm not sure how I'd do that? The server assigns IDs so the client can't predict what the ID will be for a mutation result. EDIT2: So I know this for mutations, because I have a clientMutationId which is returned by the server in a mutation result. So I use that to link up optimistic updates with the result of a mutation.

However, since the server assigns IDs, and I don't have a clientMutationId for subscriptions, I don't know how in the subscription message flow I'd know what the optimistic update is that this subscription message corresponds to.

That's easily explained, I'd guess you have a subscription that updates a message of a completed mutation.

In that case I'd agree with you. Except that's not true.

I'm updating the cache through two routes: the sendMessage mutation updater and the newMessages subscription updater.

So as far as I understand I shouldn't see any new results without anything appended, unless the cache is secretly fetching query results in the background and somehow merging them in; I've not seen any network activity that would point to that, nor would I understand why it would do that (since my update functions should be sufficient).

EDIT: In the video you can also very shortly see (subscription) appended to some messages, before that is replaced.

kitten commented 3 years ago

Like I said at the start, I'd generate an ID on the client side and attach it. So generally I'd maybe send an additional cuid with the message. This could then be stored in the database (if it's relational, I guess you could even index it). Specifically, I'm recommending cuid because they're collision resistant enough across clients. The server side would still generate its own id and use that as the primary key. However, on the client you could key the cache (in Graphcache, via keys) by that cuid instead.

So instead of linking things up with clientMutationId, I'd basically have a cuid that's always passed around and used for normalised caching.
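
In Graphcache terms that boils down to something like this (a sketch; cuid here is whatever field you end up adding):

import { cacheExchange } from "@urql/exchange-graphcache";

const cache = cacheExchange({
  keys: {
    // Key messages by the client-generated cuid when it's present, so the
    // optimistic write and the later server results resolve to the same
    // cache entity; fall back to the server id otherwise.
    ChatMessage: data => data.cuid || data.id || null,
  },
});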

In that case I'd agree with you. Except that's not true.

It is true. The subscription updater runs after the subscription result itself has been written to the cache. So by the time you're trying to append an updated message, the message has (most of the time) already been appended by the mutation, and your subscription updater doesn't append anything new. This can often even work the other way around, where the mutation updates the subscription's message. Remember, at some point your message is already in the list, at which point your updaters aren't fully running and updating the message, because they abort when the message is already in the list.

Kingdutch commented 3 years ago

So instead of linking things up with clientMutationId, I'd basically have a cuid that's always passed around and used for normalised caching.

Okay that's something I'd have to think about. Storing that on our server for this specific case should be doable but there are other data types with similar constraints/real-time setups where storing this would be a non-trivial change to our architecture. I also think storage is a requirement to prevent malicious actors from sending duplicate cuids and messing up clients' caches that way.

It is true. The subscription updater runs after the subscription result itself has been written to the cache. So by the time you're trying to append an updated message, the message has (most of the time) already been appended by the mutation, and your subscription updater doesn't append anything new. This can often even work the other way around, where the mutation updates the subscription's message. Remember, at some point your message is already in the list, at which point your updaters aren't fully running and updating the message, because they abort when the message is already in the list.

My mutation updater does

{
  ...message,
  message: message.message + " (mutation)"
}

My subscription updater does

{
  ...message,
  message: message.message + " (subscription)"
}

Yet the output shows a naked message.

I understand that my updaters abort if the other updater has already processed the message. However, what's the third case here that seems to be bypassing both of my updaters?

I only expect naked messages to happen when messages are fetched in a query from the server since that's neither a mutation nor a subscription, so those messages aren't modified by my updaters. When adding messages after the initial page load, the only network traffic I see are mutations and subscriptions.

EDIT:

I'm referring to 78/79 in this image.

image

(The coloring was added because some mutations were being added out of order, which was resolved by no longer invalidating optimistic updates but replacing those with the final mutation/subscription results as per your suggested cuid pattern)

kitten commented 3 years ago

However, what's the third case here that seems to be bypassing both of my updaters?

When the message is already present in the list and the normalised cache writes the new data automatically to the message entity directly. That's when the message updates without any of your updaters.

Edit: One idea I have that would maybe work here too, if you can't add a stable secondary ID for the clients: you can filter out optimistic messages (as you're already doing), but only if your current updater run is non-optimistic, i.e. triggered by a real mutation result. You can detect this by checking !info.optimistic.
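
Roughly, inside the mutation updater (a sketch against your sandbox's flat message list, not your production shape):

function appendFromMutation(cache, queryInput, message, info) {
  cache.updateQuery(queryInput, data => {
    if (!data) return data;
    let messages = data.messages || [];
    // The mutation updater runs once for the optimistic result and once for
    // the real one; only the real (non-optimistic) run flushes placeholders.
    if (!info.optimistic) {
      messages = messages.filter(m => !m.id.startsWith("optimistic-"));
    }
    if (!messages.some(m => m.id === message.id)) {
      messages = [...messages, message];
    }
    return { ...data, messages };
  });
}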

Edit 2: Then you'd only have duplicate messages if the subscription event comes in before the mutation result. You can make this less likely on the API by delaying the publish of the "new message" event until after the current mutation completes. You can also completely solve this by ignoring newMessage events that have originated from the current device. This detection is trickier without a client-side ID, but you could try to add a heuristic for this.

Kingdutch commented 3 years ago

When the message is already present in the list and the normalised cache writes the new data automatically to the message entity directly. That's when the message updates without any of your updaters.

Aaaaah I see, okay :)

Yeah, unfortunately my subscription usually (but not always) arrives before my mutation result at the moment, given that my subscriptions are served by a separate process fed by a RabbitMQ queue, and the message is put on the queue before the mutation returns (since that's a PHP web request in Drupal that needs to finish its response). My constraints are, shall we say, interesting 🙈.

Thanks for the help! At least this means I have some improvements to make in the caching code of the actual application 🎉 . I'll get back to that on Monday (today was my day off but didn't want to let you wait too long for my reply) and then I'll see if those improvements also solve my original problem (or whether I need to improve my reproduction sandbox).

Kingdutch commented 3 years ago

I ended up being impatient and working on it today anyway 😇

It took me a lot of trial and error but I think I managed to get it right. It was indeed the problem that we described here :D

I ended up adding a cuuid field to the message entities like you suggested (we'll have to look on a per-case basis at how best to solve this elsewhere, but it seemed most robust here), then made sure that I replace the optimistic update when a subscription comes in ahead of the mutation result, and skip otherwise.

One thing that did trip me up along the way is that I was persisting the cuuid from the optimistic update but Graphcache didn't seem to accept that. I'm not sure why that is, could it be because cuuid is not in the original query in useQuery?

I was able to work around this because I also used the cuuid as the temporary id in the optimistic update (prefixed with optimistic- so my front-end components know the difference) and then matched against the id that was stored in Graphcache :)
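
Simplified, the optimistic side of that workaround looks roughly like this (a sketch rather than the production code; the payload typename and input field names are only illustrative):

import { cacheExchange } from "@urql/exchange-graphcache";

const cache = cacheExchange({
  optimistic: {
    viewerSendUserChatMessage: (args, cache, info) => ({
      __typename: "ViewerSendUserChatMessagePayload",
      message: {
        __typename: "ChatMessage",
        // The cuuid doubles as the temporary id, prefixed so components can
        // tell an optimistic message apart from a confirmed one.
        id: "optimistic-" + args.input.cuuid,
        message: args.input.message,
        conversation: {
          __typename: "ChatConversation",
          id: args.input.conversationId,
        },
      },
    }),
  },
});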

It looks like, when handling those values properly as you described, it works very well; I wasn't able to break it with about 100 near-simultaneous messages :D The only thing left to make it more robust would be to figure out why the cuuid was not persisted from the optimistic update.

Kingdutch commented 3 years ago

Just removed all the debugging code, did a redeploy and tried to break it again for good measure. Still held up so I'm calling it a win.

🎉