michalkvasnicak / aws-lambda-graphql

Use AWS Lambda + AWS API Gateway v2 for GraphQL subscriptions over WebSocket and AWS API Gateway v1 for HTTP
https://chat-example-app.netlify.com
MIT License
461 stars 91 forks

cleanup of subscriptions and connections doesn't seem to be working #83

Closed fridaystreet closed 4 years ago

fridaystreet commented 4 years ago

Not entirely sure if this is an issue or just working as expected. I'm just trying to get my head around how the cleanup should work, as I seem to have lots of old subscriptions. Looking through the CloudWatch logs, it looks like all the old subscriptions for a particular event are getting pulled down each time, then discarded. But essentially it's making the lambda function run longer and longer each time.

My understanding was that subscriptions and subscription operations are removed when they become stale, like connections. That doesn't seem to be the case, at least in our implementation.

Looking at the code base, I can see that unsubscribeAllByConnectionId is called in unregisterConnection, which is called:

  1. when sending the message to the client fails
  2. when the $disconnect is received
  3. when stop message is received

I think there might be a couple of gaps there. Maybe it should also be called when the events processor is iterating through the events and executing each event for all the subscribers. If that execution fails for a subscriber, I think that should cause the subscription to be removed and the client informed so it can resubscribe.

Possibly also in closeConnection in the connection manager, although my assumption here is that telling API Gateway to close the connection triggers the $disconnect, which in turn should be doing the cleanup.

So anyway, I'm not sure if there are circumstances where relying on the 2-hour $disconnect from API Gateway falls through the cracks sometimes and subscriptions are left behind, never cleaned up. Possibly a check to see if the connection on the subscription is valid before executing it in the event processor? Or maybe that should be the role of the subscription manager: it shouldn't return any subscriptions to the event processor if they don't have valid connections.

fridaystreet commented 4 years ago

Been thinking this through a bit. Would I be correct in thinking that a TTL on the subscriptions and operations tables would be just as well placed as the one on the events table?

I mean, the subscriptions and operations are null and void after 2 hours if AWS is going to ditch the connections anyway. Or maybe I missed something about why they need to be kept?
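For reference, DynamoDB's TTL feature expects a Number attribute holding a Unix epoch timestamp in seconds. A minimal sketch of computing a 2-hour expiry (the helper name here is just an illustration):

```javascript
// Sketch: compute a DynamoDB TTL attribute value (Unix epoch seconds).
// A Date object or ISO string won't work; DynamoDB TTL needs epoch seconds.
function ttlInSeconds(hoursFromNow, nowMs = Date.now()) {
  return Math.floor(nowMs / 1000) + hoursFromNow * 3600;
}

// e.g. stamp an item before writing it:
// Item: { ...item, ttl: ttlInSeconds(2) }
```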

Cheers Paul

michalkvasnicak commented 4 years ago

@fridaystreet you're right, they should be removed, but this library relied on AWS API Gateway to invoke the $disconnect event, and it seems that isn't guaranteed by AWS.

Maybe the combination of both is the way to go. Do you see any part where it could be a breaking change? TTL is not a breaking change, but it requires you as a developer to set it up. Unsubscribing during event processing could help everyone just by updating the package version.

IslamWahid commented 4 years ago

same issue here!

IslamWahid commented 4 years ago

I see that while a page with subscriptions is open, every 10 minutes a new connection is opened with its subscriptions, the old one is no longer used, and that just keeps adding to the subscriptions table with no real connection in the connections table.

fridaystreet commented 4 years ago

@michalkvasnicak good point on the id being reused; I'm not entirely sure. However, we have been running with just the TTL since I originally posted this and haven't experienced the issue again. I haven't had a chance to do any really in-depth tracing or load testing, so it could still be worth looking into a cleanup during event processing as well, but certainly as a quick fix the TTL is working well.

I don't think either the TTL or the cleanup would cause a breaking change. Well, for us it hasn't. Using Apollo client, it just handles it and resubscribes to everything automatically.

Probably also worth noting, as per a previous conversation we were having about DynamoDB keys: we have made sure to use very specific subscription key names which include the ids of relevant objects (e.g. new_message_room_some_id), to ensure there are minimal reads from DynamoDB, rather than broad names like 'new_message' and then relying on subscription filtering after pulling all the subscriptions. This seems to scale and perform well so far.
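To illustrate that naming scheme, a minimal sketch (the helper and the new_message prefix are examples from this comment, not part of the library):

```javascript
// Sketch: compose a narrow event name per room, so querying subscriptions by
// event key returns only that room's subscribers, instead of fetching every
// broad 'new_message' subscription and filtering afterwards.
function roomMessageEvent(roomId) {
  return `new_message_room_${roomId}`;
}

// both the publish side and the subscribe side derive the same narrow key:
// pubSub.publish(roomMessageEvent(room.id), payload);
```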

michalkvasnicak commented 4 years ago

@fridaystreet thanks for the info. If I find some time over the weekend I'll try to implement removal of dead connections stored in the table (ones that were not removed by AWS API Gateway automatically invoking the $disconnect event). TTL is the simplest solution, so maybe it needs to be documented too.

Probably also worth noting, as per a previous conversation we were having about dynamo keys, we have made sure to use very specific subscription key names which include the ids of relevant objects (eg new_message_room_some_id), so as to ensure there are minimal reads from dynamo, rather than broad names like 'new_message' and then relying on subscription filtering post pulling all the subscriptions. This seems to scale and perform well so far.

Yes, this is something we were discussing with @AlpacaGoesCrazy in #86, along with using SQS as an event source.

michalkvasnicak commented 4 years ago

@fridaystreet I took a look, and when sending to a connection fails, the connection and all of its subscriptions should be deleted. So the problem (as you wrote) is in event processing: when processing of an event fails, we never try to send anything to the connection, so we don't know whether it's stale or not. We can't remove the connection in this case, because the error isn't necessarily caused by a stale connection. So in case of an error we should check whether the connection is stale or not.

michalkvasnicak commented 4 years ago

@fridaystreet I'm not sure whether to check the connection after event processing fails or before the event is processed.

After event processing fails:

We will check the state of the specific connection only if event processing failed.

Before an event is processed:

We will check the state of the specific connection, and only then will we process the event.

I think that for developers it's easier to reason about event processing happening only on active connections. But on the other hand, checking the state of the connection on every event can result in API throttling on the AWS side.

fridaystreet commented 4 years ago

@michalkvasnicak yeah, I agree with you; after reading your comments, it would make sense to only check on failure, I think. That said, if the connection is canned (that is, killed by telling API Gateway to drop it, which I think is what unregisterConnection is doing, from memory) then the client will just resubscribe. At least in the case of Apollo it will.

It's been a while since I was looking at this, but I think where I was going was: if processing fails, just tell API Gateway to kill the connection and delete it, then leave it up to the client to resubscribe.

But I can see where you're going with 'only check connection if it fails' I'm assuming somehow it would attempt to send a message and if it fails then delete it.

michalkvasnicak commented 4 years ago

@fridaystreet

But I can see where you're going with 'only check connection if it fails' I'm assuming somehow it would attempt to send a message and if it fails then delete it.

There is a getConnection() API on ApiGatewayManagementApi. So I was thinking that on failure we'd just use this API to check whether the connection still exists, and unregister it if not. sendToConnection() already removes the connection and its subscriptions, see https://github.com/michalkvasnicak/aws-lambda-graphql/blob/master/packages/aws-lambda-graphql/src/DynamoDBConnectionManager.ts#L145.
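A rough sketch of that failure-path check, assuming an AWS SDK v2 ApiGatewayManagementApi client is passed in; the function name and the unregister callback are illustrative, not the library's actual API:

```javascript
// Sketch: after event processing fails for a subscriber, probe the connection
// with getConnection(). API Gateway answers 410 Gone for connections that no
// longer exist, in which case we unregister it (dropping its subscriptions).
async function unregisterIfStale(apiGateway, connectionId, unregisterConnection) {
  try {
    await apiGateway.getConnection({ ConnectionId: connectionId }).promise();
    return false; // connection is still live, the failure was something else
  } catch (err) {
    if (err.statusCode === 410 || err.code === 'GoneException') {
      await unregisterConnection(connectionId); // stale: clean up
      return true;
    }
    throw err; // unrelated failure, surface it
  }
}
```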

fridaystreet commented 4 years ago

And yes, I agree also; I think developers would assume it wouldn't be processing stale connections. Again, I guess with that I was thinking that if the stale connections were filtered out first, then you aren't processing all those connections only to find out they failed at the end. There could be all sorts of db lookups and other work going on to process an event; it sort of feels like the cost of the lookup at the beginning could be a bit outweighed by the overhead of processing an event for stale connections, only to determine at the end that the processing was in vain.

Possibly it's even a case-by-case thing based on how many resources it takes to process an event. Maybe it's best to only process valid connections, or maybe it's lightweight enough that stale connections aren't an issue.

Dare I say... could it do both based on dynamic configuration? :-)

fridaystreet commented 4 years ago

Yep, that's exactly what I was thinking: use unregisterConnection via send. Sorry, I just couldn't remember the exact line off the top of my head.

fridaystreet commented 4 years ago

I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.

michalkvasnicak commented 4 years ago

Dare I say... could it do both based on dynamic configuration? :-)

Yes, I think this one can be based on configuration; it'll need a good explanation in the docs (maybe I should create a documentation website, because README.md is not really a good place to cover everything).

I'm leaning towards making the check before processing the default, even if it breaks how it works now; the downside is that if someone uses this with a high volume of events there may be some throttling from AWS :(. But I'll do some research on the AWS limits.

I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.

Would you mind sharing the product (or after you finish)? :) Do you use this library, or your own solution inspired by it?

fridaystreet commented 4 years ago

Sorry for all the comments; it's provoking ideas. We've been running the TTL and it's made a massive improvement.

I reckon that with the TTL on subscriptions etc., an attempted send on event processing failure (effectively invoking unregisterConnection on failure), coupled with a filter on the DynamoDB call that fetches subscriptions to discard connections past their TTL, it would be bang on the money.

I mention the filter because, as you probably know, DynamoDB can take up to 48 hours to delete items past their TTL.
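A sketch of what that filter could look like on the query that fetches subscribers, assuming a numeric ttl attribute holding epoch seconds (the table name and helper are placeholders):

```javascript
// Sketch: query params that skip items whose ttl has already passed, since
// DynamoDB can take up to 48 hours to actually delete expired items.
function subscribersQueryParams(tableName, eventName, nowSeconds) {
  return {
    TableName: tableName,
    KeyConditionExpression: 'event = :event',
    FilterExpression: '#ttl > :now', // drop expired-but-not-yet-deleted rows
    ExpressionAttributeNames: { '#ttl': 'ttl' }, // TTL is a DynamoDB reserved word
    ExpressionAttributeValues: { ':event': eventName, ':now': nowSeconds },
  };
}

// usage (assumes a DocumentClient `db` in scope):
// db.query(subscribersQueryParams('subscriptions', 'new_message_room_1', Math.floor(Date.now() / 1000)))
```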

fridaystreet commented 4 years ago

Would you mind sharing the product (or after finish)? :) Do you use this library or have own solution inspired by it?

Yeah, anything that we do in relation to the library I'm happy to share and contribute. Our product isn't open source, so I can't share IP, but as I say, if it touches your library or is related, I obviously consider those open source contributions. There are also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from DynamoDB, so happy to share those as well.

fridaystreet commented 4 years ago

Oh sorry, I just realised you meant just telling you what the product is. Yeah, of course. It's called ALEiGN; you can find it at http://aleign.com

All the info there references the current v1, which was an Express-based application built for an MVP; it wasn't massively scalable or consistent. v2 is a complete rewrite based on serverless, GraphQL, and a new event-driven architecture, along with the frontend being completely rebuilt.

The sales/marketing guys are working through updating all the screenshots and reference guides now

michalkvasnicak commented 4 years ago

yeah anything that we do in relation to the library I'm happy to share and contribute. Our product isn't opensource, so I can't share IP, but as I say, if it touches your library or is related I obviously consider those opensource contributions. There's also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from dynamo, so happy to share those as well.

I'd very much appreciate that, because I'm not really using this library at the moment (I've changed jobs, so right now I'm working on something completely different).

oh sorry I just realised you were meaning just tell you what the product is. Yeah of course. it's called ALEiGN you can find it at http://aleign.com

all the info there references the current v1. It was an express based application built for an mvp. It wasn't massively scalable or consistent. v2 is a complete rewrite, based on serverless, graphql, and a new event-driven architecture. Along with the frontend being completely rebuilt.

The sales/marketing guys are working through updating all the screenshots and reference guides now

Thank you for sharing! It's good to know that someone uses or is inspired by this library. 👍

IslamWahid commented 4 years ago

@michalkvasnicak @fridaystreet guys, can you please explain this in more detail?

fridaystreet commented 4 years ago

@IslamWahid yeah, I see the same thing every 10 minutes. I think you're right that the connections are re-established every 10 minutes through API Gateway, but that invokes the $disconnect, which causes this library to clean up the connections and subscriptions automatically.

I think it might be the case that inactive connections are reset after 10 minutes, but if there is activity, the inactivity counter is reset and the connection stays alive. But maybe 2 hours is the hard limit: regardless of activity, after 2 hours connections are reset. This is just an assumption; you could create a ping that sends a message every minute and see if it stays alive past the 10 minutes.

I only ever see the number of currently active connections in the db, and only subscriptions for those connections, from what I can see. My main problem is that if other things fail, like the event processing we've been discussing, there seem to be some gaps where $disconnect isn't invoked by AWS, meaning connections and subscriptions aren't cleaned up; hence adding the TTL.

IslamWahid commented 4 years ago

@fridaystreet thanks for your reply; it would be great if you could share your quick TTL solution.

michalkvasnicak commented 4 years ago

Yes, there is a 10-minute idle connection timeout: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html.

michalkvasnicak commented 4 years ago

@IslamWahid I think that a TTL index can be created on the createdAt column in the connections table.

michalkvasnicak commented 4 years ago

@fridaystreet do you apply TTL on subscriptions as well?

fridaystreet commented 4 years ago

@IslamWahid because I haven't had time to submit a PR to @michalkvasnicak, I've basically just extended his classes to add the extra functionality for now, and created our own custom versions of the subscription manager and connection manager.

Remember you'll also need to add the config to the serverless file to create the tables with a TTL.
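For illustration, a serverless.yml resource enabling TTL on a table might look roughly like this; the table name, key schema, and attribute definitions are placeholders you'd match to your own setup:

```yaml
# Sketch (placeholder names): a DynamoDB table with TTL enabled on a 'ttl' attribute
resources:
  Resources:
    SubscriptionsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: my-subscriptions-table
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: event
            AttributeType: S
          - AttributeName: subscriptionId
            AttributeType: S
        KeySchema:
          - AttributeName: event
            KeyType: HASH
          - AttributeName: subscriptionId
            KeyType: RANGE
        TimeToLiveSpecification:
          AttributeName: ttl
          Enabled: true
```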

// Assumes `dynamoDbClient` (a DynamoDB DocumentClient) and `config` are defined elsewhere.
const moment = require('moment');
const { ApiGatewayManagementApi } = require('aws-sdk');
const {
  DynamoDBSubscriptionManager,
  DynamoDBConnectionManager,
} = require('aws-lambda-graphql');

class CustomDynamoSubscriptionManager extends DynamoDBSubscriptionManager {

  constructor(props) {
    super(props);

    // Override subscribe to stamp both items with a 2 hour TTL (epoch seconds)
    this.subscribe = async (
      names,
      connection,
      operation
    ) => {
      const subscriptionId = this.generateSubscriptionId(
        connection.id,
        operation.operationId
      );

      // we can only subscribe to one subscription in a GQL document
      if (names.length !== 1) {
        throw new Error('Only one active operation per event name is allowed');
      }
      const [name] = names;

      await this.db
        .batchWrite({
          RequestItems: {
            [this.subscriptionsTableName]: [
              {
                PutRequest: {
                  Item: {
                    connection,
                    operation,
                    event: name,
                    subscriptionId,
                    operationId: operation.operationId,
                    ttl: moment().add(2, 'hours').unix(),
                  },
                },
              },
            ],
            [this.subscriptionOperationsTableName]: [
              {
                PutRequest: {
                  Item: {
                    subscriptionId,
                    event: name,
                    ttl: moment().add(2, 'hours').unix(),
                  },
                },
              },
            ],
          },
        })
        .promise();
    };
  }

}

const subscriptionManager = new CustomDynamoSubscriptionManager({
  dynamoDbClient,
  subscriptionsTableName: config.graphql.subscriptions.subscriptionsTableName,
  subscriptionOperationsTableName: config.graphql.subscriptions.subscriptionOperationsTableName,
});

class CustomDynamoDBConnectionManager extends DynamoDBConnectionManager {

  constructor(props) {
    super(props);

    // Override registerConnection to stamp the connection with a 2 hour TTL
    this.registerConnection = async ({
      connectionId,
      endpoint
    }) => {
      const connection = {
        id: connectionId,
        data: { endpoint, context: {}, isInitialized: false },
      };

      await this.db
        .put({
          TableName: this.connectionsTable,
          Item: {
            createdAt: new Date().toString(),
            id: connection.id,
            data: connection.data,
            ttl: moment().add(2, 'hours').unix(),
          },
        })
        .promise();

      return connection;
    };
  }
}

const connectionManager = new CustomDynamoDBConnectionManager({
  connectionsTable: config.graphql.subscriptions.connectionsTable,
  apiGatewayManager: process.env.IS_OFFLINE
    ? new ApiGatewayManagementApi({
        endpoint: 'http://localhost:4001',
      })
    : undefined,
  dynamoDbClient,
  subscriptions: subscriptionManager,
});

fridaystreet commented 4 years ago

@IslamWahid I think that TTL index can be created on createdAt column in connection table.

Would be nice, but the TTL attribute has to be a Unix epoch timestamp (and createdAt is stored as a date string).

IslamWahid commented 4 years ago

@fridaystreet @michalkvasnicak thanks for your help, guys! I would really like to get that into the library, so if you don't have much time I may prepare a PR for it.

IslamWahid commented 4 years ago

Also, it would be great if we could prepare some guidelines and best practices, like what @fridaystreet mentioned above, and get them documented.

michalkvasnicak commented 4 years ago

@IslamWahid

I would really like to get that in the library so if you don't have much time I may prepare a PR for it.

I'd really appreciate that. Could you please make it configurable in DynamoDBConnectionManager and DynamoDBSubscriptionManager?

Now I'm seriously thinking about a simple documentation website, something like react-hook-form has, because keeping this in the README is not really maintainable or developer friendly.

fridaystreet commented 4 years ago

@michalkvasnicak yeah, react-hook-form is a nice site, hey. Just found it the other week.

IslamWahid commented 4 years ago

@michalkvasnicak @fridaystreet could you please review the PR, so we can get it published ASAP?

michalkvasnicak commented 4 years ago

Marking this as closed by #90 for now.

@IslamWahid great work, thank you!