I've been thinking this through a bit. Would I be correct in thinking that a TTL on the subscriptions and operations tables would be just as well placed as the one on the events table?
I mean, the subscriptions and operations are null and void after 2 hours if AWS is going to ditch the connections anyway. Or maybe I missed something as to why they need to be kept?
Cheers Paul
@fridaystreet you're right, they should be removed, but this library relied on AWS API Gateway to call the `$disconnect` event, and it seems that isn't guaranteed by AWS.
I'm not sure we want the `SubscriptionManager` to return only live connections, because that'd mean we'd need to fetch the connection status over and over again. Maybe a combination of both is the way to go. Do you see any parts where it could be a breaking change? TTL is not a breaking change, but it requires you as a developer to set it up. Unsubscribing during event processing could help everyone just by updating the version of the package.
same issue here!
I see that when the page with subscriptions is open, every 10 minutes a new connection is opened with its subscriptions, the old one is no longer used, and entries keep piling up in the subscriptions table with no real connection in the connections table.
@michalkvasnicak good point on the id being reused. I'm not entirely sure. However, we have been running with just the TTL since I originally posted this and haven't experienced the issue again. I haven't had a chance to do any really in-depth tracing or load testing, so it could still be worth looking into a cleanup during event processing as well, but certainly as a quick fix, the TTL is working well.
I don't think either the TTL or the cleanup would cause a breaking change; at least for us it hasn't. Using Apollo Client, it just handles it and resubscribes to everything automatically.
Probably also worth noting, as per a previous conversation we were having about DynamoDB keys: we make sure to use very specific subscription event names that include the ids of relevant objects (e.g. `new_message_room_some_id`), so as to keep DynamoDB reads to a minimum, rather than broad names like `new_message` and then relying on subscription filtering after pulling all the subscriptions. This seems to scale and perform well so far.
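A rough sketch of that naming convention (the event names, helper, and resolver wiring here are illustrative, not something prescribed by the library):

```js
// Illustrative sketch only: encode the room id into the event name so the
// subscribers query only reads rows for that room, instead of publishing to a
// broad `new_message` event and filtering after pulling every subscription.
const { DynamoDBEventStore, PubSub } = require('aws-lambda-graphql');

const eventStore = new DynamoDBEventStore();
const pubSub = new PubSub({ eventStore });

// hypothetical helper that builds the specific event name
const newMessageEvent = (roomId) => `new_message_room_${roomId}`;

// publisher side, e.g. inside a sendMessage mutation resolver
async function publishNewMessage(roomId, text) {
  await pubSub.publish(newMessageEvent(roomId), { roomId, text });
}

// The Subscription resolver then subscribes to the same specific event name,
// so the DynamoDB query for subscribers only returns rows for that room.
```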
@fridaystreet thanks for the info. If I find time over the weekend I'll try to implement removal of dead connections stored in the table (ones that were not removed automatically by AWS API Gateway invoking the `$disconnect` event). TTL is the simplest solution, so maybe it needs to be documented too.
> Probably also worth noting, as per a previous conversation we were having about DynamoDB keys: we make sure to use very specific subscription event names that include the ids of relevant objects (e.g. `new_message_room_some_id`), so as to keep DynamoDB reads to a minimum, rather than broad names like `new_message` and then relying on subscription filtering after pulling all the subscriptions. This seems to scale and perform well so far.
Yes, this is something we were discussing with @AlpacaGoesCrazy in #86, along with using SQS as an event source.
@fridaystreet I took a look, and when sending to a connection fails, the connection should be deleted along with all of its subscriptions. So the problem (as you wrote) is in event processing: when processing of an event fails, we never try to send anything to the connection, so we don't know whether it's stale or not. We can't remove the connection in that case, because the error isn't necessarily caused by a stale connection. So in case of an error we should check whether the connection is stale or not.
@fridaystreet I'm not sure whether to check the connection after event processing fails or before the event is processed.

After event processing fails: we check the state of the specific connection only if event processing failed.

Before an event is processed: we check the state of the specific connection first, and only then process the event.

I think it's easier for developers to reason about event processing running only on active connections. But on the other hand, checking the state of the connection on every event could result in API throttling on the AWS side.
@michalkvasnicak yeah, I agree with you, and after reading your comments it would make sense to only check on failure, I think. That said, if the connection is canned (that is, killed by telling API Gateway to drop it, which I think is what `unregisterConnection` does, from memory), then the client will just resubscribe. At least in the case of Apollo it will.
It's been a while since I was looking at this, but I think where I was going was: if processing fails, just tell API Gateway to kill the connection and delete it, then leave it up to the client to resubscribe.
But I can see where you're going with 'only check the connection if it fails'. I'm assuming it would somehow attempt to send a message and, if that fails, delete the connection.
@fridaystreet
> But I can see where you're going with 'only check the connection if it fails'. I'm assuming it would somehow attempt to send a message and, if that fails, delete the connection.
There is an API, `getConnection()`, on `ApiGatewayManagementApi`. So I was thinking that on failure we'd just use this API to check whether the connection exists and, if not, unregister it. `sendToConnection()` already removes the connection and its subscriptions, see https://github.com/michalkvasnicak/aws-lambda-graphql/blob/master/packages/aws-lambda-graphql/src/DynamoDBConnectionManager.ts#L145.
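A rough sketch of that approach (not the library's actual implementation; `connectionManager` is assumed to be the `DynamoDBConnectionManager` instance, so `unregisterConnection` removes the connection and its subscriptions):

```js
// Sketch: after event processing fails for a connection, ask API Gateway whether
// the connection still exists; if it's gone (410 GoneException), unregister it
// so its subscriptions get cleaned up too.
const { ApiGatewayManagementApi } = require('aws-sdk');

async function dropConnectionIfStale(connection, connectionManager) {
  const api = new ApiGatewayManagementApi({ endpoint: connection.data.endpoint });

  try {
    await api.getConnection({ ConnectionId: connection.id }).promise();
    return false; // connection is still live, leave it alone
  } catch (err) {
    if (err.statusCode === 410) {
      // the connection no longer exists on the API Gateway side
      await connectionManager.unregisterConnection(connection);
      return true;
    }
    throw err; // some other failure, let the caller decide what to do
  }
}
```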
And yes, agreed. I think developers would assume it isn't processing stale connections. With that in mind, I was thinking that if the stale connections were filtered out first, you wouldn't process all those connections only to find out at the end that they failed. There could be all sorts of db lookups and other work going on to process an event, so it feels like the cost of the lookup at the beginning could be outweighed by the overhead of processing an event for stale connections only to determine at the end that the work was in vain.
Possibly it's even a case-by-case thing based on how resource-intensive it is to process an event. Maybe it's best to only process valid connections, or maybe it's lightweight enough that stale connections aren't an issue.
Dare I say... could it do both based on dynamic configuration? :-)
Yep, that's exactly what I was thinking: use `unregisterConnection` via the send. Sorry, I just couldn't remember the exact line off the top of my head.
I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.
> Dare I say... could it do both based on dynamic configuration? :-)
Yes, I think this one can be based on configuration. It'll need some nice explanation in the docs (maybe I should create a documentation website, because README.md is not really a good place to cover everything).
I'm leaning towards making the check before processing the default, even if it breaks how it works now, although that might mean some throttling from AWS for anyone using this with a high volume of events :(. But I'll do some research on the AWS limits.
> I'm a bit balls deep in trying to get v2 of our product out the door at the mo, but that is close to completion. I can't really do much at the mo, but I'd be happy to commit some time to your project after that to help out on anything you need. The project has really been quite pivotal in what we wanted to achieve.
Would you mind sharing the product (or after you finish)? :) Do you use this library, or do you have your own solution inspired by it?
Sorry for all the comments, it's provoking ideas. We've been running the TTL and it's made a massive improvement.
I reckon with the TTL on subscriptions etc., plus an attempted send on event-processing failure (effectively invoking `unregisterConnection` on failure), coupled with a filter on the DynamoDB call that fetches subscriptions to discard connections past their TTL, it would be bang on the money.
I mention the filter because, as you probably know, DynamoDB can take up to 48 hours to delete items past their TTL.
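A sketch of that filter, assuming the `ttl` attribute used in the snippets further down and a subscriptions table keyed by event name (table and attribute names are assumptions based on this thread, not the library's exact schema):

```js
// Because DynamoDB can take up to 48 hours to physically delete expired items,
// the query that loads subscribers also drops anything whose TTL has passed.
const { DynamoDB } = require('aws-sdk');
const db = new DynamoDB.DocumentClient();

async function liveSubscribers(subscriptionsTableName, eventName) {
  const result = await db
    .query({
      TableName: subscriptionsTableName,
      KeyConditionExpression: '#event = :event',
      // keep rows with no ttl (older items) or a ttl still in the future
      FilterExpression: 'attribute_not_exists(#ttl) OR #ttl > :now',
      ExpressionAttributeNames: { '#event': 'event', '#ttl': 'ttl' },
      ExpressionAttributeValues: {
        ':event': eventName,
        ':now': Math.floor(Date.now() / 1000),
      },
    })
    .promise();

  return result.Items;
}
```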
> Would you mind sharing the product (or after you finish)? :) Do you use this library, or do you have your own solution inspired by it?
Yeah, anything we do in relation to the library I'm happy to share and contribute. Our product isn't open source, so I can't share IP, but as I say, if it touches your library or is related to it, I obviously consider those open-source contributions. There are also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from DynamoDB, so I'm happy to share those as well.
Oh sorry, I just realised you meant just tell you what the product is. Yeah, of course. It's called ALEiGN, you can find it at http://aleign.com
All the info there references the current v1, which was an Express-based application built for an MVP; it wasn't massively scalable or consistent. v2 is a complete rewrite based on serverless, GraphQL and a new event-driven architecture, along with the frontend being completely rebuilt.
The sales/marketing guys are working through updating all the screenshots and reference guides now.
> Yeah, anything we do in relation to the library I'm happy to share and contribute. Our product isn't open source, so I can't share IP, but as I say, if it touches your library or is related to it, I obviously consider those open-source contributions. There are also a fair few things we've had to consider in terms of how to handle authentication and how to structure the event names to get the best performance from DynamoDB, so I'm happy to share those as well.
I'd really appreciate that, because I wasn't really using this library myself (I've changed jobs, so at the moment I'm working on something completely different).
> Oh sorry, I just realised you meant just tell you what the product is. Yeah, of course. It's called ALEiGN, you can find it at http://aleign.com
> All the info there references the current v1, which was an Express-based application built for an MVP; it wasn't massively scalable or consistent. v2 is a complete rewrite based on serverless, GraphQL and a new event-driven architecture, along with the frontend being completely rebuilt.
> The sales/marketing guys are working through updating all the screenshots and reference guides now.
Thank you for sharing! It's good to know that someone uses or is inspired by this library. 👍
@michalkvasnicak @fridaystreet guys, can you please explain this a bit more?
If the connections table is being cleaned correctly, why aren't its related subscriptions being deleted in this case?
Also, you're talking about the 2-hour AWS API Gateway `$disconnect` event. I'm seeing something different and I'm not sure if it's the same or not: every 10 minutes the Apollo client tries to start a new connection and re-establish the subscriptions again. I thought this was what's causing the increased number of subscriptions. Can you please confirm whether that's the same thing?
@IslamWahid yeah, I see the same thing every 10 minutes. I think you're right that the connections are re-established every 10 minutes through API Gateway, but that invokes `$disconnect`, which causes this library to clean up the connections and subscriptions automatically.
I think it might be the case that inactive connections are reset after 10 minutes, but if there is activity the idle counter is reset and the connection stays alive. Maybe 2 hours is the hard limit: regardless of activity, after 2 hours connections are reset. This is just an assumption; you could create a ping that sends a message every minute and see if the connection stays alive past the 10 minutes.
I only ever see the currently active connections in the db, and only subscriptions for those connections from what I can see. My main problem is that if other things fail, like the event processing we've been discussing, there seem to be gaps where `$disconnect` isn't invoked by AWS, meaning connections and subscriptions aren't cleaned up, hence adding the TTL.
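On the ping idea above, a bare-bones sketch of the experiment (raw WebSocket, the endpoint and message shape are placeholders):

```js
// Minimal experiment sketch: send something every minute so the connection
// never sits idle for the full 10 minutes; then see if it still drops at 2 hours.
const ws = new WebSocket('wss://your-api-id.execute-api.eu-west-1.amazonaws.com/dev'); // placeholder endpoint

ws.addEventListener('open', () => {
  setInterval(() => {
    ws.send(JSON.stringify({ type: 'ping', sentAt: Date.now() }));
  }, 60 * 1000); // once a minute, well under the 10-minute idle timeout
});
```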
@fridaystreet thanks for your reply, and it would be great if you could share your quick TTL fix.
Yes, there is a 10-minute idle connection timeout: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
@IslamWahid I think a TTL index can be created on the `createdAt` column in the connection table.
@fridaystreet do you apply TTL on subscriptions as well?
@IslamWahid because I haven't had time to submit a PR to @michalkvasnicak, I've basically just extended his classes to add the extra functionality for now and created our own custom versions of the subscription manager and connection manager.
Remember you'll also need to add the config to the serverless file to create the tables with a TTL.
```js
// Custom managers that extend the library's DynamoDB managers so every item is
// written with a `ttl` attribute that DynamoDB's TTL feature can expire.
const moment = require('moment');
const { ApiGatewayManagementApi, DynamoDB } = require('aws-sdk');
const {
  DynamoDBConnectionManager,
  DynamoDBSubscriptionManager,
} = require('aws-lambda-graphql');

const config = require('./config'); // our own app config holding the table names (path illustrative)
const dynamoDbClient = new DynamoDB.DocumentClient();

class CustomDynamoSubscriptionManager extends DynamoDBSubscriptionManager {
  constructor(props) {
    super(props);

    // Override subscribe() so subscription and operation items are written with
    // a ttl two hours in the future (matching API Gateway's hard limit).
    this.subscribe = async (names, connection, operation) => {
      const subscriptionId = this.generateSubscriptionId(
        connection.id,
        operation.operationId,
      );

      // we can only subscribe to one subscription in a GQL document
      if (names.length !== 1) {
        throw new Error('Only one active operation per event name is allowed');
      }
      const [name] = names;

      await this.db
        .batchWrite({
          RequestItems: {
            [this.subscriptionsTableName]: [
              {
                PutRequest: {
                  Item: {
                    connection,
                    operation,
                    event: name,
                    subscriptionId,
                    operationId: operation.operationId,
                    ttl: moment().add(2, 'hours').unix(),
                  },
                },
              },
            ],
            [this.subscriptionOperationsTableName]: [
              {
                PutRequest: {
                  Item: {
                    subscriptionId,
                    event: name,
                    ttl: moment().add(2, 'hours').unix(),
                  },
                },
              },
            ],
          },
        })
        .promise();
    };
  }
}

const subscriptionManager = new CustomDynamoSubscriptionManager({
  dynamoDbClient,
  subscriptionsTableName: config.graphql.subscriptions.subscriptionsTableName,
  subscriptionOperationsTableName:
    config.graphql.subscriptions.subscriptionOperationsTableName,
});

class CustomDynamoDBConnectionManager extends DynamoDBConnectionManager {
  constructor(props) {
    super(props);

    // Override registerConnection() so connection items also carry a ttl.
    this.registerConnection = async ({ connectionId, endpoint }) => {
      const connection = {
        id: connectionId,
        data: { endpoint, context: {}, isInitialized: false },
      };

      await this.db
        .put({
          TableName: this.connectionsTable,
          Item: {
            createdAt: new Date().toString(),
            id: connection.id,
            data: connection.data,
            ttl: moment().add(2, 'hours').unix(),
          },
        })
        .promise();

      return connection;
    };
  }
}

const connectionManager = new CustomDynamoDBConnectionManager({
  connectionsTable: config.graphql.subscriptions.connectionsTable,
  apiGatewayManager: process.env.IS_OFFLINE
    ? new ApiGatewayManagementApi({
        endpoint: 'http://localhost:4001',
      })
    : undefined,
  dynamoDbClient,
  subscriptions: subscriptionManager,
});
```
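As noted above, the tables themselves also need TTL enabled on the `ttl` attribute (in our case via the serverless config). A rough equivalent sketch using the AWS SDK directly, reusing the same config object as the snippet above:

```js
// Sketch: enable TTL on the `ttl` attribute for each table. This mirrors the
// TimeToLiveSpecification you would declare on the table resources in serverless.yml.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

async function enableTtl() {
  const tables = [
    config.graphql.subscriptions.connectionsTable,
    config.graphql.subscriptions.subscriptionsTableName,
    config.graphql.subscriptions.subscriptionOperationsTableName,
  ];

  for (const tableName of tables) {
    await dynamodb
      .updateTimeToLive({
        TableName: tableName,
        TimeToLiveSpecification: { AttributeName: 'ttl', Enabled: true },
      })
      .promise();
  }
}
```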
> @IslamWahid I think a TTL index can be created on the `createdAt` column in the connection table.
Would be nice, but the TTL attribute has to be a unix timestamp (the `createdAt` in the code above is stored as a date string).
@fridaystreet @michalkvasnicak thanks for your help, guys! I would really like to get this into the library, so if you don't have much time I can prepare a PR for it.
Also, it would be great if we could document some guidelines and best practices, like what @fridaystreet mentioned above.
@IslamWahid
> I would really like to get this into the library, so if you don't have much time I can prepare a PR for it.
I'd really appreciate that. Could you please make it configurable in `DynamoDBConnectionManager` and `DynamoDBSubscriptionManager`?
Now I'm really thinking about a simple documentation website, something like react-hook-form has, because keeping all of this in the README isn't really maintainable or developer friendly.
@michalkvasnicak yeah, the react-hook-form site is nice hey, just found it the other week.
@michalkvasnicak @fridaystreet could you please review the PR, so we can get it published ASAP?
Marking this as closed by #90 for now.
@IslamWahid great work, thank you!
Not entirely sure if this is an issue or just working as expected. I'm just trying to get my head around how the cleanup should work, as I seem to have lots of old subscriptions. Looking through the CloudWatch logs, it looks like all the old subscriptions for a particular event are pulled down each time and then discarded, but essentially it's making the Lambda function run longer and longer each time.
My understanding was that subscriptions and subscription operations are removed when they go stale, like connections. That doesn't seem to be the case, at least in our implementation.
Looking at the code base, I can see that `unsubscribeAllByConnectionId` is called in `unregisterConnection`, which is called when the `$disconnect` event is received.
I think there might be a couple of gaps there. Maybe it should also be called when the events processor is iterating through the events and executing each event for all the subscribers. If that execution fails for a subscriber, I think maybe the subscription should be removed and the client informed so it can resubscribe.
Possibly also in `closeConnection` in the connection manager, although my assumption here is that triggering API Gateway to close the connection triggers `$disconnect`, which in turn should be doing the cleanup.
So anyway, I'm not sure if there are circumstances where relying on the 2-hour `$disconnect` from API Gateway falls through the cracks sometimes, leaving subscriptions that are never cleaned up. Possibly a check to see whether the connection in the subscription is valid before executing it in the event processor? Or maybe that should be the role of the subscription manager: it shouldn't return any subscriptions to the event processor if they don't have valid connections?
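To make the "remove the subscription when execution fails" idea concrete, a hypothetical sketch (the helper names are made up and this is not the library's current event processor):

```js
// Hypothetical sketch, not the library's current behaviour: if executing an
// event for a subscriber throws, drop that connection and all of its
// subscriptions instead of leaving them behind for the next event.
async function processEventForSubscribers(event, subscribers, connectionManager) {
  for (const subscriber of subscribers) {
    try {
      // executeAndSend is an assumed helper: run the subscriber's GraphQL
      // operation against the event payload and push the result to its connection
      await executeAndSend(event, subscriber);
    } catch (err) {
      // Treat the failure as a potentially stale connection and clean it up.
      // A well-behaved client such as Apollo will simply reconnect and resubscribe.
      await connectionManager.unregisterConnection(subscriber.connection);
    }
  }
}
```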