Cache dump to disk - Githubissues

Ortovoxx commented 3 years ago

Currently creates a .sessions folder with just sessions.json and guild.json to then be restored at a later date.

Fails when constructing roles.

Added a cache options array for client options but might want to take out and just have it auto dump to disk with the hot reloading option

timotejroiko commented 3 years ago

i havent forgotten about this yet, im working on some other projects. I'll revisit this soon

Ortovoxx commented 3 years ago

Need to add the _unpatch method to the bot user and take a look at client options etc to do some cleaning up

timotejroiko commented 3 years ago

Nice job with the unpatch.

Here are my thoughts about it:
There is enough cache control as is, there is no need for further cache control when rehydrating, lets just dump everything that is currently cached and restore everything. Its the only way to make sure no weird things happen and its also the closest thing to a true session reload, zero data loss.

With that said, the unpatch method should be moved to the client, and should loop over users, channels and guilds.
From each user, get the raw data according to the api, for each channel to the same, and for each guild do the same except for guild channels.
Guild channels should be rehydrated from the channel cache as a reference since they are already linked by reference anyway.

As for the exit event, i dont feel there is a need for that. if the user wants custom cache dumping behavior they should add their own event listeners to exit, sigint, etc... and use the newly exposed client.unpatch method. We only need a single function to dump all, and a single client option to specify the source to rehydrate all (can be file path or object).

All related client options can then be moved into a single object, something like this:

Client({
    hotreload: {
        enabled: true,
        sessionData: { // defaults to ".sessions/sessions.json", can also be passed an object like this
            0: { id: "wef23bf", sequence: 4234 },
            1: { id: "wef23bf", sequence: 4234 }
        },
        unpatchOnExit: true, // if false, users can use client.unpatch in their own exit listeners
        // onUnpatch: cache => {} // alternatively replace the above with something like this
        patchSource: "./cache.json" // defaults to ".sessions/cache.json", can also be passed a cache object for people who use redis etc...
    }
})

We have two possibilities here, we could keep sessions and cache separate, each with its own dump method and restore option, or simplify it even further and put everything together.

Another issue is getting all of this to be compatible with the sharding manager, an example being that processes cannot write concurrently to the same file, but this can be dealt with later on.

Generally the direction i want to follow is to try and make this as simple as possible, nothing too complex that would change discord.js too much. Any super-advanced feature would be better suited for a future standalone library.

Ortovoxx commented 3 years ago

Simplifying it down is a good idea and it's better to use the client options already presented when dumping as you say it saves weird things from happening and keeps it simple.

Although I think that there should still be a private unpatch method for channels, guilds and members which will return that objects raw discord api data and then in the client there should be a public unpatch method which loops over all of the objects we want to cache and calls their private unpatch method.

Hydrating like that sounds like a good solution though.

Yes I agree with the exit events it's better to leave the customisation up to the user I see you added a client option in for turning our native disk caching off which is good. Using the client.unpatch to get all the cache is a better solution as it allows for more extension by users.

In terms of how the cache should be stored it's tricky. I think that sessions should be stored separate to the cache as it allows for potential swapping of caches and sessions across machines etc. The simplicity of this will not matter as much as the user will have limited exposure to the .sessions folder.

Potentially we create a guilds, channels and users subfolder (or file prefix) and store their data in a file mapped by their IDs. Ie we have .sessions/guilds/581072557512458241.json with all the data about that guild. I don't know about the efficiency of writing all of these files to disk like that but is an alternative way of storing them instead of having 1 massive JSON file. Might work better if people want to restore cache across shards when resharding? Might also solve some problems with the sharding manager later down the line as they won't have to write to the same file

I agree that we should wait until we have internal sharding working and then look at the sharding manager and make changes to work with that.

timotejroiko commented 3 years ago

yes, private unpatch for channel user and guild is good.

we can try going with the one file per id and see how it will works out. performance will likely be atrocious for very large bots, but at that point the user should switch to a better backend anyway so it should be fine.

lets try something like this for the client options:

Client({
    hotreload: {
        enabled: true,
        sessionData: { // if not provided, load sessions from .sessions/sessions/[id].json
            0: { id: "wef23bf", sequence: 4234 },
            1: { id: "wef23bf", sequence: 4234 }
        },
        cacheData: { // if not provided, load data from .sessions/guilds/[id].json etc...
            guilds: { ... },
            channels: { ... },
            users: { ... }
        },
        onUnload: sessions => { // if not provided, store sessions in json files
            // else do something with sessions
        },
        onUnpatch: data => { // if not provided, store data in json files
            // else do something with data
        }
    }
})

it kinda goes against what i said before about the exit event, but maybe it will turn out to be simpler like this after all.

we dont need to support file paths since if the user wants a custom file, they might as well just load it themselves then just give us the object

Ortovoxx commented 3 years ago

Okay that all looks good! Although I will remove the enabled property as users can disable it by setting hotreload to false and enable it by setting it to an object with all of their settings.

Maybe better to combine the dumping of sessions and data into 1 function?

I'll start implementing some features over the next week or so

timotejroiko commented 3 years ago

could be, but then there would be no way of turning them off individually, a single function would shut down the json files option for both, which might not be a big deal tbh, maybe it will be good to keep it all-or-nothing like everything else in the lib

Ortovoxx commented 3 years ago

True, thinking about it there doesn't seem to be a use case where someone would want to use one but not the other. We'll stick with the 1 for now and can split them into two separate functions if the need arises later

Ortovoxx commented 3 years ago

Just added some helper methods and general scaffolding. Need to now add in the storing of cache on exit and the patching of cache back in on startup. Will probably have a work on it next weekend

this._patchCache() Needs to take in { guilds, channels, users } and rehydrate the caches. The cache supplied is either the user supplied cache or the cache from disk if the user supplies none. Shouldn't error if no cache supplied.

this.dumpCache() Should take a snapshot of all the cache and return it in discord api form( but not do anything with it hence why it's public for people to add their own cache storage solution ). Need to go through channel, guild and user to create a _unpatch() method for each which will return their data in discord api form. Just loop through everything in cache and call that method for each.

this.onUnload() Should call dumpCache and store all of that to disk. Users can override this function with their own cache storage mathods

Ortovoxx commented 3 years ago

Going to need your help as to were to place the _unpatch() methods on the user and channels as well as writing the rehydration part as I don't think I have enough knowledge in how all discord.js types and classes interact.

Just the _unpatch() methods and cache rehydration to go.

timotejroiko commented 3 years ago

we can use the sharding formula for guilds yes, good idea. channels can be obtained from their guildID, falling back to shard 0 if they dont have one. users should be tied to members, as members are loaded, so will their users. in fact maybe we dont need a user cache at all, i'll look into it

timotejroiko commented 3 years ago

the cache storing should be fine like this for now, we dont actually need DM users nor DM channels. GuildMembers already include the user data, and we can include the channel data from guilds as well.

Now we need to move the loading of the cache to the websocket manager, so the loading only happens after we have a list of shard IDs. From there we either pick matching data from the user-supplied cache object, or we load the matching data from the json files.

timotejroiko commented 3 years ago

should be mostly ready for testing, let me know if theres anything i missed

Ortovoxx commented 3 years ago

There is a slight problem. Because we only fetch the web socket shard session data from disk inside identify() the web socket manager is currently queueing up shards like it normally would and is not fast connecting. The only way to fix this would be to know how many shards that manager is expected to handle at start up and whether we can correctly resume for them meaning we would have to read all the data from disk at start up.

I think that we definitely want to keep a way to fast connect as if you have lots of shards you would be waiting unnecessary time.

Fixed with https://github.com/timotejroiko/discord.js-light/pull/41/commits/76f762225ffbc9ecdf3861ce5b091c86b17f6e89

Ortovoxx commented 3 years ago

Fixed a bunch of the minor issues and tested it. Seems to work pretty smoothly so far!

We need to consider what we do when a client increases their shard size and still provides session data as currently if 2 shard sessions are stored and then the user increases their shards to 3 and loads those sessions which resumes 2 and identifies for 1 more. This is obviously not the desired behaviour as we now have 2 sessions handling all guilds and then 1 more session handling 1/3 of guilds.

Potentially we can just force identify all shards if a different shard count is present when compared to the number of stored shards. However this presents a problem for when a user downshards ( as there will be redundant shards stored on disk ) or if a user simply wants the behavior mentioned above ( for a user potentially wanting to increase the sessions handled by a client )

We can solve this by adding a hotReload.forceIdentify option but the same effect can be achieved by not providing any data ( but would mean guild cache is not loaded )

Lots of the problems that arise from this are edge cases I understand but the first issue is definitely not and most users would encounter a problem here. I will leave it up to you to decide how you with to handle it.

timotejroiko commented 3 years ago

There is a slight problem. Because we only fetch the web socket shard session data from disk inside identify() the web socket manager is currently queueing up shards like it normally would and is not fast connecting. The only way to fix this would be to know how many shards that manager is expected to handle at start up and whether we can correctly resume for them meaning we would have to read all the data from disk at start up.

I think that we definitely want to keep a way to fast connect as if you have lots of shards you would be waiting unnecessary time.

Fixed with 76f7622

I have a suggestion:

Move the session loading to the manager. We need the session to check if we should resume or reidentify, we already have a shard id in createShards(), so we take that id and load the session file there if revelant.

timotejroiko commented 3 years ago

Fixed a bunch of the minor issues and tested it. Seems to work pretty smoothly so far!

We need to consider what we do when a client increases their shard size and still provides session data as currently if 2 shard sessions are stored and then the user increases their shards to 3 and loads those sessions which resumes 2 and identifies for 1 more. This is obviously not the desired behaviour as we now have 2 sessions handling all guilds and then 1 more session handling 1/3 of guilds.

Potentially we can just force identify all shards if a different shard count is present when compared to the number of stored shards. However this presents a problem for when a user downshards ( as there will be redundant shards stored on disk ) or if a user simply wants the behavior mentioned above ( for a user potentially wanting to increase the sessions handled by a client )

We can solve this by adding a hotReload.forceIdentify option but the same effect can be achieved by not providing any data ( but would mean guild cache is not loaded )

Lots of the problems that arise from this are edge cases I understand but the first issue is definitely not and most users would encounter a problem here. I will leave it up to you to decide how you with to handle it.

We can store the total shards in each session object as well since the session ID literally depends on it. IF the shard file exists but the total shards in the file is different than the total shards in the client, then reidentify instead of resume

Ortovoxx commented 3 years ago

These are both better ideas. Just added them along with some extra debug logs. Works like a charm

https://github.com/timotejroiko/discord.js-light/pull/41#issuecomment-812840110 Resolved by https://github.com/timotejroiko/discord.js-light/pull/41/commits/7280818342e76baeb7c16c009b9174cce089bcb4

https://github.com/timotejroiko/discord.js-light/pull/41#issuecomment-812840261 Resolved by https://github.com/timotejroiko/discord.js-light/pull/41/commits/a1607044e3fe35947de8f57750dfddaab5466eae

timotejroiko commented 3 years ago

this essentially needs to be started from scratch now, if there is still interest. However, djs maintainers also stated a certain interest in making something like this a thing in djs v14 or djs-next, so im not sure if its worth working on this.

timotejroiko / discord.js-light

Cache dump to disk #41