`chathistory` Support

andymandias commented 1 month ago

Work in progress PR to add IRCv3 CHATHISTORY support (i.e. #206). Since it's not a widely available feature yet, the goal is to minimize any effects on Halloy's operation when CHATHISTORY is not available. Testing on ircs://irc.ergo.chat:6697/ to start.

Currently (more-or-less) implemented:

Request up to the latest 500 messages, or all messages since the latest message in channel history, when joining a channel.
Request up to 500 messages to add to channel history when scrolling to the top of a channel buffer.
Basic message deduplication.

Planned:
[x] Enable event-playback to allow replay of channel events (JOIN/PART/QUIT/etc). HistServ appears to be non-standard, so all its messages are filtered out for now for parity of experience. ~~Convert HistServ PRIVMSG messages into the appropriate JOIN/PART/QUIT/etc messages.~~
[x] Ensure history is loaded before requesting latest messages after JOIN.
[x] Ensure messages received when starting Halloy are marked as new.
[x] Request messages when reconnecting to a server (without closing Halloy; current testing shows this is covered by JOIN messages request(s)).
[x] Improve UX for loading additional channel history (button at end of history & config option to automatically request when scrolling to top of history).
[x] Do not move the backlog divider when adding older messages to a buffer.
[x] Test multi-request LATEST functionality (artificially limit the size of batches).
[x] Improve performance of deduplication (utilize something like an indexmap for message history?). And/or, avoid the need for deduplication.
[x] Use TARGETS to open Query buffers for direct messages sent while disconnected.
[x] Add get latest messages & load older messages to Query buffers.
[x] Get message ids for user-sent messages.
[x] Set up logic for the possibility of a chathistory request timeout.

tarkah commented 1 month ago

@andymandias I'll look to review this very soon :eyes:

andymandias commented 1 month ago

Thanks @tarkah! It's definitely still a work in progress, in particular query buffers don't have chathistory support yet. There are also a number of smaller issues to work out (will try to add these to the list at the top of the PR). I am currently using it with soju (timestamp-based) and ircs://ergo.chat/#ergo (msgid-based) and it ~works.

Things I think particularly need reviewing:

I'm committing some crimes against asynchronicity with load_history_now and make_partial_now (starting in main.rs, going into manager.rs). I believe that functionality should instead be implemented by a chain of asynchronous commands (load history → get latest message), but I wasn't quite sure how to set up the chain. I set it aside to deal with later, but I suspect you may have an even better solution.
It's useful to have a Full history in order to have a clear reference message when adding messages to the history. I've been working on making all processes work with a Partial history, but in the meantime I've modified make_partial to allow converting from a Full-with-unread-messages history to a Partial history. I've done my best to respect the existing structure, but it probably warrants review.

andymandias commented 1 month ago

Brief addendum to note it looks like my next push will involve a moderate restructuring, so probably best to review after (hopefully tonight).

andymandias commented 1 month ago

I'll be chipping away at this still, but it's probably in a good spot for a WIP review now.

tarkah commented 1 month ago

I'm a bit unsure about all the API changes to history to support this feature. I need to internalize it all more, but it seems to me we shouldn't have so many new code paths. All we're doing is feeding messages to history like we would any other message that comes in, the only difference is we may want to splice it into place vs appending it.

So it seems to me the only API change of history needed would be the following:

Did the message come in via history?
If yes, splice it into place using message reference
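
A minimal sketch of that single decision point, assuming a plain `Vec` buffer and string message ids. `Message`, `Source`, and `add_message` are hypothetical names for illustration, not Halloy's actual types or API:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub id: String,
    pub text: String,
}

/// Where a message came from (hypothetical type, for illustration only).
pub enum Source {
    /// Received live from the server: append to the end.
    Live,
    /// Received via chathistory: place it just before the message with this id.
    History { before_id: String },
}

/// Append live messages; splice history messages in front of their reference.
pub fn add_message(history: &mut Vec<Message>, message: Message, source: Source) {
    match source {
        Source::Live => history.push(message),
        Source::History { before_id } => {
            // Assumption: fall back to appending if the reference is not in the buffer.
            let at = history
                .iter()
                .position(|m| m.id == before_id)
                .unwrap_or(history.len());
            history.insert(at, message);
        }
    }
}
```

With this shape, live and replayed messages share one entry point and differ only in where they land in the buffer.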

It'd be great to sync up some time to discuss this over IRC so I can better understand the model at a high level as there's a lot to internalize. I'm also seeing async functions getting called in from main but w/out awaiting which is a no-op.
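
For readers unfamiliar with why an un-awaited call is a no-op: Rust futures are lazy, so calling an `async fn` only builds a future and runs none of its body until something polls it. The sketch below shows this with the standard library only. `load_history` here is a stand-in, not the PR's function, and `block_on` is a toy busy-polling executor written for this example:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls the future in a loop until it completes.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

/// Stand-in for an async history load; it only records that it ran.
pub async fn load_history(loaded: &AtomicBool) {
    loaded.store(true, Ordering::SeqCst);
}

/// Returns the flag after an un-awaited call and after a driven call.
pub fn demo() -> (bool, bool) {
    let loaded = AtomicBool::new(false);
    // The future is created and dropped without being polled: the body never runs.
    let _ = load_history(&loaded);
    let after_unawaited_call = loaded.load(Ordering::SeqCst);
    // Driving the future to completion runs the body.
    block_on(load_history(&loaded));
    let after_driven_call = loaded.load(Ordering::SeqCst);
    (after_unawaited_call, after_driven_call)
}
```

In an application, the future would instead be handed to the runtime (or awaited from another async context) rather than busy-polled.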

tarkah commented 1 month ago

@andymandias how does this feature interact w/ bouncers that send a replay buffer on connect? Are the bouncers programmed to only send one or the other, or will both get sent? If both are sent, do we just rely on our dedupe strategy to eliminate dupes, or do we have some other mechanism to handle this?

andymandias commented 1 month ago

@tarkah the description in the IRCv3 specification says that the replay buffer should not be sent automatically when a client has negotiated chathistory. In my experience with soju that is the case (ZNC, which I've switched off of, does not support chathistory to my knowledge). I think we should be prepared to potentially get dupe messages around join time: even if a server follows the IRCv3 specification, we could end up with some dupes if, for example, a message is sent to the channel right after we join but before the server receives our associated chathistory request. But my feeling at the moment is that our dedupe strategy is sufficient for that purpose.

andymandias commented 1 month ago

Another restructuring, to better allow for using LATEST then repeated BETWEENs to update when joining a channel. The old scheme using repeated AFTER would fail to receive any messages when the reference message was no longer in the server's available message history. The restructuring should also make it easier to utilize the TARGETS subcommand, which I plan to work on next.
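
A rough sketch of the LATEST-then-BETWEEN scheme against a mock server, to show why it still works when the local reference has aged out of the server's history. Everything here is an assumption for illustration: messages are reduced to integer timestamps, `latest` and `between` stand in for the corresponding chathistory requests, and `catch_up` is a hypothetical name:

```rust
/// Mock LATEST: the newest `limit` messages the server still has.
fn latest(server: &[u64], limit: usize) -> Vec<u64> {
    let start = server.len().saturating_sub(limit);
    server[start..].to_vec()
}

/// Mock BETWEEN: up to `limit` messages strictly between `older` and `newer`,
/// taking the ones closest to `newer`, returned in ascending order.
fn between(server: &[u64], newer: u64, older: u64, limit: usize) -> Vec<u64> {
    let in_range: Vec<u64> = server
        .iter()
        .copied()
        .filter(|&t| t > older && t < newer)
        .collect();
    let start = in_range.len().saturating_sub(limit);
    in_range[start..].to_vec()
}

/// Collect every message newer than the local reference, paging backwards.
pub fn catch_up(server: &[u64], local_latest: Option<u64>, limit: usize) -> Vec<u64> {
    let mut batch = latest(server, limit);
    // Empty local history: a single LATEST request is all that is made.
    let Some(reference) = local_latest else {
        return batch;
    };
    let mut collected: Vec<u64> = Vec::new();
    loop {
        let full = batch.len() == limit;
        let reached = batch.iter().any(|&t| t <= reference);
        batch.retain(|&t| t > reference);
        // Each new batch is older than what was collected so far, so it goes in front.
        batch.extend(collected);
        collected = batch;
        // A short batch, or one that overlaps the reference, means we are caught up.
        if !full || reached || collected.is_empty() {
            break;
        }
        batch = between(server, collected[0], reference, limit);
    }
    collected
}
```

Because the first request does not depend on the reference existing on the server, a stale reference only changes where paging stops, not whether any messages arrive.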

Should be a bit clearer in intent and use than before, but it doesn't change any of the problem areas of the PR (History and asynchronicity).

andymandias commented 1 month ago

Still testing, but at the moment this is what I consider feature complete. There are a couple additions that probably warrant explanation.

I added read_marker to message histories to store the timestamp of the last read message in a channel/query. It operates very similarly to opened_at, but the intention is to have an RFC 3339 timestamp that can be saved and loaded separately from the messages in a history. Then, when messages arrive we can know whether they trigger unread state without loading the message history and looking for a duplicate. I'm not aiming to implement draft/read-marker here, but the intention is for it to be usable for that feature. I have been a bit lazy in reading these synchronously using std::fs; I'm hoping they are small enough reads that that's acceptable. (As a side benefit, read_marker allows unread state to persist across application close/open.)
Since I was adding to the files stored in the history directory (to store the read_marker separate from the message history), I took this as an opportunity to tweak the message storage scheme. I've only done this because I recalled discussion about making the stored messages a bit easier for users to access; if it's out of scope for this I'm happy to revert it. (I would probably use a hash for the read_marker filename in that case.)
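
As a sketch of the read_marker idea (not the PR's code): a small file holds one RFC 3339 timestamp, it is read synchronously with `std::fs`, and an arriving message's server_time is compared against it. The function names, the one-timestamp-per-file layout, and treating a missing marker as "unread" are assumptions. The string comparison is only valid under the further assumption that every timestamp uses the same fixed-width UTC form (`YYYY-MM-DDThh:mm:ss.sssZ`):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write the marker (an RFC 3339 timestamp string) to its own small file.
pub fn save_read_marker(path: &Path, marker: &str) -> io::Result<()> {
    fs::write(path, marker)
}

/// Synchronously read the marker back; `None` if the file is missing or unreadable.
pub fn load_read_marker(path: &Path) -> Option<String> {
    fs::read_to_string(path).ok().map(|s| s.trim().to_string())
}

/// A message triggers unread state only if it is newer than the marker.
/// Assumption: both strings are fixed-width UTC timestamps, so lexicographic
/// order matches chronological order. With no marker, treat the message as unread.
pub fn triggers_unread(read_marker: Option<&str>, server_time: &str) -> bool {
    match read_marker {
        Some(marker) => server_time > marker,
        None => true,
    }
}
```

A real implementation would more likely parse the timestamps into a date-time type rather than rely on string ordering.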

tarkah commented 2 weeks ago

@andymandias @casperstorm Let's find some time to discuss the scope and desired UX of this feature. The PR is now ~2k LOC and makes a lot of API changes and I don't want to dive in and start making suggestions or changes until we are all aligned on scope & UX.

casperstorm commented 2 weeks ago

> @andymandias @casperstorm Let's find some time to discuss the scope and desired UX of this feature. The PR is now ~2k LOC and makes a lot of API changes and I don't want to dive in and start making suggestions or changes until we are all aligned on scope & UX.

We almost needed an RFC for this PR 😅 Perhaps, @andymandias, you could do a small writeup of this PR, including some of the decisions you have made along the way. I would love to read something like that before digging into this monster.

andymandias commented 1 week ago

@casperstorm @tarkah I may not be able to commit much code to this PR for a bit, but I will do my best to explain the intended features along with an overview of the significant implementation decisions made in service of those features. I'm going to try and keep it relatively high level to avoid getting lost in the weeds, but will be available to answer any follow-up questions you may have.

As the PR currently stands, the main features are to use chathistory to do the following:

When connecting to a server, get all new messages since the last login. This is intended to, as much as possible, "just work" (no user interaction necessary). This is currently implemented via three mechanisms:
- A TARGETS request is made when chathistory support is acknowledged by the server. TARGETS is used to discover any queries that were made while the client was not connected, and then chathistory requests for the latest messages in those queries are made.
- Any time a channel is joined, a request is made via chathistory for the latest messages in that channel.
- A request is made for the latest messages in any queries that are open on launch.
- For all three mechanisms above, "latest messages" means all messages since the most recent message in the query/channel history (potentially requiring multiple chathistory requests). If there are no messages in the history, then a single LATEST chathistory request is made. The maximum number of messages of a request is set by the server, but I also set a client-side maximum at 500 messages (not attached to that particular number). TARGETS uses a targets_marker to set its request boundaries, more on that later.
When the user reaches the end of a channel/query history, they can request older messages from that channel/query. Currently this is either done automatically (whenever the user scrolls to the very end of the history; there is an option to turn off this functionality), or manually (by clicking a button that has been placed at the very end of the history). This submits one chathistory request for messages in the channel/query before the earliest message that exists in its client-side history.
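
To make the two request shapes concrete, here is a sketch of how the command lines could be built. The `CHATHISTORY LATEST` and `CHATHISTORY BEFORE` forms follow the IRCv3 chathistory specification; the `Reference` enum, the function names, and passing the server's limit as an `Option<u16>` are assumptions for this example, not the PR's API:

```rust
/// A point in history, by timestamp or by message id.
pub enum Reference {
    Timestamp(String),
    MsgId(String),
}

impl Reference {
    fn to_param(&self) -> String {
        match self {
            Reference::Timestamp(t) => format!("timestamp={t}"),
            Reference::MsgId(id) => format!("msgid={id}"),
        }
    }
}

/// Client-side cap described above (the exact number is not load-bearing).
pub const CLIENT_LIMIT: u16 = 500;

/// Use the smaller of the client cap and the server's advertised maximum, if any.
pub fn effective_limit(server_limit: Option<u16>) -> u16 {
    server_limit.map_or(CLIENT_LIMIT, |s| s.min(CLIENT_LIMIT))
}

/// Latest messages; `*` means "no reference", used when the local history is empty.
pub fn latest(target: &str, since: Option<&Reference>, limit: u16) -> String {
    let reference = since
        .map(Reference::to_param)
        .unwrap_or_else(|| "*".to_string());
    format!("CHATHISTORY LATEST {target} {reference} {limit}")
}

/// Older messages, before the earliest message in the client-side history.
pub fn before(target: &str, earliest: &Reference, limit: u16) -> String {
    format!("CHATHISTORY BEFORE {target} {} {limit}", earliest.to_param())
}
```

For example, joining `#halloy` with no stored history and a server maximum of 100 would produce `CHATHISTORY LATEST #halloy * 100`.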

I'm not opposed to adding further features, but I wanted to keep the scope of this PR as minimal as possible.

There are two major changes to History to support these features:

Messages are no longer stored in the order they are received. Instead the messages list is sorted based on each message's server_time, and new messages are inserted accordingly. Sorting is primarily done to enable deduplication, since duplicate messages are an expected possibility when requesting messages via chathistory. New messages are checked for duplicates against messages with similar server_time, based on the message's id (if available) or the message's server_time and contents (otherwise, excepting some special cases). @tarkah deserves essentially all of the credit for this, and none of the blame for any aspect that may be broken.
A History now has a read_marker instead of an opened_at. read_marker is essentially an opened_at that is written to the filesystem, and it serves a similar purpose. So, the backlog marker is placed via read_marker in a history in nearly the exact same manner as it was done via opened_at. But a read_marker allows for the determination of whether a message is "new" at the time it is inserted into a History::Partial. Since a History::Partial cannot be expected to have messages available for deduplication, duplicate detection cannot be used as a proxy for message newness. (I don't think we want to load message history every time a new message is received, in order to check message newness.) Instead, the read_marker is checked against any arriving message's server_time in order to only trigger_unread when the server_time is newer than the read_marker. The read_marker is updated in the exact same manner that opened_at was, except that it is also saved to disk. (This has the side benefit of persisting that information across program close/open.)
- I added a new file associated with each server/channel/query history, in which the read_marker is written. I didn't want to store it with the message history, since its main purpose is to avoid loading that history and I wasn't sure how to partially load a compressed history (or partially compress a history file). I took this as an opportunity to make the stored history naming less opaque (i.e. to name history files based on server/buffer.json.gz rather than a hash). Renaming the histories does have one functional use: we can expect that histories with the old naming scheme will not have been written sorted (so we should sort them on load), but histories with the new naming scheme will presumably be stored sorted (and won't need to be sorted on load). We could just sort everything though, and continue with hashes for all of the files.
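
The sorted-insert-with-deduplication idea in the first change can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `server_time` is simplified to milliseconds since the Unix epoch, the one-second `WINDOW_MS` for "similar server_time" is a made-up value, and the special cases mentioned above are not modeled:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    /// Milliseconds since the Unix epoch (a simplification for this sketch).
    pub server_time: u64,
    pub id: Option<String>,
    pub content: String,
}

/// How far apart two server_times may be and still be compared (hypothetical value).
const WINDOW_MS: u64 = 1_000;

/// Compare by id when both messages have one, otherwise by time and content.
fn is_duplicate(existing: &Message, new: &Message) -> bool {
    match (&existing.id, &new.id) {
        (Some(a), Some(b)) => a == b,
        _ => existing.server_time == new.server_time && existing.content == new.content,
    }
}

/// Insert `new` into a history sorted by server_time.
/// Returns false (and leaves the history unchanged) if it is a duplicate.
pub fn insert_sorted(history: &mut Vec<Message>, new: Message) -> bool {
    // Binary-search the slice of messages with a similar server_time.
    let lo = history.partition_point(|m| m.server_time < new.server_time.saturating_sub(WINDOW_MS));
    let hi = history.partition_point(|m| m.server_time <= new.server_time.saturating_add(WINDOW_MS));
    if history[lo..hi].iter().any(|m| is_duplicate(m, &new)) {
        return false;
    }
    // Insert after any messages with an equal server_time to keep arrival order stable.
    let at = history.partition_point(|m| m.server_time <= new.server_time);
    history.insert(at, new);
    true
}
```

Keeping the list sorted is what lets the duplicate check look at a small window found by binary search instead of scanning the whole history.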

Going back briefly to the first main feature: I mentioned TARGETS uses a targets_marker to build its request. That is, it requests queries/channels that have a new message since the time specified in the targets_marker. The targets_marker is the same data as a read_marker, but it is updated whenever:

A TARGETS response is received in full, in which case it is set to the server_time of the last message in the batch.
A read_marker is updated on disk, in which case it is set to the same value as that read_marker.

The goal here is to be fairly conservative. Don't update the targets_marker too readily, in order to reduce the likelihood of missing a message. But also, don't be extremely conservative, otherwise we'll end up requesting messages for old queries (which results in reopening them in the client, even though nothing new has been sent). If no targets_marker exists, then the start of Unix epoch is used.
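
A small sketch of how the targets_marker could feed the TARGETS request, including the Unix epoch fallback and the two update rules. The `CHATHISTORY TARGETS` form follows the IRCv3 chathistory specification; `MarkerUpdate`, `next_marker`, `targets_request`, and passing the current time in as a string are hypothetical choices for this example:

```rust
/// Fallback when no targets_marker has been stored yet.
pub const UNIX_EPOCH_MARKER: &str = "1970-01-01T00:00:00.000Z";

/// The two events that update the targets_marker (hypothetical type).
pub enum MarkerUpdate {
    /// A TARGETS response was received in full.
    TargetsBatchComplete { last_server_time: String },
    /// A read_marker was written to disk.
    ReadMarkerSaved { read_marker: String },
}

/// New targets_marker value for a given update event.
pub fn next_marker(update: MarkerUpdate) -> String {
    match update {
        MarkerUpdate::TargetsBatchComplete { last_server_time } => last_server_time,
        MarkerUpdate::ReadMarkerSaved { read_marker } => read_marker,
    }
}

/// Ask for targets with activity between the marker (or the epoch) and `now`.
pub fn targets_request(targets_marker: Option<&str>, now: &str, limit: u16) -> String {
    let from = targets_marker.unwrap_or(UNIX_EPOCH_MARKER);
    format!("CHATHISTORY TARGETS timestamp={from} timestamp={now} {limit}")
}
```

On a first run with no stored marker, this asks for every target with history since the epoch, which is the conservative end of the trade-off described above.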

That's everything that comes to mind at the moment, so I think it might be best to stop here and field questions.

squidowl / halloy

`chathistory` Support #370