squidowl / halloy

IRC application written in Rust
GNU General Public License v3.0
1.18k stars 40 forks source link

`chathistory` Support #370

Open andymandias opened 1 month ago

andymandias commented 1 month ago

Work in progress PR to add IRCv3 CHATHISTORY support (i.e. #206). Since it's not a widely available feature yet, the goal is to minimize the any effects on Halloy's operation when CHATHISTORY is not available. Testing on ircs://irc.ergo.chat:6697/ to start.

Currently (more-or-less) implemented:

tarkah commented 1 month ago

@andymandias I'll look to review this very soon :eyes:

andymandias commented 1 month ago

Thanks @tarkah! It's definitely still a work in progress, in particular query buffers don't have chathistory support yet. There are also a number of smaller issues to work out (will try to add these to the list at the top of the PR). I am currently using it with soju (timestamp-based) and ircs://ergo.chat/#ergo (msgid-based) and it ~works.

Things I think particularly need reviewing:

andymandias commented 1 month ago

Brief addendum to note it looks like my next push will involve a moderate restructuring, so probably best to review after (hopefully tonight).

andymandias commented 1 month ago

I'll be chipping away at this still, but it's probably in a good spot for a WIP review now.

tarkah commented 1 month ago

I'm a bit unsure about all the API changes to history to support this feature. I need to internalize it all more, but it seems to me we shouldn't have so many new code paths. All we're doing is feeding messages to history like we would any other message that comes in, the only difference is we may want to splice it into place vs appending it.

So it seems to me the only API change of history needed would be the following:

It'd be great to sync up some time to discuss this over IRC so I can better understand the model at a high level as there's a lot to internalize. I'm also seeing async functions getting called in from main but w/out awaiting which is a no-op.

tarkah commented 1 month ago

@andymandias how does this feature interact w/ bouncers that send a replay buffer on connect? Are the bouncers programmed to only send one or the other, or will both get sent? If both are sent, do we just rely on our dedupe strategy to eliminate dupes, or do we have some other mechanism to handle this?

andymandias commented 1 month ago

@tarkah the ircv3 description of the specification says that the replay buffer should not be sent automatically when a client has negotiated chathistory. In my experience with soju that is the case (ZNC, which I've switched off of, does not support chathistory to my knowledge). I think we should be prepared to potentially get dupe messages around join time (even if they follow ircv3 specification, I think we could end up with some dupes if - for example - a message is sent to the channel right after we join but before the server receives our associated chathistory request). But, my feeling at the moment is that our dedupe strategy is sufficient for that purpose.

andymandias commented 1 month ago

Another restructuring, to better allow for using LATEST then repeated BETWEENs to update when joining a channel. The old scheme using repeated AFTER would fail to receive any messages when the reference message was no longer in the server's available message history. The restructuring should also make it easier to utilize the TARGETS subcommand, which I plan to work on next.

Should be a bit clearer in intent and use than before, but it doesn't change any of the problem areas of the PR (History and asynchronicity).

andymandias commented 1 month ago

Still testing, but at the moment this is what I consider feature complete. There are a couple additions that probably warrant explanation.

tarkah commented 2 weeks ago

@andymandias @casperstorm Let's find some time to discuss the scope and desired UX of this feature. The PR is now ~2k LOC and makes a lot of API changes and I don't want to dive in and start making suggestions or changes until we are all aligned on scope & UX.

casperstorm commented 2 weeks ago

@andymandias @casperstorm Let's find some time to discuss the scope and desired UX of this feature. The PR is now ~2k LOC and makes a lot of API changes and I don't want to dive in and start making suggestions or changes until we are all aligned on scope & UX.

We almost needed a RFC for this PR 😅 Perhaps, @andymandias, you could do a small writeup of some of this PR including some decisions you have made along the way. I would love to read something like that before digging into this monster.

andymandias commented 1 week ago

@casperstorm @tarkah I may not be able to commit much code to this PR for a bit, but I will do my best to explain the intended features along with an overview of the significant implementation decisions made in service of those features. I'm going to try and keep it relatively high level to avoid getting lost in the weeds, but will be available to answer any follow-up questions you may have.

As the PR currently stands, the main features are to use chathistory to do the following:

  1. When connecting to a server, get all new messages since the last login. This is intended to, as much as possible, "just work" (no user interaction necessary). This is currently implemented via three mechanisms:
    • A TARGETS request is made when chathistory support is acknowledged by the server. TARGETS is used to discover any queries that were made while the client was not connected, and then chathistory requests for the latest messages in those queries are made.
    • Any time a channel is joined, a request is made via chathistory for the latest messages in that channel.
    • A request for the latest message in any queries that are open on launch.
    • For all three mechanisms above, "latest messages" means all messages since the most recent message in the query/channel history (potentially requiring multiple chathistory requests). If there are no messages in the history, then a single LATEST chathistory request is made. The maximum number of messages of a request is set by the server, but I also set a client-side maximum at 500 messages (not attached to that particular number). TARGETS uses a targets_marker to set its request boundaries, more on that later.
  2. When the user reaches the end of a channel/query history, they can request older messages from that channel/query. Currently this is either done automatically (whenever the user scrolls to the very end of the history; there is an option to turn off this functionality), or manually (by clicking a button that has been placed at the very end of the history). This submits one chathistory request for messages in the channel/query before the earliest message that exists in its client-side history.

I'm not opposed to adding further features, but I wanted to keep the scope of this PR as minimal as possible.

There are two major changes to History to support these features:

  1. Messages are no longer stored in the order they are received. Instead the messages list is sorted based on the message's server_time, and new messages are inserted accordingly. Sorting is primarily done to enable deduplication, since the possibility of duplicate messages are an expected part of messages requested via chathistory. New messages are checked to be duplicates against messages with similar server_time, based on the message's id (if available) or the message's server_time and contents (otherwise, excepting some special cases). @tarkah deserves essentially all of the credit for this, and none of the blame for any aspect that may be broken.
  2. A History now has a read_marker instead of an opened_at. read_marker is essentially an opened_at that is written to the filesystem, and it serves a similar purpose. So, the backlog marker is placed via read_marker in a history in nearly the exactly manner as it was done via opened_at. But a read_marker allows for the determination of whether a message is "new" at the time it is inserted into a History::Partial. Since a History::Partial cannot be expected to have messages available for deduplication, duplication detection cannot be used as a proxy for message newness. (I don't think we want to load message history every time a new message is received, in order to check message newness.) Instead, the read_marker is checked against any arriving message's server_time in order to only trigger_unread when the server_time is newer than the read_marker. The read_marker is updated the exact same manner that opened_at was, except that it is also saved to disk. (This has the side benefit of persisting that information across program close/open.)
    • I added a new file associated with each server/channel/query history, in which the read_marker is written. I didn't want to store with the message history, since their main purpose is to avoid loading that history and I wasn't sure how to partially load a compressed history (or partially compress a history file). I took this as an opportunity to make the stored history naming less opaque (i.e. to name history files based on server/buffer.json.gz rather than a hash). Renaming the histories does have one function use; we can expect histories with the old naming schemes will not have been written sorted (so we should sort them on load), but histories with the new naming scheme will presumably be stored sorted (and won't need to be sorted on load). We could just sort everything though, and continue with hashes for all of the files.

Going back briefly to the first main feature: I mentioned TARGETS uses a targets_marker to build its request. That is, it requests queries/channels that have a new message since the time specified in the targets_marker. The targets_marker is the same data as a read_marker, but it is updated whenever:

The goal here is to be fairly conservative. Don't update the targets_marker too readily, in order to reduce the likelihood of missing a message. But also, don't be extremely conservative, otherwise we'll end up requesting messages for old queries (which results in reopening them in the client, even though nothing new has been sent). If no targets_marker exists, then the start of Unix epoch is used.

That's everything that comes mind at the moment, so I think it might be best to stop here and field questions.