Use node buffers and decode full lines

kiwiirc / irc-framework

🛠️ A better IRC framework for node.js. For bots and full clients.

MIT License

181 stars 64 forks source link

Use node buffers and decode full lines #198

Closed xPaw closed 5 years ago

xPaw commented 5 years ago

This is a change I noticed in #142 but I rewrote it myself (after looking at that PR, it doesn't appear to actually implement the buffering correctly)

This change will use buffers and will split it into lines from said buffer. iconv.decode is now called on a single IRC line instead of the whole data buffer passed from the socket (which I only assume could cause troubles on partial data).

I also no longer keep an array of lines and directly emit each split and decoded line. There's no need to keep an array because split is no longer used (it doesn't exist on Buffer either). This should lower memory consumption.

I also reset incoming_buffer when calling connect() instead of constructor, which should clear out any remaining data if the socket dies.

I tested it with ZNC playback where it feels a lot of data, and it works fine.

prawnsalad commented 5 years ago

The partial data has been a concern since this was put in place although I've never seen it reported that it's ever an issue. Happy to hear a better to handle it though.

This part is something that's been mentioned by a few people over the years though: this.incoming_buffer.indexOf('\n', startIndex). Some UTF8 (and most likely other encoding) characters do include the \n byte within them which would cause the splitting to occur half way through a character.

xPaw commented 5 years ago

do include the \n byte within them

That could be a problem on master as well though, right? EDIT: It probably won't be, as it decodes first.

I could search for \r\n, but that will break on networks that only send \n.

prawnsalad commented 5 years ago

Right, decoding happens first so searching \n will always specifically match a \n character. Also correct on point 2.

xPaw commented 5 years ago

I'm not sure how to solve this then. Ideas?

prawnsalad commented 5 years ago

The potential partial data issue? It's been a concern for several years and I'm not sure of a way around it despite a few more knowledgable encoding people looking over it over the years. Unless someone can mention a work around or reports of the potential issue actually occurring, then I can only leave it as it is.

xPaw commented 5 years ago

I asked in #ircv3, people over there just seem to split on \n over the raw buffer as well. I looked at hexchat code and didn't see anything interesting over there either.

Merge then?

prawnsalad commented 5 years ago

I really can't see any benefit with this PR, it's just switching a potential partial data issue with a slightly more likely potential issue with character splitting. There hasn't been any reports of the current implementation breaking either.

Am i missing something from the #ircv3 discussion yesterday?

xPaw commented 5 years ago

From what I gathered, I don't see the issue you are describing could ever arise.

If a client does happen to send such encoded data that happens to have 0x0A (\n) in there somewhere, the server would already split on that, and such a message would never reach other clients (without the server re-encoding the message at least).

Besides, what multibyte encoding can send 0x0A where it doesn't mean a new line? This can't happen in UTF-8.

xPaw commented 5 years ago

For reference why it can't happen in utf8: https://softwareengineering.stackexchange.com/questions/262227/why-does-utf-8-waste-several-bits-in-its-encoding

To add to that, it seems that Kiwi/irc-framework is alone to decode data from socket first. Other clients that I looked at all search for the new line character first.

EDIT: I looked at ZNC's code, and their ReadLine function also searches for new line character before doing any conversions. https://github.com/jimloco/Csocket/blob/e8d9e0bb248c521c2c7fa01e1c6a116d929c41b4/Csocket.cc#L2489-L2536

prawnsalad commented 5 years ago

After discussing and getting this explained clearer with some others that know way more about encoding than I, this seems fine to me now. Merging