[PATCH] UTF-8 characters corrupted once for every 32k text

irssibot commented 12 years ago

- - Issue #870 opened by @kcwu

irssi uses TEXT_BUFFER to store all text. Internally, TEXT_BUFFER keeps the text in smaller "chunks".

However, if a UTF-8 character is split across chunks, with its leading byte in one chunk and the remaining bytes in the next, the character is treated as corrupted and displayed incorrectly.

How to reproduce:

Repeatedly send multi-byte UTF-8 characters (characters that take more than 1 byte in UTF-8) to a channel.
Roughly once per 32k of text, some characters will be broken.
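
To illustrate the failure mode described above, here is a minimal sketch that counts how many fixed-size chunk boundaries land in the middle of a multi-byte character. It is not irssi code: `count_split_chars` and `is_continuation` are hypothetical helpers, and the chunk size is passed in as a parameter instead of using irssi's real chunk size.

```c
#include <stddef.h>

/* Returns 1 if c is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/* Hypothetical helper, not part of irssi: counts the chunk boundaries
   that fall inside a multi-byte character when `len` bytes of text are
   cut into pieces of `chunk` bytes with no regard for the encoding. */
static int count_split_chars(const unsigned char *text, size_t len, size_t chunk)
{
	int splits = 0;
	size_t pos;

	for (pos = chunk; pos < len; pos += chunk) {
		/* the byte that starts the next chunk must not be a continuation byte */
		if (is_continuation(text[pos]))
			splits++;
	}
	return splits;
}
```

With one ASCII byte followed by a run of two-byte characters, every even offset holds a continuation byte, so any even chunk size splits a character at each boundary. Pure ASCII text is never affected.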

irssibot commented 12 years ago

- - Attachment 282 by @kcwu

diff-irssi-utf8.txt

Index: fe-common/core/utf8.h
===================================================================
--- fe-common/core/utf8.h   (revision 5189)
+++ fe-common/core/utf8.h   (working copy)
@@ -12,5 +12,6 @@
 int mk_wcwidth(unichar c);

 #define unichar_isprint(c) (((c) & ~0x80) >= 32)
+#define is_utf8_leading(c) (((c) & 0xc0) != 0x80)

 #endif
Index: fe-text/textbuffer.c
===================================================================
--- fe-text/textbuffer.c    (revision 5189)
+++ fe-text/textbuffer.c    (working copy)
@@ -23,6 +23,7 @@
 #include "module.h"
 #include "misc.h"
 #include "formats.h"
+#include "utf8.h"

 #include "textbuffer.h"

@@ -157,6 +158,16 @@
        if (left > 0 && data[left-1] == 0)
            left--; /* don't split the commands */

+       /* don't split utf-8 character. (assume we can split non-utf8 anywhere. */
+       if (left < TEXT_CHUNK_USABLE_SIZE && !is_utf8_leading(data[left])) {
+           int i;
+           for (i = 1; i < 4 && left >= i; i++)
+               if (is_utf8_leading(data[left - i])) {
+                   left -= i;
+                   break;
+               }
+       }
+
        memcpy(chunk->buffer + chunk->pos, data, left);
        chunk->pos += left;

irssibot commented 11 years ago

- - Comment 1636 by @bazerka

This patch is broken and results in sporadic segfaults. See #875, #877.

irssibot commented 11 years ago

- - Comment 1638 by @kcwu

Interesting. My friend and I have used this patch for years without any crashes. Sorry for causing trouble to others. I will follow up on this issue.

irssibot commented 11 years ago

- - Comment 1649 by @kcwu

This is the revised patch.

irssibot commented 11 years ago

- - Attachment 291 by @kcwu

diff-irssi-utf8-2.txt

Index: src/fe-common/core/utf8.h
===================================================================
--- src/fe-common/core/utf8.h   (revision 5189)
+++ src/fe-common/core/utf8.h   (working copy)
@@ -12,5 +12,6 @@
 int mk_wcwidth(unichar c);

 #define unichar_isprint(c) (((c) & ~0x80) >= 32)
+#define is_utf8_leading(c) (((c) & 0xc0) != 0x80)

 #endif
Index: src/fe-text/textbuffer.c
===================================================================
--- src/fe-text/textbuffer.c    (revision 5189)
+++ src/fe-text/textbuffer.c    (working copy)
@@ -23,6 +23,7 @@
 #include "module.h"
 #include "misc.h"
 #include "formats.h"
+#include "utf8.h"

 #include "textbuffer.h"

@@ -154,6 +155,17 @@
         chunk = buffer->cur_text;
    while (chunk->pos + len >= TEXT_CHUNK_USABLE_SIZE) {
        left = TEXT_CHUNK_USABLE_SIZE - chunk->pos;
+
+       /* don't split utf-8 character. (assume we can split non-utf8 anywhere. */
+       if (left < len && !is_utf8_leading(data[left])) {
+           int i;
+           for (i = 1; i < 4 && left >= i; i++)
+               if (is_utf8_leading(data[left - i])) {
+                   left -= i;
+                   break;
+               }
+       }
+
        if (left > 0 && data[left-1] == 0)
            left--; /* don't split the commands */
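
The boundary adjustment in the revised patch can be read in isolation as the following sketch. The macro is copied from the patch; `utf8_safe_split` is a hypothetical standalone function written for illustration only and does not exist in irssi. It assumes `data` holds `len` bytes and `left` is the space remaining in the current chunk, with `left <= len`.

```c
/* Same test as in the patch: true for ASCII bytes and UTF-8 lead bytes,
   false for continuation bytes (10xxxxxx). */
#define is_utf8_leading(c) (((c) & 0xc0) != 0x80)

/* Hypothetical helper, not part of irssi: returns how many of the first
   `left` bytes of `data` can be copied without cutting a UTF-8 character
   in two. */
static int utf8_safe_split(const unsigned char *data, int len, int left)
{
	int i;

	/* data[left] is only a valid read when left < len */
	if (left < len && !is_utf8_leading(data[left])) {
		/* step back at most 3 bytes to find the start of the character */
		for (i = 1; i < 4 && left >= i; i++) {
			if (is_utf8_leading(data[left - i]))
				return left - i;
		}
	}
	/* boundary is already clean, or the data is not valid UTF-8 */
	return left;
}
```

For the bytes `"ab\xe4\xb8\xad"` (two ASCII letters and one three-byte character), a split at 3 or 4 is moved back to 2, while a split at 2 or 5 is left unchanged. If no leading byte is found within three bytes, the data is assumed not to be UTF-8 and the split point is kept as-is.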

irssibot commented 10 years ago

- - Comment 1665 by @staili

Irssi: Client: irssi 0.8.15 (20100403 1617)

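Tokavikan kirjan perusteella oletan että tyyppi saattaa tiet��ä mistä puhuu tossa vikassa. (Finnish: "Based on the second-to-last book, I assume the guy might know what he's talking about in the last one.")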

irssibot commented 10 years ago

- - Attachment 300 by @staili

skandit_sarki.png

irssibot commented 9 years ago

- - Comment 1675 by @henrisalo

This issue should be closed. Handled in https://github.com/irssi/irssi/pull/12

irssibot commented 9 years ago

- - Closed by @Geert

This task has been relocated to Github @ https://github.com/irssi/irssi/pull/12
