Wrong column width for wide (CJK) characters

xlucn commented 2 years ago

Describe the bug When the project name or description contain wide characters, e.g., in CJK languages, the column width ends up about half what the text actually needs. That is, as if the text are only treated as half-width characters: Screenshot_2022-07-08_21-40-24

I am not sure if this was related to #142, if yes, then the issue was not completely solved.

To Reproduce I think the test case below can reproduce it. Tested with https://github.com/vit-project/vit/commit/e37a75da8ff0856b9b7efd5c226741d2892ee175

Run the test script, which creates two tasks, in Chinese and English
See the columns have normal width
Delete the task with English description
See the columns shorten to half

Expected behavior The columns reserve enough space for wide characters.

Test case If reproducing your issue requires any TaskWarrior setup, please produce a test case script.

#!/bin/sh

set -e
set -u

TMP_DIR="$(mktemp --directory --suffix .vit)"
VIT_DIR="$TMP_DIR/vit"
TASK_DIR="$TMP_DIR/task"
TASKRC="$TMP_DIR/taskrc"

echo "
This script will create a dummy TaskWarrior database and VIT configuration.

Press enter to continue...
"
read -r _

echo "Creating dummy task configuration..."
mkdir "$TASK_DIR"
cat > "$TASKRC" <<EOT
data.location=$TASK_DIR
EOT
export TASKRC

echo "Creating dummy VIT configuration..."
mkdir "$VIT_DIR"
touch "$VIT_DIR/config.ini"

echo "Adding some dummy tasks..."
task add 测试中文组成的任务描述文字 project:一个项目名称
task add "Delete this task and see the column width changes" project:a_long_project

echo "Complete!

Copy and paste the following two export statements into your current
shell sesison to activate the dummy installation.

export TASKRC=${TASKRC}
export VIT_DIR=${VIT_DIR}

When complete, you can simply exit the current shell session, or copy and paste
the following 'unset' commands to revert the current shell session to the
TaskWarrior/VIT defaults.

unset TASKRC
unset VIT_DIR
"

thehunmonkgroup commented 2 years ago

Fixing this would be a lot of work, as widths are calculated all over the place, including in most individual formatters.

Urwid does provide a native function to determine the width of a character:

from urwid.util import str_util
str_util.get_width(ord(u'x'))

So I'm sure fixing this is possible, but without a PR from somebody else, or sponsorship, we probably won't see any progress.

xlucn commented 2 years ago

Thanks for the reply and pointing out the direction. I once created an issue in tremc with similar problem. But that was very complex work.

Maybe solving it for vit will be easier, if the only requirement is to correctly calculate the width of some text? What I mean is the "columns" only display the text (AFAIK), so the major work is to find where the width of a string is calculated, and to return the correct width of it.

I did some quick research and test, urwid's calc_width function can do that already. So I tried some changes (see below), only involving task descriptions. And it works so far:

Screenshot_2022-07-09_11-16-29

I am not an expert in Python, though. Is this a good solution?

diff --git a/vit/formatter/description_count.py b/vit/formatter/description_count.py
index f03ef9b..19dc63c 100644
--- a/vit/formatter/description_count.py
+++ b/vit/formatter/description_count.py
@@ -1,10 +1,11 @@
 from vit.formatter.description import Description
+from vit.util import unicode_len

 class DescriptionCount(Description):
     def format(self, description, task):
         if not description:
             return self.empty()
-        width = len(description)
+        width = unicode_len(description)
         colorized_description = self.colorize_description(description)
         if not task['annotations']:
             return (width, colorized_description)
@@ -14,7 +15,7 @@ class DescriptionCount(Description):

     def format_count(self, colorized_description, task):
         count_string = self.format_annotation_count(task)
-        return len(count_string), colorized_description + [(None, count_string)]
+        return unicode_len(count_string), colorized_description + [(None, count_string)]

     def format_annotation_count(self, task):
         return " [%d]" % len(task['annotations'])
diff --git a/vit/formatter/description_desc.py b/vit/formatter/description_desc.py
index 130661c..b3fdec5 100644
--- a/vit/formatter/description_desc.py
+++ b/vit/formatter/description_desc.py
@@ -1,8 +1,9 @@
 from vit.formatter.description import Description
+from vit.util import unicode_len

 class DescriptionDesc(Description):
     def format(self, description, task):
         if not description:
             return self.empty()
         colorized_description = self.colorize_description(description)
-        return (len(description), colorized_description)
+        return (unicode_len(description), colorized_description)
diff --git a/vit/util.py b/vit/util.py
index 5fb0f51..acc2d61 100644
--- a/vit/util.py
+++ b/vit/util.py
@@ -4,6 +4,8 @@ import curses
 import shlex
 from functools import reduce

+from urwid.str_util import calc_width
+
 curses.setupterm()
 e3_seq = curses.tigetstr('E3') or b''
 clear_screen_seq = curses.tigetstr('clear') or b''
@@ -55,3 +57,6 @@ def file_to_class_name(file_name):

 def file_readable(filepath):
     return os.path.isfile(filepath) and os.access(filepath, os.R_OK)
+
+def unicode_len(string):
+    return calc_width(string, 0, len(string))

As for the execution time, it's not that slow, shouldn't be noticeable for a screen-full of text:

In [31]: s = "中文"*500  # this should fill a screen of 80x25

In [32]: %timeit len(s)  # this time doesn't increase with string length
40 ns ± 0.319 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [33]: %timeit str_util.calc_width(s, 0, len(s))  # but this increases linearly
8.12 µs ± 249 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

thehunmonkgroup commented 2 years ago

Seems to work on a quick test. I like the idea of the simple unicode_len() wrapper as this will allow us to easily alter the underlying implementation should we need to, for e.g. performance enhancement.

I do have some concerns about performance, in which case I'm wondering if adding a config option to enable this type of calculation would be in order -- then the folks who need it can enable with a trade off in performance, and those who don't need it can leave it disabled. If we do decide this is a good approach, then we'd need to implement the wrapper a bit differently than a simple util function, as it will need access to the config object to determine its behavior.

thehunmonkgroup commented 2 years ago

As a quick reference, this is a list of files that use the len() function somehow. Quite a few of those uses are to calculate string lengths, but not all:

vit/application.py
vit/autocomplete.py
vit/base_list_box.py
vit/color.py
vit/command_bar.py
vit/config_parser.py
vit/formatter_base.py
vit/formatter/depends_count.py
vit/formatter/description_count.py
vit/formatter/description_desc.py
vit/formatter/description.py
vit/formatter/description_truncated_count.py
vit/formatter/description_truncated.py
vit/formatter/__init__.py
vit/formatter/markers.py
vit/formatter/project_parent.py
vit/formatter/project.py
vit/formatter/tags_count.py
vit/formatter/tags.py
vit/formatter/uda_date.py
vit/formatter/uda_duration.py
vit/formatter/uda_indicator.py
vit/formatter/uda_numeric.py
vit/formatter/uda_string.py
vit/help.py
vit/key_cache.py
vit/list_batcher.py
vit/multi_widget.py
vit/process.py
vit/readline.py
vit/task_list.py
vit/task.py
vit/util.py

xlucn commented 2 years ago

I do have some concerns about performance, in which case I'm wondering if adding a config option to enable this type of calculation would be in order -- then the folks who need it can enable with a trade off in performance, and those who don't need it can leave it disabled.

That makes sense if it affects performance badly. Here are more profiles of the performance. Basically, calculating width of one string of length ~10000, or one thousand short strings only take less than one millisecond (~100 µs). These could provide an estimate of the typical time usage if someone has 1000 tasks.

In []: %timeit len("test"*10)
32.4 ns ± 0.375 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In []: %timeit unicode_len("test"*10)
224 ns ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In []: %timeit unicode_len("中文"*10)
348 ns ± 2.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In []: %timeit unicode_len("中文"*10000)
162 µs ± 342 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

As a quick reference, this is a list of files that use the len() function somehow. Quite a few of those uses are to calculate string lengths, but not all:

Yes, those can be found by a simple grep command. I had already changed quite some of them in my fork, but might not have covered them all.

thehunmonkgroup commented 2 years ago

I'm not sure that count is accurate. My all report shows 4000 tasks, and I'm guessing the average 'length of all strings per task' would be pretty far above 10.

It would be helpful to know what columns will need this treatment, as clearly all will not.

xlucn commented 2 years ago

The absolute value may be different from a real user case, but it's just a scale estimation. The relative time, though, can be estimated to be about 10 * time using len.

More importantly, instead of looking at the absolute time, we should compare the time used by unicode_len function and the total time loading/rendering a report.

I tested with 4000 tasks (with vit 2.2.0 and the taskwarrior-test.sh script), the "Exec. time" reported in the header is 872 milliseconds (my computer has a AMD 4800H cpu and a fast NVMe SSD):

Screenshot_2022-07-13_16-26-18

I am not sure if vit counts any hidden columns, but calling unicode_len on those shown texts only uses 3.35 milliseconds, slightly more than 10 times of that for len, which is 0.24 milliseconds (I assume they are counted once):

In [13]: def test():
    ...:     for i in range(4000):
    ...:         unicode_len("一个项目名称")
    ...:         unicode_len("测试中文组成的任务描述文字测试中文组成的任务描述文字")
    ...:         unicode_len("5m")
    ...:

In [14]: %timeit test()
3.35 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: def test_len():
    ...:     for i in range(4000):
    ...:         len("一个项目名称")
    ...:         len("测试中文组成的任务描述文字测试中文组成的任务描述文字")
    ...:         len("5m")
    ...:

In [16]: %timeit test_len()
241 µs ± 924 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So this function won't be affecting the total rendering time that much.

It would be helpful to know what columns will need this treatment, as clearly all will not.

Yes, that will help. Some number related columns might not need this. However, if any column (I am not sure) can be customized, then those can be any text, too.

thehunmonkgroup commented 2 years ago

I'm 99% confident that it only calculates len for fields displayed in the report.

Can you get back to me w/ a final list of the columns that will need this treatment?

xlucn commented 2 years ago

I think the most obvious ones are description, project and tags, which are string-type columns accepting random user input.

I am not sure which other columns can be customized, like active.indicator. However, users can avoid wide characters when changing them.

For now, vit does not support locales, so other columns like dates and pre-defined strings in my opinion are not affected. Besides those, numeric columns are free from this problem as well.

thehunmonkgroup commented 2 years ago

If somebody wants to work a PR up making the adjustments to those columns, I'll have a look at it. Just make sure you have a look at all the formatters for each column type.

vit-project / vit

Wrong column width for wide (CJK) characters #327