Word and character counts for Chinese

Morick66 commented 8 months ago

novelWriter is very user-friendly and also has some parts translated into Chinese. I recently discovered this project and liked it very much, and I contributed some Chinese translations. However, there is an issue for Chinese users:

Chinese, unlike English and other alphabetic languages, does not have the concept of letters, and the smallest unit for word count in Chinese is a character. In the current version of novelWriter, the word count includes 'characters' and 'words'. In current usage, for counting words in Chinese, one needs to look at 'characters', as 'words' does not make sense for counting Chinese characters. Additionally, in the current counting feature, 'words' count as a sentence in Chinese. I hope novelWriter can enhance the experience for Chinese users by improving this aspect.

vkbo commented 8 months ago

Yes, I'm aware there are some issues here. I'm happy to improve CJK support in general, as long as someone can assist with how to best implement it.

Currently, word count is the primary metric used in various places, with character and paragraph as secondary metrics. I'm wondering if adding a CJK flag in Preferences that treats text stats differently is the way to go? Then we can redefine the GUI elements and various logic according to this.

I would need someone to define what changes this mode should make though, and where.

vkbo commented 8 months ago

Incidentally, I'm redesigning the Project Details tool as well, where I want to add some more text analysis features. I know CJK character count has been requested before, and it's one of the additions I have in mind.

Basically, I want to define a tool box of text analysis and stats tools that the user can select from to generate a report for a given novel folder. This is something where user contributions may be very helpful. I will create a framework for it, and adding language-specific ones shouldn't be a problem.

vkbo commented 8 months ago

novelWriter is very user-friendly and also has some parts translated into Chinese. I recently discovered this project and liked it very much, and I contributed some Chinese translations.

On a side note, if you have the time to update the remaining missing translations (6% of the text), I can include a complete translation in the 2.2.1 release I'm planning really soon. Only complete translations are updated.

Morick66 commented 8 months ago

Yes, it might be a good solution to set different character count rules for different languages in the preferences. Besides, during my use in these past few days, I have also found some Chinese translations that do not fit well in the Chinese context. I hope I can re-translate all the content into Chinese again.

Morick66 commented 8 months ago

I don't know if you are familiar with languages like Chinese and Japanese, but I can give you a simple example:

这是一个很好的软件，我很喜欢。

This sentence means: "This is a very good software, I like it very much."

In the common Chinese character count, this sentence contains 13 characters.

In English, each word is separated by a space, so when using a character count feature not suitable for Chinese, this sentence would show as having 2 words. However, it has 13 characters. And in Chinese, the count of words is usually not considered. Perhaps in Chinese and similar languages, character counting doesn’t need to count words, but just the number of characters. In the current novelWriter, the 'characters' count is actually the number of Chinese characters, but the count for words is not the number of vocabulary words, but the number of sentences.

For example,

“喜欢 (like)” is one Chinese word, but it contains two characters,
“这 (this)” is also one Chinese word, but it contains one character.
“我很喜欢 (I like it very much)” is a Chinese sentence, and not a single “word.”

vkbo commented 8 months ago

Thanks for the clarification. I have a few questions to follow up:

Can we treat Chinese, Japanese and Korean in the same way here? Such that we can have a CJK option that alters novelWriter's behaviour if enabled?
Character count is clearly the primary metric. Is the current way characters are counted also usable for CJK, or do we need a different method?
Can we display the current "word count" as a sentence count? Is that a useful parameter?
Is the paragraph count still valid? It counts double line breaks.
What three counts can we show in the panel below the project tree where it now shows characters, words and paragraphs?

I propose that CJK mode will use character counts instead of word counts in the project tree, status bar, and in the writing statistics. Any other changes we need to make?

I would also like some input from @longqzh here too, the original translator of the Chinese translation.

vkbo commented 8 months ago

Yes, it might be a good solution to set different character count rules for different languages in the preferences. Besides, during my use in these past few days, I have also found some Chinese translations that do not fit well in the Chinese context. I hope I can re-translate all the content into Chinese again.

Currently, @longqzh has access to approve Chinese translations. @hebekeg is in charge of the Japanese translation.

Morick66 commented 8 months ago

For Chinese, there are a few concepts that might need some explanation:

'Word Count' (字数): The number of Chinese characters.
'Character Count' (字符数): The total number of characters, including Chinese characters and symbols.

Regarding your questions:

Japanese and Korean, like Chinese, are also based on character count for word statistics. However, I'm not particularly familiar with the specific word counting rules in Japanese since it uses both Kana and Kanji. Native speakers of Japanese and Korean might offer better suggestions on this matter.
As a native Chinese speaker, in all the editors I've used, the primary metric for Chinese text has been 'Word Count' (字数), with 'Character Count' (字符数) as a secondary metric. The current paragraph and line count in novelWriter could perhaps be an added feature.
Displaying 'Word Count' as 'Sentence Count' might not be the best approach. Strictly speaking, in Chinese, a sentence is primarily defined by the ending punctuation marks like periods and question marks, with commas (，) dividing what might be considered half a sentence.
There's no issue with the way paragraph count is measured; using double line breaks is a good method. After all, in Markdown syntax, a single line break is a line break, and a double line break creates a new paragraph.
In Microsoft Word, each Chinese character and each symbol is counted as one character. 'Character Count' (字符数) might also be a metric to consider: the total number of Chinese characters and symbols. In the panel below the project tree, displaying word count, character count, and paragraphs might be a good approach.
In the project tree, it would be better to display word count, character count, and paragraphs, while in the status bar and writing statistics, showing the word count would be more appropriate.
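
For reference, the paragraph rule discussed here (a paragraph is a block of text separated by a double line break) can be sketched roughly as follows. This is only an illustration, not novelWriter's actual counter; the function name `count_paragraphs` is made up for this example.

```python
import re


def count_paragraphs(text: str) -> int:
    """Count blocks of text separated by one or more blank lines."""
    # A blank line is a line break, optional whitespace, then another line break.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Ignore empty blocks, so an empty document has zero paragraphs.
    return sum(1 for block in blocks if block.strip())


text = "第一段。\n仍然是第一段。\n\n第二段。"
print(count_paragraphs(text))
```

A single line break stays inside the paragraph, so the sample above is treated as two paragraphs regardless of the language of the text.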

Morick66 commented 8 months ago

If the user is involved in submitting their work after completion, different submission platforms have different counting standards. Some platforms count the word count (only Chinese characters), while others count the character count (including Chinese characters and symbols). Therefore, for Chinese users, 'Word Count' and 'Character Count' might be the primary indicators for text statistics.

Morick66 commented 8 months ago

The regular expression for pure Chinese characters (excluding symbols) is as follows: [\u4e00-\u9fa5].
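
As a rough illustration of how this pattern behaves, here is a minimal Python sketch applied to the example sentence from earlier in the thread. The names `HAN_PATTERN` and `count_han` are hypothetical, and the pattern only covers the range quoted above, so it is not a complete CJK counter.

```python
import re

# Pattern quoted in the comment above; it covers U+4E00..U+9FA5 only.
HAN_PATTERN = re.compile(r"[\u4e00-\u9fa5]")


def count_han(text: str) -> int:
    """Count characters matched by HAN_PATTERN, ignoring symbols and spaces."""
    return len(HAN_PATTERN.findall(text))


sample = "这是一个很好的软件，我很喜欢。"
print(count_han(sample))  # Chinese characters only ('Word Count', 字数)
print(len(sample))        # all characters, including ， and 。 ('Character Count', 字符数)
```

For this sentence the first number is 13 and the second is 15, since the two punctuation marks fall outside the pattern.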

Morick66 commented 8 months ago

In the course of using it, I encountered another issue, which is that the Chinese input method has a candidate box. In novelWriter, this candidate box obscures the existing input content, like this: [screenshot not preserved] The correct display should be like this (as in Microsoft Word): [screenshot not preserved]

vkbo commented 8 months ago

In the course of using it, I encountered another issue, which is that the Chinese input method has a candidate box. In NovelWriter, this candidate box obscures the existing input content

There is no code in novelWriter to handle this. It likely comes from the Qt library or the system itself. What operating system are you using?

vkbo commented 8 months ago

'Word Count' (字数): The number of Chinese characters.

Is this what's sometimes called CJK count?

Morick66 commented 8 months ago

In the course of using it, I encountered another issue, which is that the Chinese input method has a candidate box. In NovelWriter, this candidate box obscures the existing input content

There is no code in novelWriter to handle this. It likely comes from the Qt library or the system itself. What operating system are you using?

Windows 10.

Morick66 commented 8 months ago

'Word Count' (字数): The number of Chinese characters. Is this what's sometimes called CJK count?

It should be. Currently in novelWriter, 'characters' is counted as 'Character Count' (字符数): the total number of characters, including Chinese characters and symbols.

vkbo commented 8 months ago

In the course of using it, I encountered another issue, which is that the Chinese input method has a candidate box. In NovelWriter, this candidate box obscures the existing input content

There is no code in novelWriter to handle this. It likely comes from the Qt library or the system itself. What operating system are you using?

Windows 10.

I plan to switch novelWriter to Qt6 some time this year. Probably in release 2.4. Hopefully that will help, but a bit of searching reveals the positioning of the candidate box is a problem for many apps. This requires a bit of research. I suggest you make a bug report from your comment so we can track it separately.

vkbo commented 8 months ago

'Word Count' (字数): The number of Chinese characters. Is this what's sometimes called CJK count?

It should be. Currently in novelWriter, 'characters' is counted as 'Character Count' (字符数).

Great. Then it may be as simple as adding an option to change the word count algorithm to use CJK count instead. I know how to make such a counter.

Would you still want to label it "Word Count" in the English translation on the user interface, or would it be clearer if it said "CJK Count"?

Also, should the CJK Count vs Word Count setting be a per project setting, or for the whole app? Either is possible.

Morick66 commented 8 months ago

I consulted ChatGPT, and it might be an issue with the Qt5 package. Here is its response:

I found information that suggests issues with the position of the input method candidate window in Qt applications are not uncommon. These issues seem to occur across different platforms and environments, and can be related to several factors including high DPI settings, the version of the underlying system libraries, or specific configurations in the Qt environment.

One issue that was reported involved the candidate window appearing at an incorrect position in relation to the input control, particularly in high DPI settings. The problem seemed to resolve when the scale setting was changed to 100%, indicating that scaling factors can influence the candidate window's position. This issue was not specific to Electron-based applications but was observed in various environments including Ubuntu and Fedora, under different desktop environments like GNOME and KDE.

Another reported issue was with the IBus input method in a Qt-based terminal widget. The candidate window position was not updating correctly due to the absence of specific code that would update the IM's cursor position. The problem was resolved by adding code to ensure the IM's cursor position was updated appropriately when the output changed.

These cases suggest that the issue you're experiencing could be related to the way the Qt environment interacts with the input method framework and the specific configurations or scaling settings of your system. It might be helpful to check the scaling settings, look into how the input method's cursor position is being managed in the application, and ensure all relevant libraries and frameworks are up to date. Additionally, considering the input method's compatibility with the version of Qt being used could provide some insights.

vkbo commented 8 months ago

ChatGPT is not a reliable source. It is mixing up a bunch of different issues in that answer too. I'm familiar with the Linux issues with positioning. I will do some proper research.

Morick66 commented 8 months ago

ChatGPT is not a reliable source. It is mixing up a bunch of different issues in that answer too. I'm familiar with the Linux issues with positioning. I will do some proper research.

Thank you very much, I am not sure if this issue exists in the Linux system, as I am using the Windows 10 operating system.

vkbo commented 8 months ago

The Linux issue mostly has to do with determining the (0,0) coordinate of the screen on multi-screen setups where the monitors have different resolution or pixel scaling. AFAIK this is related to Wayland. This is much better supported in Qt6.

On Windows, I'm not sure what the issue is. But in general, the way Qt computes coordinates for pop-up boxes, including menus, depends on getting information from the window manager in the operating system. The Qt5 library is not up to date with various such features, including high DPI scaling, light/dark OS settings, etc. A lot of this has been fixed in Qt6.

Anyway, please make a proper bug ticket on this so I can track it as a bug. It will be lost here as this one is about the word count.

longqzh commented 8 months ago

@ruixuan658, @vkbo

As a native Chinese and Korean speaker, I'd like to define some concepts in order to avoid confusion.

Conception A: character count with blanks
- the total number of characters, symbols and blanks.
Conception B: character count without blanks
- the total number of characters and symbols, excluding blanks.
Conception C: word count
- the number of blocks separated by blanks.
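
To make the three definitions concrete, here is a small Python sketch. The function names are invented for this illustration and say nothing about how novelWriter counts internally; "blank" is taken to mean any whitespace character, including line breaks.

```python
def chars_with_blanks(text: str) -> int:
    """Every character, symbol and blank in the text."""
    return len(text)


def chars_without_blanks(text: str) -> int:
    """Characters and symbols only; whitespace is skipped."""
    return sum(1 for ch in text if not ch.isspace())


def whitespace_blocks(text: str) -> int:
    """Number of blocks separated by blanks (the 'word count' above)."""
    return len(text.split())


korean = "나는 이것을 좋아한다"
chinese = "我很喜欢。 这很好。"

for sample in (korean, chinese):
    print(chars_with_blanks(sample), chars_without_blanks(sample), whitespace_blocks(sample))
```

The Korean sample yields three blocks, one per word, while the Chinese sample yields two blocks, one per sentence, which is why the block count is not a word count for Chinese.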

Can we treat Chinese, Japanese and Korean in the same way here?

No, because in Korean there are blanks between words, but in Chinese there are no blanks between words. By the way, it's VERY HARD to get the word count in Chinese. If you want to split the Chinese words in a sentence properly, you have to use a specialised library like jieba.

So Conception C makes no sense in Chinese. And currently Word Count in novelWriter effectively means the number of sentences, because blanks only occur between Chinese sentences :)

Character count is clearly the primary metric. Is the current way characters are counted also usable for CJK, or do we need a different method?

For Chinese, Conception A and Conception B are enough. For Korean, Conception C is also necessary. I don't think we need any other new method.

Can we display the current "word count" as a sentence count? Is that a useful parameter?

If we don't change the "word count" algorithm, in Chinese it naturally becomes a sentence count. But I don't think it's a useful parameter; I have never seen it in other common editors (though I'm not a user of professional editors).

Is the paragraph count still valid? It counts double line breaks.

I'm not sure. To be honest, maybe we should require users to put double line breaks between paragraphs.

What three counts can we show in the panel below the project tree where it now shows characters, words and paragraphs?

Characters and paragraphs should be shown in the panel. If possible, maybe we should distinguish Conception A and Conception B for characters.
Words aren't easy to calculate in Chinese, so I don't think we should show them.

Generally speaking, I prefer to just show the character count in Chinese and hide the word count. For Korean and Japanese, treat them as English.

vkbo commented 8 months ago

Can we treat Chinese, Japanese and Korean in the same way here?

No, because in Korean there are blanks between words, but in Chinese there are no blanks between words. By the way, it's VERY HARD to get the word count in Chinese. If you want to split the Chinese words in a sentence properly, you have to use a specialised library like jieba.

I really want to avoid adding an external library here. Both because I like to avoid dependencies whenever possible, and because the word count algorithm is performance critical, and I need to have control over how it's optimised.

So my real question is, will a CJK character count suffice? It's a fairly easy counter to implement.

For Chinese, Conception A and Conception B are enough. For Korean, Conception C is also necessary. I don't think we need any other new method.

How necessary? The CJK count will ignore spaces. I propose to use CJK count as a drop-in replacement for word count, and just remove word count when it is enabled.

Basically I would like to add a new setting in Preferences called "Primary Counter" or something, with the following options:

Latin Word Count
CJK Character Count

Generally speaking, I prefer to just show the character count in Chinese and hide the word count. For Korean and Japanese, treat them as English.

With an option like above, the user could select their preferred counter. I'm also working on adding text analysis tools which will include a bunch of different counter methods, so CJK count will always be available, but this is mostly about which metric is used for collecting statistics and displaying in the project tree.

vkbo commented 8 months ago

So my real question is, will a CJK character count suffice? It's a fairly easy counter to implement.

We also need to discuss how such a counter is implemented. Python uses utf-8, so the absolute simplest option is to count Unicode values between \u4e00 and \u9fff. I'm not sure if that is sufficient, or more ranges should be included?

See Wikipedia: CJK Unified Ideographs

Edit: There is also the list at the bottom of this wiki page. I am not familiar enough with these languages to know what would be the best approach, or if this is even the right track.
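
A range-based counter of the kind described here could look something like the sketch below. The names `CJK_RANGES` and `count_cjk` are hypothetical, and the particular set of ranges is only an assumption for illustration; which Unicode blocks should actually be included is the open question in this thread.

```python
# Assumed ranges for illustration; the right set of blocks is still undecided.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def count_cjk(text: str, ranges=CJK_RANGES) -> int:
    """Count characters whose code point falls inside any of the given ranges."""
    return sum(
        1 for ch in text
        if any(low <= ord(ch) <= high for low, high in ranges)
    )


sample = "这是一个很好的软件，我很喜欢。"
print(count_cjk(sample))                         # punctuation is outside all ranges
print(count_cjk(sample, ((0x4E00, 0x9FFF),)))    # the simplest single-range option
```

Keeping the ranges in a plain tuple means more blocks (for example kana or Hangul) could be added or removed without touching the counting loop.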

longqzh commented 8 months ago

@vkbo

So my real question is, will a CJK character count suffice? It's a fairly easy counter to implement.

For Chinese and Korean, character count is enough. I also think word counting is pointless, especially given the complexity of the implementation :)

How necessary? The CJK count will ignore spaces. I propose to use CJK count as a drop-in replacement for word count, and just remove word count when it is enabled.

I agree with you. Character count is much more important than word count.

vkbo commented 8 months ago

Just to clarify, CJK character count is not the same as the current character count.

longqzh commented 8 months ago

We also need to discuss how such a counter is implemented. Python uses utf-8, so the absolute simplest option is to count Unicode values between \u4e00 and \u9fff. I'm not sure if that is sufficient, or more ranges should be included?

To clarify, Python uses Unicode to represent all characters inside the Python virtual machine (see the official docs). When we output the text (save to disk, send over the network, etc.), we need to encode it to a given encoding (ascii, utf-8, ...). So please recheck it 😄
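
The distinction between code points and encoded bytes can be checked with a few lines of Python. This is just a small demonstration using the standard `str` methods; the variable name `ch` is arbitrary.

```python
ch = "这"

# A Python 3 str is a sequence of Unicode code points.
print(len(ch))        # number of code points
print(hex(ord(ch)))   # the code point itself

# The byte length only appears once the text is encoded.
print(len(ch.encode("utf-8")))      # bytes in UTF-8
print(len(ch.encode("utf-16-le")))  # bytes in UTF-16 (little endian, no BOM)
```

Because range checks with `ord()` operate on code points, a counter based on them gives the same result whichever encoding is used to store the text.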

vkbo commented 8 months ago

So please recheck it 😄

Python 3 strings default to utf-8 always. This is what the count is run on. This is not an issue.

The question is what Unicode ranges we need to check for CJK count.

vkbo commented 8 months ago

In any case, I have full control over the encoding in novelWriter. Both novelWriter and the Qt framework use Unicode (utf-8 and utf-16 respectively). What I need input on is what Unicode ranges will cover the correct symbols needed to make a proper count. I don't know the differences between the ones listed on that wiki page.

vkbo / novelWriter

Word and character counts for Chinese #1657