oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

OpenGrok should support source files with multibyte characters (e.g. japanese etc.) (Bugzilla #437) #595

Open vladak opened 11 years ago

vladak commented 11 years ago

Status: ACCEPTED. Severity: enhancement. Component: misc. Reported in version: unspecified. Platform: Other. Assigned to: Knut Anders Hatlen.

On 2008-01-30 13:20:52 +0000, Roland Mainz wrote:

RFE: OpenGrok should support source files with multibyte characters (e.g. Japanese), for example C code encoded in ja_JP.UTF-8 (Sun Studio supports this via the "-xcsi" option) or shell scripts in such languages (for example, ksh93 supports function and variable names containing multibyte characters).

Notes for the implementation:

  • This isn't very hard: technically you pass the character encoding to the matching Java stream converter and you're done (figuring out the encoding may not always be easy, see below).
  • Figuring out a source file's character encoding should work like this (ordered the way the lookup code should try each source; see the sketch after this list):
      • There should be a way to define the character encoding on a per-file basis.
      • Some SCM systems, like Subversion, support extra properties such as "mimetype" and "charset"; AFAIK it may be a good idea to grab these values if possible.
      • If no specific information is available, use the character encoding of the system's current default locale (or alternatively assume UTF-8, since ASCII can be treated as a subset of UTF-8; at least UTF-8 encoded source files would then work, and existing ASCII source files wouldn't break).
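
A minimal sketch of what that lookup order could look like in Java (the EncodingLookup class, its per-file override map, and the scmCharsetProperty hook are hypothetical helpers for illustration, not OpenGrok API):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper illustrating the proposed lookup order. */
public final class EncodingLookup {

    private final Map<Path, Charset> perFileOverrides;   // 1) explicit per-file configuration

    public EncodingLookup(Map<Path, Charset> perFileOverrides) {
        this.perFileOverrides = perFileOverrides;
    }

    public Charset charsetFor(Path file) {
        // 1) an explicit per-file setting wins
        Charset override = perFileOverrides.get(file);
        if (override != null) {
            return override;
        }
        // 2) ask the SCM (e.g. a Subversion "charset"/"mime-type" property) if available
        Optional<String> scmCharset = scmCharsetProperty(file);
        if (scmCharset.isPresent()) {
            return Charset.forName(scmCharset.get());
        }
        // 3) fall back to the default locale's encoding, or simply assume UTF-8,
        //    since plain ASCII files are valid UTF-8 as well
        return StandardCharsets.UTF_8;
    }

    /** Placeholder: a real implementation would query the SCM layer. */
    private Optional<String> scmCharsetProperty(Path file) {
        return Optional.empty();
    }
}
```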

On 2008-02-13 01:07:49 +0000, Knut Anders Hatlen wrote:

* Bug 508 has been marked as a duplicate of this bug. *

On 2008-06-17 01:57:20 +0000, Matthieu FEREYRE wrote:

This bug is still present in OpenGrok version 0.6.1. Could you give an estimate of when it will be resolved?

On 2008-06-17 02:17:02 +0000, Knut Anders Hatlen wrote:

No one has volunteered to implement this yet. Feel free to post a patch! :)

On 2008-10-16 13:28:04 +0000, wrote:

Would this work? http://mail.opensolaris.org/pipermail/opengrok-discuss/2007-June/000834.html

On 2008-10-16 13:49:20 +0000, Roland Mainz wrote:

(In reply to comment # 4)

Would this work? http://mail.opensolaris.org/pipermail/opengrok-discuss/2007-June/000834.html

No, "multibyte" means more than Unicode/UTF-8. And not all UTF-8 encoded files have a BOM or similar identifiers.

On 2008-10-16 15:09:01 +0000, Knut Anders Hatlen wrote:

I think we need to do something like this (each step could be done separately):

1) Make the analyzers use Readers instead of InputStreams to move away from the assumption that char==byte, and use the system's default encoding for all files

2) Teach all the analyzers that they shouldn't ignore characters whose code-point is >=127 (something similar to what's described in bug # 508, comment # 1)

3) Provide ways to specify encoding (per project, per file, auto-detected, etc.); a rough sketch of step 1 follows below
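
Roughly what step 1 amounts to, as a sketch (the AnalyzerInput helper below is illustrative, not the actual OpenGrok analyzer API): decode bytes once, up front, with an explicit charset, so the analyzers only ever see characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public final class AnalyzerInput {

    /**
     * Byte-oriented input forces every analyzer to guess how bytes map to
     * characters; a Reader created with an explicit Charset does the
     * decoding once, up front.
     */
    public static Reader openSource(InputStream raw, Charset charset) {
        // new InputStreamReader(raw) without a charset would silently use the
        // platform default encoding -- exactly the assumption step 3 wants to
        // make configurable.
        return new BufferedReader(new InputStreamReader(raw, charset));
    }

    public static void dumpCodeUnits(Reader r) throws IOException {
        int c;
        while ((c = r.read()) != -1) {
            // c is a decoded UTF-16 code unit here, not a raw byte, so
            // multibyte characters survive the trip intact.
            System.out.printf("U+%04X%n", c);
        }
    }
}
```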

On 2008-10-16 16:07:39 +0000, Roland Mainz wrote:

(In reply to comment # 6) [snip]

2) Teach all the analyzers that they shouldn't ignore characters whose code-point is >=127 (something similar to what's described in bug # 508, comment # 1)

Erm... this is not correct for shell scripts. At least ksh93 explicitly supports non-ASCII variable and function names. And Sun's Studio compiler supports non-ASCII identifiers using the -xcsi option, allowing multibyte characters in ISO C code.

AFAIK in both cases you want to use "character classes" (e.g. see ksh93(1), tr(1) or egrep(1)) in the parser code to determine the character class (e.g. "alpha", "alnum", "punct") of a character.
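
A hedged sketch of what "character classes" could translate to on the Java side: classify code points with Character's Unicode-aware predicates (roughly the analogue of the POSIX alpha/alnum classes) instead of hard-coding ASCII ranges. The identifier rule below is illustrative and does not claim to match ksh93's or ISO C's exact rules:

```java
public final class IdentifierScanner {

    /** Roughly [:alpha:]-like: letters from any script, plus '_'. */
    static boolean isIdentifierStart(int cp) {
        return Character.isLetter(cp) || cp == '_';
    }

    /** Roughly [:alnum:]-like continuation characters. */
    static boolean isIdentifierPart(int cp) {
        return Character.isLetterOrDigit(cp) || cp == '_';
    }

    /** Prints every identifier-looking token, ASCII or not. */
    public static void scan(String line) {
        int i = 0;
        while (i < line.length()) {
            int cp = line.codePointAt(i);
            if (isIdentifierStart(cp)) {
                int start = i;
                while (i < line.length() && isIdentifierPart(line.codePointAt(i))) {
                    i += Character.charCount(line.codePointAt(i));
                }
                System.out.println(line.substring(start, i));
            } else {
                i += Character.charCount(cp);
            }
        }
    }

    public static void main(String[] args) {
        scan("int 変数名 = größe + count;");   // prints: int, 変数名, größe, count
    }
}
```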

On 2010-01-27 14:35:34 +0000, pooh wrote:

Please note that contrary to what the Description suggests, this affects many more natural languages than "just" those using Unicode (>255). In fact, only 7-bit English and possibly a few other natural languages are currently supported, because ASCII codes >127 are unceremoniously thrown away. This affects most European languages with diacritics.

Not just that, but some programming languages (e.g. APL and some derivatives) also use operators in the 128-255 range, which makes not only comments but also code unreadable.

To me it looks like point 2) as described in comment # 6 is low-hanging fruit that would make many people happy:

2) Teach all the analyzers that they shouldn't ignore characters whose code-point is >=127

Finally, I don't understand comment # 7.

Thanks.

On 2010-03-30 18:52:38 +0000, Knut Anders Hatlen wrote:

The change suggested as step 1 in comment # 6 was performed a while ago as part of another fix. I just now checked in code for step 2 (changeset efff150e83ce). Now non-ascii characters should be visible in the xrefs. However, nothing is (yet) done to recognize symbols with non-ascii characters. Also, text files are still assumed to be encoded with the default system encoding (whatever java.util.Locale.getDefault() returns).

craigkovatch commented 8 years ago

What's the status on this? I can't tell if it's supposed to be working or not.

kahatlen commented 8 years ago

I think the comment from 2010-03-30 is still accurate:

Now non-ascii characters should be visible in the xrefs. However, nothing is (yet) done to recognize symbols with non-ascii characters. Also, text files are still assumed to be encoded with the default system encoding (whatever java.util.Locale.getDefault() returns).

tarzanek commented 8 years ago

Also, #1037 tests this problem but does not fix it for now; you need to set the proper locale before running the indexer ...
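
For what it's worth, a small sketch of how to check which locale and default encoding the indexer JVM would fall back to before running it (plain JDK calls; the actual indexer invocation is out of scope here):

```java
import java.nio.charset.Charset;
import java.util.Locale;

public final class EncodingCheck {
    public static void main(String[] args) {
        // Text files are read with the platform default encoding, so a
        // non-UTF-8 default (e.g. from a POSIX/C locale) will mangle
        // multibyte sources. Launching the JVM with -Dfile.encoding=UTF-8
        // or with a UTF-8 locale set in the environment changes this default.
        System.out.println("default locale : " + Locale.getDefault());
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
    }
}
```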

craigkovatch commented 8 years ago

Am I understanding correctly that if the system locale is set correctly, the indexes will be built in a sane way, but they are not searchable?

tarzanek commented 8 years ago

They should also be searchable from the browser with the proper encoding.

The original idea is to be UTF-8 compatible everywhere, so subsets of UTF-8 should already play nicely once a UTF-8 locale is set.

That said, this approach needs more testing. As you can see, Japanese characters fall outside this concept. Which characters are giving you a problem? Do you have a sample file for tests?

tarzanek commented 7 years ago

There is a test for this in #1037; we shall get this fixed somehow.