ocaml-batteries-team / batteries-included

Batteries Included project
http://ocaml-batteries-team.github.com/batteries-included/hdoc2/
Other
519 stars 106 forks source link

BatRope.icompare #148

Closed yoriyuki closed 13 years ago

yoriyuki commented 13 years ago

Reading batRope.ml, I found that BatRope.icompare is implemented such a way that it converts each character to its lowercase individually, then compare it.

However this does not work if

  1. the case map is not 1 character to 1 character. An example is German eszett.
  2. the case map is locale sensitive. An example is Turkish dotted and dot-less "i" and there could be more examples.

The way of case-less comparison recommended by Unicode Standard is use case-folding algorithm. Camomile implements it as compare_caseless function in CaseMap module.

http://camomile.sourceforge.net/dochtml/CamomileLibrary.CaseMap.Type.html#VALcompare_caseless

yoriyuki commented 13 years ago

BatRope.lowercase and BatRope.uppercase have the same issue.

thelema commented 13 years ago

Thank you for noticing this. I've changed to using the casefolding function instead of the lowercase function from CaseMap, but I'm still having problems.

# module C = BatCamomile.CaseMap.Make(BatCamomile.UTF8);;
...
# C.casefolding "FOO";;
- : C.text = "foo"
# C.casefolding "foo";;
- : C.text = ""
# C.compare_caseless "foo" "FOO";;
- : int = -1

I don't understand the issue with Rope.lowercase/uppercase - are you saying they should use case folding, or that they need to have a parameter for locale?

Thanks for your help improving batteries.

yoriyuki commented 13 years ago

Well, this is a bug of Camomile. Thank you for pointing out. I will release the bug fix soon.

# module C = BatCamomile.CaseMap.Make(BatCamomile.UTF8);;
...
# C.casefolding "FOO";;
- : C.text = "foo"
# C.casefolding "foo";;
- : C.text = ""
# C.compare_caseless "foo" "FOO";;
- : int = -1
yoriyuki commented 13 years ago

Concerning BatRope.uppercase and BatRope.lowercase, they are implemented using bulk_fold. Unfortunately, Greek Sigma changes its lowercase character depending on the context, so it would cause the rare and hard to find problem.

Since case mapping algorithm is defined as a functor over modules which implement UnicodeString.Type(http://camomile.sourceforge.net/dochtml/CamomileLibrary.UnicodeString.html), I would suggest that you make a subset of Rope which satisfies that interface, then pass it to CaseMap.Make functor to obtain case mapping functions. In this way, you can have, say, collation easily too.

thelema commented 13 years ago

I push these fixes back on Ulib, as we're planning on depending on it for unicode text.