typelevel / paiges

an implementation of Wadler's a prettier printer
Apache License 2.0
186 stars 30 forks source link

issue with unicode #187

Open johnynek opened 4 years ago

johnynek commented 4 years ago

we currently use String.length to see how many columns we have moved. e.g.

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Chunk.scala#L108

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Doc.scala#L299

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Doc.scala#L579

But, for characters that don't fit in 16 bits, this will be incorrect. A possibly better, but more expensive, solution is to use: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int)

But there is also the question of halfwidth/fullwidth forms: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

So, to be really pro-style, we need a way to take a String and return the width on the screen when it is printed.

johnynek commented 4 years ago

looks like this may work:

https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html

from: https://engineering.linecorp.com/en/blog/the-7-ways-of-counting-characters/

public static int getGraphemeLength(String value) {
    BreakIterator it = BreakIterator.getCharacterInstance(); 
    it.setText(value); 
    int count = 0; 
    while (it.next() != BreakIterator.DONE) { 
        count++; 
    }
    return count;
}

We could have the render method take a trait StringLength { def apply(s: String): Int } and have three implementations which are probably in increasing cost: s.length s.countCodePoint(0, s.length) and the above.

If you know you have ascii, you could use the first algorithm...

Maybe we should just benchmark this and see what the cost is for ascii, and if doing the correct thing doesn't totally blow up costs, we can use it.

Since the text() constructor is already doing a traversal to look for \n, maybe the cost won't be significant...