vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0

[FeatureRequest] Have methods on CharCollection and friends that actually do codepoint conversion #204

Closed techsy730 closed 3 years ago

techsy730 commented 3 years ago

CharCollection (and friends) have methods like intIterator, which widen the chars to ints for iteration (and ultimately for an intStream). However, these methods do not do codepoint conversion; they leave surrogate pairs as separate elements. See https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html and https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary
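To make the difference concrete, here is a minimal sketch that uses only the JDK. It assumes that String.chars() is a fair stand-in for the one-int-per-char widening described above; the class and method names are made up for this example and are not part of fastutil.

```java
import java.util.Arrays;

public class WideningVsCodePoints {
    // U+1F600 lies outside the BMP, so UTF-16 stores it as the surrogate pair D83D DE00.
    static final String SAMPLE = "a\uD83D\uDE00";

    // One int per char: the two surrogate halves stay separate elements.
    static int[] widened(String s) {
        return s.chars().toArray();
    }

    // One int per code point: the surrogate pair is combined into U+1F600.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(widened(SAMPLE)));    // three elements
        System.out.println(Arrays.toString(codePoints(SAMPLE))); // two elements
    }
}
```

The widened form has three elements for two user-visible characters, which is exactly the behavior the existing int methods are expected to keep.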

This behavior should remain, as by contract these methods must keep a one-to-one relation between each char element and the int element it is cast to.

However, for CharCollection and friends only, it might be useful to introduce codepoint versions of these widening methods, which would combine such pairs and return a sequence/stream/whatever of true code points, e.g. codepointIterator, codepointStream, etc.
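As a rough idea of what such a codepointIterator might do internally, here is a sketch built only on JDK types. The class name CodePointIteratorSketch is hypothetical, and it wraps a plain PrimitiveIterator.OfInt of char values rather than any real fastutil iterator. It assumes unpaired surrogates should be passed through unchanged, which is what String.codePoints() does; a real design might choose differently.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical sketch: combines surrogate pairs from an iterator of UTF-16 code units.
public class CodePointIteratorSketch implements PrimitiveIterator.OfInt {
    private final PrimitiveIterator.OfInt chars; // yields one char value per element
    private int pending = -1;                    // code unit read ahead but not yet returned, or -1

    public CodePointIteratorSketch(PrimitiveIterator.OfInt chars) {
        this.chars = chars;
    }

    @Override
    public boolean hasNext() {
        return pending >= 0 || chars.hasNext();
    }

    @Override
    public int nextInt() {
        if (!hasNext()) throw new NoSuchElementException();
        char first;
        if (pending >= 0) {
            first = (char) pending;
            pending = -1;
        } else {
            first = (char) chars.nextInt();
        }
        if (Character.isHighSurrogate(first) && chars.hasNext()) {
            char second = (char) chars.nextInt();
            if (Character.isLowSurrogate(second)) {
                return Character.toCodePoint(first, second);
            }
            pending = second; // not a valid pair: keep it for the next call
        }
        return first; // BMP char or unpaired surrogate, passed through unchanged
    }
}
```

For example, wrapping "a\uD83D\uDE00".chars().iterator() yields 'a' followed by the single code point U+1F600.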

However, supplementary characters, surrogate pairs and the like are notoriously mind-bending to handle correctly, so we would want to introduce this with care.

vigna commented 3 years ago

I'm not entirely sure an algorithmic library of containers is the right place to do this...

techsy730 commented 3 years ago

True. We could introduce a collector into StringBuilder or String instead, and then let users call the codepoint methods there. It would introduce an extra copy, yes, but it would also free us from the UTF-16 ugliness.
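A minimal sketch of that alternative, again using only the JDK: the helper toCodePoints is hypothetical, and the IntStream of char values stands in for whatever a char collection would hand over. The code point handling itself is left entirely to StringBuilder.codePoints().

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class CollectThenCodePoints {
    // Hypothetical helper: copies the widened char values into a StringBuilder
    // (the extra copy mentioned above), then lets the JDK combine surrogate pairs.
    static int[] toCodePoints(IntStream widenedChars) {
        StringBuilder sb = widenedChars.collect(
            StringBuilder::new,
            (b, c) -> b.append((char) c), // narrow each int back to its char
            StringBuilder::append);       // combiner, only used for parallel streams
        return sb.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a' followed by the surrogate pair for U+1F600
        int[] cps = toCodePoints(IntStream.of('a', 0xD83D, 0xDE00));
        System.out.println(Arrays.toString(cps));
    }
}
```

The trade-off is the one already noted: one extra copy of the data, in exchange for keeping all surrogate logic inside the JDK.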