Decode text in any charset (utf-16 and others)

soujiro32167 commented 4 years ago

Similar to fs2.text.utf8Decode, I would like to auto-detect and decode text in other charsets.

soujiro32167 commented 4 years ago

Looks like that's been around in http4s for a while! Thanks @rossabaker https://github.com/http4s/http4s/commit/32eaa8f82d541fe6de5877aeadf66f8ec9b81c41. This does not auto-detect, but it handles multi-byte boundaries across chunks. I ended up using https://github.com/albfernandez/juniversalchardet for auto-detection.
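To illustrate what "handles multi-byte boundaries across chunks" means, here is a minimal sketch using only java.nio. It is not the http4s implementation; the object and method names are hypothetical, and error handling for malformed input is left out. The idea is that bytes the decoder could not consume at the end of one chunk are carried over and prepended to the next.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.Charset

// Hypothetical helper, for illustration only.
object ChunkedDecode {
  def decodeChunks(chunks: List[Array[Byte]], cs: Charset): String = {
    val decoder = cs.newDecoder()
    val out = new StringBuilder
    // Bytes left over from the previous chunk (an incomplete character).
    var carry = Array.emptyByteArray

    def step(bytes: Array[Byte], endOfInput: Boolean): Unit = {
      val in = ByteBuffer.wrap(bytes)
      val cb = CharBuffer.allocate((bytes.length * decoder.maxCharsPerByte()).toInt + 1)
      // With endOfInput = false, a trailing partial character is left unconsumed in `in`.
      decoder.decode(in, cb, endOfInput)
      if (endOfInput) decoder.flush(cb)
      cb.flip()
      out.append(cb.toString)
      // Keep whatever the decoder did not consume for the next call.
      carry = new Array[Byte](in.remaining())
      in.get(carry)
    }

    chunks.foreach(chunk => step(carry ++ chunk, endOfInput = false))
    step(carry, endOfInput = true)
    out.toString
  }
}
```

Decoding each chunk on its own with new String(bytes, charset) would instead produce replacement characters whenever a character straddles two chunks.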

vasilmkd commented 4 years ago

I'm not entirely sure if automatic detection is feasible. I would however like to work on this. Which charsets should be prioritised? According to Google, the most used character sets are ASCII, ISO 8859-1 and UTF-8, the last of which is already supported.

rossabaker commented 4 years ago

I've been meaning to contribute http4s' decoder for a while. The main thing holding me back was wanting to be a bit more rigorous about the testing. The big thing holding me back from being rigorous about the testing is that it's a pain to create valid generators for arbitrary charsets. Ours also has an outstanding bug, where it strips BOMs that appear after the start of the stream.

I think @vasilmkd is on the right track focusing on a few. The three you mentioned are the big ones. I've also seen Windows-1252 a lot in the wild, though it's in decline. My experience is biased toward English, so I also found stats of unknown credibility. It would be good to choose something not Western European.

Six charsets (the ones in StandardCharsets) are guaranteed to be available on every JVM, and testing any others sacrifices portability. In practice, test environments are likely to come with all the significant ones, and you could skip a test when its charset is not found.
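A minimal sketch of that "skip the test where the charset is not found" idea, using only the JDK. The object name and the withCharset helper are hypothetical; a real test suite would hook this into its own skip or cancel mechanism.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper, for illustration only.
object CharsetSupport {
  import StandardCharsets._

  // The six charsets every JVM is required to provide.
  val guaranteed: List[Charset] =
    List(US_ASCII, ISO_8859_1, UTF_8, UTF_16BE, UTF_16LE, UTF_16)

  // Runs `test` only when the named charset is installed on this JVM;
  // returns None (i.e. "skipped") otherwise.
  def withCharset[A](name: String)(test: Charset => A): Option[A] =
    if (Charset.isSupported(name)) Some(test(Charset.forName(name)))
    else None
}
```

For example, CharsetSupport.withCharset("windows-1252")(cs => ...) would run the body on most JVMs and quietly return None on one that lacks the charset.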

You might be more efficient than the Java charset encoders by handwriting the single-byte codecs, but the http4s one should give a good start on the general problem.

I would hesitate to do autodetection in fs2, because of the liability of another dependency. A microlibrary that autodetects using juniversalchardet could be nice, and could delegate to this decoder once detected.

vasilmkd commented 4 years ago

@rossabaker Thanks for linking the StandardCharsets. I think it would be beneficial to support them. I also agree about the portability note.

Daenyth commented 4 years ago

FWIW, this is what we've been using. It doesn't support multibyte charsets (other than UTF-8), but it does handle the others:

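import java.nio.charset.{Charset, StandardCharsets}
import fs2.{Chunk, Pipe}

// The multibyte charsets among the JVM's standard six.
val multibyteCharsets = Set(
  StandardCharsets.UTF_8,
  StandardCharsets.UTF_16,
  StandardCharsets.UTF_16BE,
  StandardCharsets.UTF_16LE)

// Returns None for multibyte charsets that can't be decoded chunk by chunk.
def charsetPipe[F[_]](charset: Charset): Option[Pipe[F, Byte, String]] = {
  import StandardCharsets._
  charset match {
    // ASCII is a subset of UTF-8, so the UTF-8 decoder covers both.
    case UTF_8 | US_ASCII                   => Some(fs2.text.utf8Decode[F])
    case c if multibyteCharsets.contains(c) => None
    // Everything else is assumed to be single-byte, so each chunk decodes independently.
    case singleByteCharset =>
      Some(
        _.mapChunks(bs => Chunk.singleton(new String(bs.toBytes.toArray, singleByteCharset))))
  }
}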

rossabaker commented 4 years ago

The charsetPipe could generate false positives for single byte. You could call maxCharsPerByte() on a charset decoder to be sure.

Creating scalacheck generators for all the charsets is an important part of testing. Unfortunately, there isn't a standard method on charsets that produces the relevant alphabet. The standard six are easy, but testing others will probably require some spec reading.

rossabaker commented 4 years ago

Actually, you can come close to deriving the alphabet for an arbitrary charset with canEncode from a CharsetEncoder. It just returns false negatives for surrogates, which is why there is a CharSequence version. And surrogates are where a lot of the corner cases are.
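A rough sketch of that canEncode idea, using only the JDK (names are hypothetical). It enumerates the non-surrogate BMP characters a charset can encode, and uses the CharSequence overload for supplementary code points, which the Char overload cannot answer. It assumes the charset supports encoding; newEncoder() throws for the few that do not.

```scala
import java.nio.charset.Charset

// Hypothetical helper, for illustration only.
object CharsetAlphabet {
  // All non-surrogate BMP characters this charset can encode.
  def bmpAlphabet(cs: Charset): Vector[Char] = {
    val encoder = cs.newEncoder()
    (Char.MinValue to Char.MaxValue)
      .filter(c => !Character.isSurrogate(c) && encoder.canEncode(c))
      .toVector
  }

  // Supplementary code points need the CharSequence overload,
  // because a lone surrogate Char always reports false.
  def canEncodeCodePoint(cs: Charset, codePoint: Int): Boolean =
    cs.newEncoder().canEncode(new String(Character.toChars(codePoint)))
}
```

A ScalaCheck generator could then pick from bmpAlphabet(cs), with surrogate pairs generated separately so the corner cases are still covered.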

Daenyth commented 4 years ago

> The charsetPipe could generate false positives for single byte

Yup. For our use case it was OK, but it's not great for a general tool.

> maxCharsPerByte

neat, I didn't know that existed!

Would that then be something like

case c if c.newDecoder().maxCharsPerByte == 1 =>

(And cache the Decoder)

rossabaker commented 4 years ago

I'm not sure how expensive decoders are to create. You could iterate all available charsets on your JVM, create a decoder, and build a static set from there. I think JVMs can install charsets at runtime, but you're deep into the weeds by that point.
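A sketch of building that static set once, assuming Scala 2.13 (for scala.jdk.CollectionConverters). Note that this swaps in a different check from the one discussed above: a decoder's maxCharsPerByte() is also at least 1 for UTF-8, since every ASCII byte decodes to one char, so on its own it may not separate single-byte from multibyte charsets. The sketch uses the encoder's maxBytesPerChar() instead, and treats charsets that cannot encode as not single-byte. The names are hypothetical.

```scala
import java.nio.charset.Charset
import scala.jdk.CollectionConverters._

// Hypothetical helper, for illustration only.
object SingleByteCharsets {
  // True when no character ever needs more than one byte in this charset.
  def isSingleByte(cs: Charset): Boolean =
    cs.canEncode && cs.newEncoder().maxBytesPerChar() == 1.0f

  // Computed once, on first use, from the charsets installed on this JVM.
  lazy val all: Set[Charset] =
    Charset.availableCharsets().values().asScala.filter(isSingleByte).toSet
}
```

The match in charsetPipe could then use `case c if SingleByteCharsets.all.contains(c) =>` and return None for anything else, at the cost of missing charsets installed after the set is first computed.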

Daenyth commented 4 years ago

That's what I meant

soujiro32167 commented 4 years ago

Here is my current code:

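  import java.nio.charset.Charset
  import scala.util.Try
  import cats.effect.{Blocker, IO, Sync}
  import cats.implicits._
  import fs2.{Pipe, RaiseThrowable}
  import org.mozilla.universalchardet.UniversalDetector

  // CharsetError is our own error type (it extends Throwable).
  def stringToCharset(s: String): Either[CharsetError, Charset] =
    Try(Charset.forName(s)).toEither.leftMap(ex => CharsetError(ex.getMessage))

  // Feeds the first sampleSize bytes to juniversalchardet and emits the detected charset.
  def detectCharsetStream[F[_]](sampleSize: Long = sampleSize)(implicit F: Sync[F]): Pipe[F, Byte, Charset] = { s =>
    val detector = new UniversalDetector()
    s.take(sampleSize)
      .chunkAll
      .evalMap { sample =>
        detector.handleData(sample.toArray)
        detector.dataEnd()
        // getDetectedCharset returns null when nothing was detected.
        F.fromOption(Option(detector.getDetectedCharset), CharsetError("no charset detected"))
          .flatMap(stringToCharset _ andThen F.fromEither)
      }
  }

  // Delegates to the http4s decoder once the charset is known.
  def decode[F[_]: RaiseThrowable](charset: Charset): Pipe[F, Byte, String] =
    org.http4s.internal.decode(org.http4s.Charset.fromNioCharset(charset))

// Note: the file is read twice, once to detect the charset and once to decode.
val s  = fs2.io.file.readAll[IO](path, Blocker.liftExecutionContext(ec), 4096)
val result = s.through(detector.detectCharsetStream()).flatMap(cs => s.through(decode(cs)))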

susuro commented 4 years ago

A universal decoder is really missing in fs2. @rossabaker, I would appreciate it if you could contribute your solution from http4s. Even if it isn't as thoroughly tested as you would like, it is probably still better than many of us throwing together our own solutions that are even less rigorously tested. We can improve it later.

Daenyth commented 4 years ago

For reference, in http4s:

Some links that could be used to build a generator / sample text corpus: https://stackoverflow.com/questions/9190330/is-there-a-set-of-lorem-ipsums-files-for-testing-character-encoding-issues

typelevel / fs2

Decode text in any charset (utf-16 and others) #1928