purebred-mua / purebred-email

A fast email parsing library implemented in Haskell
https://hackage.haskell.org/package/purebred-email
GNU Affero General Public License v3.0

support more charsets #5

Closed frasertweedale closed 5 years ago

frasertweedale commented 6 years ago

Currently we only support the us-ascii, iso-8859-1 and utf-8 charsets, but many others are in common use. A corpus of my personal email also contained:

  -- , ("iso-8859-2", ...)
  -- , ("iso-8859-15", ...)
  -- , ("iso-2022-jp", ...)    (common)
  -- , ("windows-1252", ...)   (common)
  -- , ("windows-1256", ...)
  -- , ("cp1252", ...)         (alias of windows-1252)
  -- , ("big5", ...)           (common)
  -- , ("euc-kr", ...)
  -- , ("cp932", ...)
  -- , ("gb2312", ...)         (Chinese)

And there are undoubtedly many more we need to support.

The text-icu package is a binding to libicu with support for all common charsets. It does some things impurely (namely, loading converters), and its precise behaviour w.r.t. unrecognised charset names is not clear from the docs.

I'm unsure if we would want purebred-email to depend on text-icu, or if we're better off having pluggable charset support and a supplementary module for bringing the "expanded suite" via text-icu or some other means.
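
To make the "pluggable" option concrete, here is a rough sketch of what a charset lookup could look like. All names (`Decoder`, `CharsetLookup`, `defaultLookup`, `orElse`, `decodeWith`) are hypothetical and are not purebred-email's actual API. It assumes only the bytestring and text packages.

```haskell
import Data.Char (toLower)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical types: a decoder either fails with a message or yields Text.
type CharsetName = String
type Decoder = B.ByteString -> Either String T.Text

-- A lookup maps a charset name to a decoder, if it knows that charset.
type CharsetLookup = CharsetName -> Maybe Decoder

-- The three charsets supported today, as an association list.
builtinCharsets :: [(CharsetName, Decoder)]
builtinCharsets =
  [ ("us-ascii", decodeAscii)
  , ("iso-8859-1", Right . TE.decodeLatin1)
  , ("utf-8", either (Left . show) Right . TE.decodeUtf8')
  ]

-- Reject any byte outside the 7-bit range.
decodeAscii :: Decoder
decodeAscii bs
  | B.all (< 0x80) bs = Right (TE.decodeLatin1 bs)
  | otherwise = Left "non-ASCII byte in us-ascii data"

-- Charset names are matched case-insensitively.
defaultLookup :: CharsetLookup
defaultLookup name = lookup (map toLower name) builtinCharsets

-- Try the first lookup; fall back to the second if it does not know the name.
orElse :: CharsetLookup -> CharsetLookup -> CharsetLookup
orElse f g name = case f name of
  Nothing -> g name
  found -> found

-- Decode with whatever lookup the caller supplies.
decodeWith :: CharsetLookup -> CharsetName -> B.ByteString -> Either String T.Text
decodeWith look name bs =
  maybe (Left ("unsupported charset: " ++ name)) ($ bs) (look name)
```

A supplementary package could then export its own `CharsetLookup` (backed by text-icu or anything else), and an application would use `defaultLookup `orElse` thatLookup` without the core library depending on libicu.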

romanofski commented 5 years ago

windows-1252 is one I run into a lot. I do feel text-icu is quite a big dependency just for a few common charsets. I wonder how much work it would take to support the most common ones, like windows-1252, natively and use text-icu for the rest?
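
For a sense of scale, a standalone windows-1252 decoder is small, since the charset differs from iso-8859-1 only in the bytes 0x80 to 0x9F. This is a sketch, not code from purebred-email: the names `decodeWindows1252`, `toChar` and `c1Table` are made up. As an assumption, the five bytes the code page leaves undefined are mapped to the matching C1 control code points, following the WHATWG Encoding Standard.

```haskell
import Data.Char (chr)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T

-- Decode windows-1252 bytes to Text. Every byte maps to some code point,
-- so this decoder is total.
decodeWindows1252 :: B.ByteString -> T.Text
decodeWindows1252 = T.pack . map toChar . B.unpack

-- Bytes outside 0x80..0x9F are identical to iso-8859-1 (code point == byte).
toChar :: Word8 -> Char
toChar w
  | w >= 0x80 && w <= 0x9F = c1Table !! (fromIntegral w - 0x80)
  | otherwise = chr (fromIntegral w)

-- Code points for bytes 0x80..0x9F, in order. The undefined bytes
-- (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the same-valued C1 controls.
c1Table :: [Char]
c1Table =
  [ '\x20AC', '\x0081', '\x201A', '\x0192', '\x201E', '\x2026', '\x2020', '\x2021'
  , '\x02C6', '\x2030', '\x0160', '\x2039', '\x0152', '\x008D', '\x017D', '\x008F'
  , '\x0090', '\x2018', '\x2019', '\x201C', '\x201D', '\x2022', '\x2013', '\x2014'
  , '\x02DC', '\x2122', '\x0161', '\x203A', '\x0153', '\x009D', '\x017E', '\x0178'
  ]
```

The list-based table keeps the sketch short. A real implementation would more likely use an unboxed array and build the Text without going through `String`.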

frasertweedale commented 5 years ago

I've got upcoming changes to both purebred-email and purebred to support this, as well as our first official plugin, purebred-icu :)