haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License

[ fixed #119 ] latin1 encoding: each byte counts as 1 char #156

Closed: andreasabel closed this 4 years ago

andreasabel commented 4 years ago

The computation of the length component of AlexToken was tailored to the utf8 encoding, and didn't work correctly for latin1.

This is fixed by a new flag, ALEX_LATIN1, in templates/GenericTemplate.hs. When set, it turns on code that increases the length by 1 for each byte; for utf8 the length computation stays more involved, since a character may span several bytes.
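To illustrate the difference between the two modes, here is a minimal sketch. It is not the code from templates/GenericTemplate.hs: the `Encoding` type, `isContinuation`, and `tokenLength` are hypothetical names, the `Encoding` value stands in for the ALEX_LATIN1 CPP flag, and the utf8 branch assumes that only bytes which start a character add to the length.

```haskell
import Data.Word (Word8)

-- | Hypothetical stand-in for the ALEX_LATIN1 CPP flag.
data Encoding = Latin1 | Utf8
  deriving (Eq, Show)

-- | True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
isContinuation :: Word8 -> Bool
isContinuation b = b >= 0x80 && b < 0xC0

-- | Token length in characters, given the bytes consumed so far.
-- Latin1: every byte is one character.
-- Utf8 (assumed behaviour): continuation bytes do not start a new
-- character, so they are not counted.
tokenLength :: Encoding -> [Word8] -> Int
tokenLength Latin1 = length
tokenLength Utf8   = length . filter (not . isContinuation)
```

For the two bytes `[0xC3, 0xA9]`, `tokenLength Utf8` gives 1 (one two-byte character), while `tokenLength Latin1` gives 2 (two one-byte characters).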

The fix requires more template instances to be generated. To streamline instance generation, all 2^4 = 16 template instances are now generated, one for each combination of the 4 flags.

To ensure that template instances are referenced consistently, there is a function

  templateFileName

which is defined in both src/Main and gen-alex-sdist/Main. The two copies need to be kept in sync should more dimensions be added to the template.

(Putting this function into a separate file that is included by both modules could be an option, but that did not seem to be in the spirit of cabal-organized projects.)
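As a rough sketch of the idea (not the actual code in src/Main or gen-alex-sdist/Main): with 4 independent flags, every subset of flags corresponds to one template instance, and a single naming function maps a subset to a file name. Only ALEX_LATIN1 is named in this thread, so `FlagB`, `FlagC`, and `FlagD` below are placeholders, and the signature of `templateFileName`, the `"AlexTemplate"` base name, and the suffix scheme are assumptions made for illustration.

```haskell
import Data.List (intercalate, subsequences)

-- | Hypothetical flag set. Latin1 corresponds to ALEX_LATIN1;
-- FlagB, FlagC and FlagD are placeholders for the other three flags.
data Flag = Latin1 | FlagB | FlagC | FlagD
  deriving (Eq, Show, Enum, Bounded)

allFlags :: [Flag]
allFlags = [minBound .. maxBound]

-- | Every subset of the flags: 2^4 = 16 template instances.
allInstances :: [[Flag]]
allInstances = subsequences allFlags

-- | Hypothetical file-name suffix for each flag.
flagSuffix :: Flag -> String
flagSuffix Latin1 = "latin1"
flagSuffix FlagB  = "b"
flagSuffix FlagC  = "c"
flagSuffix FlagD  = "d"

-- | Hypothetical naming scheme. Walking allFlags in a fixed order makes
-- the result independent of the order in which the caller lists flags.
templateFileName :: [Flag] -> FilePath
templateFileName fs =
  intercalate "-" ("AlexTemplate" : [flagSuffix f | f <- allFlags, f `elem` fs])
```

In this sketch, adding a fifth constructor to `Flag` doubles the number of instances without further changes to `allInstances`, and `flagSuffix` is the one place that must be extended. With two copies of the naming function, as in the PR, both copies would need the same change.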

simonmar commented 4 years ago

Nice. Thanks!

mtolly commented 3 years ago

Hi, it looks like this (and some other merges) were not included in the recent Alex 3.2.6 release. Understandable since it was a stopgap for a GHC release.

This fix to the Latin-1 mode would be helpful in order to fix a language-c (and thus c2hs) issue: https://github.com/visq/language-c/issues/72

Any info on when a new release can happen with some of these PRs that have been merged since 3.2.5?

Ericson2314 commented 3 years ago

Yes, I suppose I should release another now that GHC is finally using 3.2.5. I did want to finish https://github.com/simonmar/alex/pull/174 first, so I guess I should get on that.