haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License
297 stars 82 forks source link

Get rid of overly long lines in generated .hs file #84

Open alanz opened 8 years ago

alanz commented 8 years ago

Some packages provide the generated files to ease installation. If a user wants to open these in an editor or run tooling over them, this presents challenges.

e.g. http://hackage.haskell.org/package/Agda-2.4.2.2/docs/src/Agda-Syntax-Parser-Lexer.html#lexer

Consider splitting the generated code over multiple lines for

alex_base
alex_table
alex_check
alex_dflt
alex_accept
sergv commented 7 years ago

Apparently, ghc 8.0.2 is currently picky about the way unboxed byte arrays are written.

For instance, consider basic_typeclass.g test. When I fiddle the way alex_base is defined, the compiled executable produces segfaults in the generated code on Debian 9.0 x64. In particular, the following (which is generated currently), works

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

while this produces a segfault

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\
                   \\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#
simonmar commented 7 years ago

This could be due to CPP interacting badly with the backslash-end-of-line sequence, see https://downloads.haskell.org/~ghc/master/users-guide/phases.html#cpp-and-string-gaps

alanz commented 7 years ago

I think the haskell multi-line string syntax is bad, given standard linux backslash at end of line behaviour.

sergv commented 7 years ago

But, then, what would be the best way to split these long bytearrays into manageable chunks?

alanz commented 7 years ago

Well, knowing that there is a CPP preprocessor for them always gives you options. Just use the CPP standard, so you end up with


alex_base :: AlexAddr
alex_base = AlexA# "\xf8\\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\
\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\
\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

On 11 June 2017 at 14:57, Sergey Vinokurov notifications@github.com wrote:

But, then, what would be the best way to split these long bytearrays into manageable chunks?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/issues/84#issuecomment-307627526, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZAB6b_Ke_pesgsh-p1bWJVC1Ej5Fxeks5sC-QhgaJpZM4HCQZX .

andreasabel commented 2 years ago

Apparently fixed by #107. The CHANGELOG is silent about it, though.
At least in 3.2.6-generated files we do not see long hexstrings anymore, but long lists with just one entry per line.

sergv commented 2 years ago

Unfortunately the #107 fix was reverted back since it didn't play well with cpphs which didn't like multiline string literals like

"foo\
\bar\
baz\
quux"

This is the original discussion that led to revert https://github.com/simonmar/alex/issues/116.

Perhaps cpphs is less of an issue nowadays and the fix could be restored? Thinking today about the cpphs not being able to handle the backslashes at end of lines at the time I'm not sure that it was cpphs doing the right thing and C preprocessors found in gcc and clang doing the worng thing. Given that standard C preprocessors could handle it I think cpphs being unable to do so is/was a bug, but admittedly I haven't checked the spec.

andreasabel commented 2 years ago

Ah sorry, I was mistaken. The problem still exists for alex -g. Without option -g, it uses listArray to construct the automaton table, with -g it hacks it together with Addr# and a huge string literal.

I suppose the -g optimization would only affect the startup-time of the lexer (once for each startup of the executable containing the lexer). So how valuable is that optimization?
@simonmar would probably know more...

sergv commented 2 years ago

I did some benchmarking of fast-tags lexer on a set of hackage packages (up to 10000 Haskell files totalling 70Mb) and ghc bytestring literal is slightly but consistently better than arrays performancewise.

I also tried to mess around with generated lexers and manually switch boxed arrays to unboxed ones, replace indexing with unsafe indexing that doesn't do bounds checking and even replace arrays with vectors from the vector package. The best I managed to get is a version that uses unboxed arrays in the generated lexer where possible and uses unsafe indexing (TestLexerArrayUnboxed below).

My benchmark results as reported by tasty-bench (look at X s ± Y ms bits, total time is not relevant). TestLexerGhcOpt is the lexer produced by alex -g, TestLexerVanilla is the lexer produced by alex without any options, and TestLexerArrayUnboxed and TestLexerVector are my modifications of TestLexerVanilla to try different arrays:

number of files   = 9801
total text length = 71068208
All
  TestLexerGhcOpt:       OK (42.86s)
    6.013 s ± 140 ms
  TestLexerVanilla:      OK (51.18s)
    7.225 s ±  76 ms
  TestLexerArrayUnboxed: OK (47.70s)
    6.679 s ±  53 ms
  TestLexerVector:       OK (20.92s)
    6.888 s ± 175 ms

However, runtime is not the only thing that changes when alex array backend is used. Big array constants that are generated when -g is not passed tend to take a lot of time to compile and somehow produce significantly more (order of magnitude?) object code. For the results below, the object files for modules generated by alex (and some of them further modified by me to use different arrays) look like:

-rw-r--r-- 1 sergey sergey  39M Feb  2 14:52 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 1.4M Feb  2 02:23 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  38M Feb  2 02:24 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  39M Feb  2 14:55 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

Even after stripping the difference is dramatic

-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 763K Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

I think the -g flag is a clear winner so far and personally I wouldn't like to get 1. longer compile times, 2. blowup in executable size and 3. performance hit (10% at least) just to be able to very occasionally look at the modules generated by alex. Surely, I tested only one lexer which is somewhat involved but doesn't exercise all alex features and it's not clear how representative it is of the more general uses of alex. I also have not tested standard backends that alex comes with.

PS I can share my messy benchmarks if anyone is curious.

andreasabel commented 2 years ago

Great investigation, @sergv !

sergv commented 2 years ago

@andreasabel I have an idea. I propose to restore my previous commit that makes lines smaller at the cost of meddling with preprocessor - this should mostly work with recent gcc and clang. For others I'll add a flag for alex, similar to --ghc or --debug that will disable long lines and thus e.g. projects that use only cpphs could always pass this flag and it'll work for them. Please share any objections.

andreasabel commented 1 year ago

@sergv : Sorry for the delayed response and thanks for the investigation.

What you suggest could be a solution but I wonder whether this issue is worth any complications, as it is purely aesthetic.

this should mostly work with recent gcc and clang.

To verify this, we would have to diversify our CI.

sergv commented 1 year ago

I can take a look at restoring commit and diversifying CI. It wouldn't cause that much complications (see my old PR) logicwise, yet I recall being pissed whenever I had to open Alex generated files. I didn't open any lately and admittedly it's the editors who are wrong at not being able to handle files with very long lines. Yet in practice we have the editors we have and looks like the issue can be alleviated in alex. As a user I definitely did want this - that's why I did the PR before that fixed it.

andreasabel commented 1 year ago

I can take a look at restoring commit and diversifying CI.

I suggest to create another workflow (independent of the semi-autogenerated haskell-ci.yml) that tests under Windows and macOS.