Open alanz opened 8 years ago
Apparently, ghc
8.0.2 is currently picky about the way unboxed byte arrays are written.
For instance, consider basic_typeclass.g
test. When I fiddle the way alex_base
is defined, the compiled executable produces segfaults in the generated code on Debian 9.0 x64. In particular, the following (which is generated currently), works
alex_base :: AlexAddr
alex_base = AlexA# "\xf8\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#
while this produces a segfault
alex_base :: AlexAddr
alex_base = AlexA# "\xf8\
\\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#
This could be due to CPP interacting badly with the backslash-end-of-line sequence, see https://downloads.haskell.org/~ghc/master/users-guide/phases.html#cpp-and-string-gaps
I think the haskell multi-line string syntax is bad, given standard linux backslash at end of line behaviour.
But, then, what would be the best way to split these long bytearrays into manageable chunks?
Well, knowing that there is a CPP preprocessor for them always gives you options. Just use the CPP standard, so you end up with
alex_base :: AlexAddr
alex_base = AlexA# "\xf8\\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\
\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\
\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#
On 11 June 2017 at 14:57, Sergey Vinokurov notifications@github.com wrote:
But, then, what would be the best way to split these long bytearrays into manageable chunks?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/issues/84#issuecomment-307627526, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZAB6b_Ke_pesgsh-p1bWJVC1Ej5Fxeks5sC-QhgaJpZM4HCQZX .
Apparently fixed by #107.
The CHANGELOG is silent about it, though.
At least in 3.2.6-generated files we do not see long hexstrings anymore, but long lists with just one entry per line.
Unfortunately the #107 fix was reverted back since it didn't play well with cpphs
which didn't like multiline string literals like
"foo\
\bar\
baz\
quux"
This is the original discussion that led to revert https://github.com/simonmar/alex/issues/116.
Perhaps cpphs is less of an issue nowadays and the fix could be restored? Thinking today about the cpphs
not being able to handle the backslashes at end of lines at the time I'm not sure that it was cpphs
doing the right thing and C preprocessors found in gcc
and clang
doing the worng thing. Given that standard C preprocessors could handle it I think cpphs
being unable to do so is/was a bug, but admittedly I haven't checked the spec.
Ah sorry, I was mistaken. The problem still exists for alex -g
. Without option -g
, it uses listArray
to construct the automaton table, with -g
it hacks it together with Addr#
and a huge string literal.
I suppose the -g
optimization would only affect the startup-time of the lexer (once for each startup of the executable containing the lexer). So how valuable is that optimization?
@simonmar would probably know more...
I did some benchmarking of fast-tags lexer on a set of hackage packages (up to 10000 Haskell files totalling 70Mb) and ghc bytestring literal is slightly but consistently better than arrays performancewise.
I also tried to mess around with generated lexers and manually switch boxed arrays to unboxed ones, replace indexing with unsafe indexing that doesn't do bounds checking and even replace arrays with vectors from the vector
package. The best I managed to get is a version that uses unboxed arrays in the generated lexer where possible and uses unsafe indexing (TestLexerArrayUnboxed
below).
My benchmark results as reported by tasty-bench
(look at X s ± Y ms
bits, total time is not relevant). TestLexerGhcOpt
is the lexer produced by alex -g
, TestLexerVanilla
is the lexer produced by alex
without any options, and TestLexerArrayUnboxed
and TestLexerVector
are my modifications of TestLexerVanilla
to try different arrays:
number of files = 9801
total text length = 71068208
All
TestLexerGhcOpt: OK (42.86s)
6.013 s ± 140 ms
TestLexerVanilla: OK (51.18s)
7.225 s ± 76 ms
TestLexerArrayUnboxed: OK (47.70s)
6.679 s ± 53 ms
TestLexerVector: OK (20.92s)
6.888 s ± 175 ms
However, runtime is not the only thing that changes when alex array backend is used. Big array constants that are generated when -g
is not passed tend to take a lot of time to compile and somehow produce significantly more (order of magnitude?) object code. For the results below, the object files for modules generated by alex (and some of them further modified by me to use different arrays) look like:
-rw-r--r-- 1 sergey sergey 39M Feb 2 14:52 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 1.4M Feb 2 02:23 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey 38M Feb 2 02:24 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey 39M Feb 2 14:55 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o
Even after stripping the difference is dramatic
-rw-r--r-- 1 sergey sergey 16M Feb 2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 763K Feb 2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey 16M Feb 2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey 16M Feb 2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o
I think the -g
flag is a clear winner so far and personally I wouldn't like to get 1. longer compile times, 2. blowup in executable size and 3. performance hit (10% at least) just to be able to very occasionally look at the modules generated by alex. Surely, I tested only one lexer which is somewhat involved but doesn't exercise all alex features and it's not clear how representative it is of the more general uses of alex. I also have not tested standard backends that alex comes with.
PS I can share my messy benchmarks if anyone is curious.
Great investigation, @sergv !
@andreasabel I have an idea. I propose to restore my previous commit that makes lines smaller at the cost of meddling with preprocessor - this should mostly work with recent gcc and clang. For others I'll add a flag for alex, similar to --ghc or --debug that will disable long lines and thus e.g. projects that use only cpphs could always pass this flag and it'll work for them. Please share any objections.
@sergv : Sorry for the delayed response and thanks for the investigation.
What you suggest could be a solution but I wonder whether this issue is worth any complications, as it is purely aesthetic.
this should mostly work with recent gcc and clang.
To verify this, we would have to diversify our CI.
I can take a look at restoring commit and diversifying CI. It wouldn't cause that much complications (see my old PR) logicwise, yet I recall being pissed whenever I had to open Alex generated files. I didn't open any lately and admittedly it's the editors who are wrong at not being able to handle files with very long lines. Yet in practice we have the editors we have and looks like the issue can be alleviated in alex. As a user I definitely did want this - that's why I did the PR before that fixed it.
I can take a look at restoring commit and diversifying CI.
I suggest to create another workflow (independent of the semi-autogenerated haskell-ci.yml
) that tests under Windows and macOS.
Some packages provide the generated files to ease installation. If a user wants to open these in an editor or run tooling over them, this presents challenges.
e.g. http://hackage.haskell.org/package/Agda-2.4.2.2/docs/src/Agda-Syntax-Parser-Lexer.html#lexer
Consider splitting the generated code over multiple lines for