Open aolney opened 6 years ago
The following cabal files in pandoc's transitive dependencies contain c-sources stanzas. In some cases they may only be used for testing or benchmarking.
time.cabal:72: c-sources: lib/cbits/HsTime.c
time.cabal:179: c-sources: test/unix/Test/Format/FormatStuff.c
aeson.cabal:156: c-sources: cbits/unescape_string.c
aeson.cabal:166: c-sources: cbits/unescape_string.c
cryptonite.cabal:240: C-sources: cbits/cryptonite_chacha.c
cryptonite.cabal:273: C-sources: cbits/decaf/p448/arch_ref64/f_impl.c
cryptonite.cabal:284: C-sources: cbits/decaf/p448/arch_32/f_impl.c
cryptonite.cabal:296: C-sources: cbits/curve25519/curve25519-donna-c64.c
cryptonite.cabal:298: C-sources: cbits/curve25519/curve25519-donna.c
cryptonite.cabal:315: c-sources: cbits/cryptonite_rdrand.c
cryptonite.cabal:321: C-sources: cbits/aes/x86ni.c
cryptonite.cabal:326: C-sources: cbits/aes/generic.c
cryptonite.cabal:331: C-sources: cbits/blake2/sse/blake2s.c
cryptonite.cabal:337: C-sources: cbits/blake2/ref/blake2s-ref.c
cryptonite.cabal:346: C-sources: cbits/argon2/argon2.c
basement.cabal:158: c-sources: cbits/foundation_mem.c
network.cabal:74: c-sources: cbits/HsNet.c
zlib.cabal:89: c-sources: cbits/adler32.c cbits/compress.c cbits/crc32.c
integer-gmp.cabal:65: c-sources:
hashable.cabal:64: C-sources:
hashable.cabal:71: c-sources: cbits/getRandomBytes.c
hashable.cabal:141: c-sources:
hashable.cabal:149: c-sources:
hashable.cabal:154: c-sources:
hashable.cabal:161: c-sources: cbits/getRandomBytes.c
primitive.cabal:59: c-sources: cbits/primitive-memops.c
clock.cabal:74: c-sources: cbits/hs_clock_darwin.c
clock.cabal:76: c-sources: cbits/hs_clock_win32.c
cmark-gfm.cabal:82: c-sources: cbits/houdini_html_u.c
streaming-commons.cabal:61: c-sources: cbits/zlib-helper.c
text-short.cabal:51: c-sources: cbits/memcmp.c
text-short.cabal:65: c-sources: cbits/cbits.c
ghc-prim.cabal:70: c-sources:
process.cabal:66: c-sources:
bytestring.cabal:129: c-sources: cbits/fpstring.c
bytestring.cabal:157: c-sources: cbits/fpstring.c
bytestring.cabal:178: c-sources: cbits/fpstring.c
bytestring.cabal:221: c-sources: cbits/fpstring.c
text.cabal:93: c-sources: cbits/cbits.c
text.cabal:173: c-sources: cbits/cbits.c
unix.cabal:133: c-sources:
old-time.cabal:48: c-sources:
hourglass.cabal:52: c-sources: cbits/unix.c
foundation.cabal:169: c-sources: cbits/foundation_random.c
criterion.cabal:77: c-sources: cbits/cycles.c
criterion.cabal:79: c-sources: cbits/time-osx.c
criterion.cabal:82: c-sources: cbits/time-windows.c
criterion.cabal:84: c-sources: cbits/time-posix.c
yaml.cabal:77: c-sources: c/helper.c
yaml.cabal:84: c-sources: libyaml/api.c,
binary.cabal:113: c-sources: benchmarks/CBenchmark.c
hslua.cabal:103: c-sources: safer-api/safer-api.c
hslua.cabal:124: c-sources: lua-5.3.4/lapi.c
regex-pcre-builtin.cabal:73: C-sources:
base.cabal:324: c-sources:
Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.
It would not be an option, probably.
But, honestly, the places where we use C libraries are not the performance bottlenecks. You shouldn't see much difference, e.g., because of the switch-out of the YAML parser.
Kolen Cheung notifications@github.com writes:
Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/jgm/pandoc/issues/4535#issuecomment-430086856
If it is one way or the other, then I think pure Haskell with flexibility is better than a bit faster (honestly pandoc’s power is always its flexibility not performance, which is more important most of the time.)
I have to revise what I said about the YAML parser.
See #6084 - we may need to go back to using yaml
.
I've got a newregex
branch of skylighting now that uses a Haskell implementation of KDE regexes.
It's 2-4 times slower on benchmarks. (EDIT: added later, I've managed to improve the benchmarks quite a bit. Most are about 2x now.)
Profiling suggests that the time is spent on compiling rather than matching regexes.
21.1 13.4 - Text.Regex.KDE.Compile compileRegex (665)
21.1 13.4 - Text.Regex.KDE.Compile compileRegex.parser (665)
21.1 13.4 1.1 Data.Attoparsec.Internal.Types >>= (150410)
20.0 12.3 3.5 Data.Attoparsec.Internal.Types >>=.\ (372896)
15.2 8.6 7.8 Data.Attoparsec.Internal.Types >>=.\.succ' (250817)
2.4 - - Text.Regex.KDE.Compile pRegex (1631)
1.3 .2 - Data.Attoparsec.Internal.Types fail (98371)
1.3 .2 .3 Data.Attoparsec.Internal.Types fail.\ (98384)
.5 - - Text.Regex.KDE.Compile pRegex (0)
.3 - - Text.Regex.KDE.Compile pSuffix (0)
.3 - - Data.Attoparsec.Internal.Types plus (0)
1.3 - - Data.Attoparsec.Internal.Types pure (70980)
.5 - - Data.Attoparsec.Internal.Types plus (20678)
.5 - - Text.Regex.KDE.Compile char (0)
.5 .2 .3 Text.Regex.KDE.Compile pRegexCharClass.getC (0)
.3 - - Text.Regex.KDE.Compile char (0)
.5 - - Text.Regex.KDE.Compile pSuffix (6817)
.3 .3 - Data.Attoparsec.Internal.Types many (0)
.3 .3 .3 Data.Attoparsec.Internal.Types many.some_v (0)
.8 - - Text.Regex.KDE.Compile pRegex (1229)
.5 .2 - Data.Attoparsec.Internal demandInput (0)
.5 .2 - Data.Attoparsec.Internal demandInput.\ (8577)
.5 .2 .3 Data.Attoparsec.Internal.Types >>=.\.succ' (3825)
.3 - - Text.Regex.KDE.Compile pRegex (0)
2.4 2.4 - Data.Attoparsec.Internal.Types plus (52569)
2.4 2.4 1.6 Data.Attoparsec.Internal.Types plus.\ (121358)
.8 .8 .6 Data.Attoparsec.Internal.Types plus.\.lose' (54707)
.3 .3 .3 Text.Regex.KDE.Compile pRegexCharClass (0)
1.9 1.9 1.9 Skylighting.Syntax.Php syntax (1)
3.7 1.8 1.7 Text.Regex.KDE.Compile pRegexPart (2860)
1.8 .2 .3 Text.Regex.KDE.Compile pRegexChar (2860)
1.3 - - Data.Attoparsec.Internal.Types plus (11440)
.3 - - Text.Regex.KDE.Compile pRegexAsciiChar (0)
.3 - - Data.Attoparsec.Internal.Types fmap (0)
.3 - - Text.Regex.KDE.Compile withGroupModifiers (0)
.3 - - Text.Regex.KDE.Compile char (0)
2.0 1.7 - Data.Attoparsec.Internal.Types fmap (8365)
2.0 1.7 1.1 Data.Attoparsec.Internal.Types fmap.\ (57527)
.6 .6 .6 Data.Attoparsec.Internal.Types fmap.\.succ' (4518)
.3 - - Text.Regex.KDE.Compile pRegexPart (0)
I'm releasing skylighting 0.9, with no C dependencies. Once pandoc depends on this, we'll have checked off the main three C dependencies listed above. It would be worth checking more carefully to see if there are others downstream.
Hi, just trying to clarify about the YAML situation in pandoc.
First a short summary,
So the question is which direction are we going? HsYAML or yaml? YAML 1.1 or 1.2?
Thanks.
Edit: add reply from below to summary, reformat summary a bit.
Answer: I don't know. I like avoiding the C dependency. I don't know how much the performance issue is affecting users. For now, we stay with HsYAML. But I'd be happier with that if upstream were more responsive in looking at the performance issue.
Running the command in #6084, pandoc 2.7.3 vs pandoc 2.11.3 differs from 1.42s to 16.84s on my computer. So the ratio is like a factor of dozen. So I don't know if the gap is big enough to make a difference—how close to C performance can we expect? Even if they can reduce it by a factor of 2, it would still be ~8s which to a human is not that discernible.
Also, I think it is still roughly linear time (https://github.com/jgm/pandoc/issues/6084#issuecomment-748017482), so is the problem is really that problematic?
Lastly the example over there probably is a practical upper bound (how big do we expect people injecting bibliography?) Less than a minute is still quite good in worst case scenario.
Another idea: could we add a JSON reader in citeproc and ask people for large bibliography to inject JSON instead? Then people can preprocess YAML to JSON, and inject the JSON to pandoc/citeproc. (With the assumption that reading JSON should be much faster.)
Yes, it's linear. But the difference between 25 seconds and 5 seconds is a significant one, probably enough to discourage anyone using a YAML bibliography that size. Maybe that's okay. They can always use CSL JSON (already possible), which is pretty fast. Bibtex will have performance intermediate between the two.
I’d say a factor of 5 between Haskell and C is pretty good. Pandoc vs the fastest markdown parser probably has a much greater ratio.
But on the other hand, the scenario to have a very big bibliography (where the user might only cite a few from it) disproportional to the length of what they may have in markdown seems more probable (comparing to having a huge markdown.)
We should be much closer now that we've separated out Lua. Is there something left?
Edit: I forgot that we switched back from HsYAML to yaml, so I guess that's the last roadblock now.
On top of my head, probably yaml still is a C dependency because of performance issue.
Following from this discussion there seems to be a value to locating non-Haskell dependencies and replacing them with Haskell.
@jgm has identified the following (comments his):
[x] skylighting - depends on pcre - not clear how we can eliminate that without a pure Haskell regex library that matches pcre's features. (Perhaps we could provide a compiler flag to build without highlighting support, however.)
[x] yaml - depends on libyaml - there are some proposals in Google Summer of Code for writing a better pure Haskell yaml parser, so this may be done in the relatively near future.
[x] commonmark/gfm - depends on libcmark - I have been working on a pure Haskell commonmark parser, which isn't yet published but which could substitute for libcmark (at the cost of reduced performance). https://github.com/jgm/commonmark-hs
[ ] confirm absence of other dependencies using HackageDB to chase down all of pandoc's transitive dependencies and exhaustively enumerate those that depend on foreign libraries.