jgm / pandoc

Universal markup converter
https://pandoc.org
Other
33.2k stars 3.3k forks source link

Removing non-Haskell dependencies (C dependencies) #4535

Open aolney opened 6 years ago

aolney commented 6 years ago

Following from this discussion there seems to be a value to locating non-Haskell dependencies and replacing them with Haskell.

@jgm has identified the following (comments his):

jgm commented 6 years ago

The following cabal files in pandoc's transitive dependencies contain c-sources stanzas. In some cases they may only be used for testing or benchmarking.

time.cabal:72:    c-sources: lib/cbits/HsTime.c
time.cabal:179:    c-sources: test/unix/Test/Format/FormatStuff.c
aeson.cabal:156:    c-sources: cbits/unescape_string.c
aeson.cabal:166:  c-sources: cbits/unescape_string.c
cryptonite.cabal:240:  C-sources:         cbits/cryptonite_chacha.c
cryptonite.cabal:273:    C-sources:         cbits/decaf/p448/arch_ref64/f_impl.c
cryptonite.cabal:284:    C-sources:         cbits/decaf/p448/arch_32/f_impl.c
cryptonite.cabal:296:    C-sources: cbits/curve25519/curve25519-donna-c64.c
cryptonite.cabal:298:    C-sources: cbits/curve25519/curve25519-donna.c
cryptonite.cabal:315:    c-sources:      cbits/cryptonite_rdrand.c
cryptonite.cabal:321:    C-sources:       cbits/aes/x86ni.c
cryptonite.cabal:326:    C-sources:       cbits/aes/generic.c
cryptonite.cabal:331:    C-sources:      cbits/blake2/sse/blake2s.c
cryptonite.cabal:337:    C-sources:      cbits/blake2/ref/blake2s-ref.c
cryptonite.cabal:346:  C-sources:      cbits/argon2/argon2.c
basement.cabal:158:  c-sources:         cbits/foundation_mem.c
network.cabal:74:  c-sources: cbits/HsNet.c
zlib.cabal:89:      c-sources:   cbits/adler32.c cbits/compress.c cbits/crc32.c
integer-gmp.cabal:65:  c-sources:
hashable.cabal:64:  C-sources:
hashable.cabal:71:    c-sources:         cbits/getRandomBytes.c
hashable.cabal:141:  c-sources:
hashable.cabal:149:    c-sources:
hashable.cabal:154:      c-sources:
hashable.cabal:161:    c-sources:         cbits/getRandomBytes.c
primitive.cabal:59:  c-sources: cbits/primitive-memops.c
clock.cabal:74:      c-sources:         cbits/hs_clock_darwin.c
clock.cabal:76:      c-sources:         cbits/hs_clock_win32.c
cmark-gfm.cabal:82:    c-sources:         cbits/houdini_html_u.c
streaming-commons.cabal:61:  c-sources:           cbits/zlib-helper.c
text-short.cabal:51:     c-sources:        cbits/memcmp.c
text-short.cabal:65:  c-sources: cbits/cbits.c
ghc-prim.cabal:70:    c-sources:
process.cabal:66:    c-sources:
bytestring.cabal:129:  c-sources:         cbits/fpstring.c
bytestring.cabal:157:  c-sources:        cbits/fpstring.c
bytestring.cabal:178:  c-sources:        cbits/fpstring.c
bytestring.cabal:221:  c-sources:        cbits/fpstring.c
text.cabal:93:  c-sources:    cbits/cbits.c
text.cabal:173:  c-sources:      cbits/cbits.c
unix.cabal:133:    c-sources:
old-time.cabal:48:    c-sources:
hourglass.cabal:52:     c-sources:      cbits/unix.c
foundation.cabal:169:  c-sources:         cbits/foundation_random.c
criterion.cabal:77:  c-sources: cbits/cycles.c
criterion.cabal:79:    c-sources: cbits/time-osx.c
criterion.cabal:82:      c-sources: cbits/time-windows.c
criterion.cabal:84:      c-sources: cbits/time-posix.c
yaml.cabal:77:    c-sources:       c/helper.c
yaml.cabal:84:            c-sources:       libyaml/api.c,
binary.cabal:113:  c-sources: benchmarks/CBenchmark.c
hslua.cabal:103:    c-sources:          safer-api/safer-api.c
hslua.cabal:124:    c-sources:          lua-5.3.4/lapi.c
regex-pcre-builtin.cabal:73:  C-sources:
base.cabal:324:    c-sources:
ickc commented 5 years ago

Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.

jgm commented 5 years ago

It would not be an option, probably.

But, honestly, the places where we use C libraries are not the performance bottlenecks. You shouldn't see much difference, e.g., because of the switch-out of the YAML parser.

Kolen Cheung notifications@github.com writes:

Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/jgm/pandoc/issues/4535#issuecomment-430086856

ickc commented 5 years ago

If it is one way or the other, then I think pure Haskell with flexibility is better than a bit faster (honestly pandoc’s power is always its flexibility not performance, which is more important most of the time.)

jgm commented 3 years ago

I have to revise what I said about the YAML parser. See #6084 - we may need to go back to using yaml.

jgm commented 3 years ago

I've got a newregex branch of skylighting now that uses a Haskell implementation of KDE regexes. It's 2-4 times slower on benchmarks. (EDIT: added later, I've managed to improve the benchmarks quite a bit. Most are about 2x now.) Profiling suggests that the time is spent on compiling rather than matching regexes.

21.1  13.4     -  Text.Regex.KDE.Compile compileRegex (665)
21.1  13.4     -    Text.Regex.KDE.Compile compileRegex.parser (665)
21.1  13.4   1.1      Data.Attoparsec.Internal.Types >>= (150410)
20.0  12.3   3.5        Data.Attoparsec.Internal.Types >>=.\ (372896)
15.2   8.6   7.8          Data.Attoparsec.Internal.Types >>=.\.succ' (250817)
 2.4     -     -            Text.Regex.KDE.Compile pRegex (1631)
 1.3    .2     -            Data.Attoparsec.Internal.Types fail (98371)
 1.3    .2    .3              Data.Attoparsec.Internal.Types fail.\ (98384)
  .5     -     -                Text.Regex.KDE.Compile pRegex (0)
  .3     -     -                Text.Regex.KDE.Compile pSuffix (0)
  .3     -     -                Data.Attoparsec.Internal.Types plus (0)
 1.3     -     -            Data.Attoparsec.Internal.Types pure (70980)
  .5     -     -            Data.Attoparsec.Internal.Types plus (20678)
  .5     -     -            Text.Regex.KDE.Compile char (0)
  .5    .2    .3            Text.Regex.KDE.Compile pRegexCharClass.getC (0)
  .3     -     -              Text.Regex.KDE.Compile char (0)
  .5     -     -            Text.Regex.KDE.Compile pSuffix (6817)
  .3    .3     -            Data.Attoparsec.Internal.Types many (0)
  .3    .3    .3              Data.Attoparsec.Internal.Types many.some_v (0)
  .8     -     -          Text.Regex.KDE.Compile pRegex (1229)
  .5    .2     -          Data.Attoparsec.Internal demandInput (0)
  .5    .2     -            Data.Attoparsec.Internal demandInput.\ (8577)
  .5    .2    .3              Data.Attoparsec.Internal.Types >>=.\.succ' (3825)
  .3     -     -                Text.Regex.KDE.Compile pRegex (0)

 2.4   2.4     -  Data.Attoparsec.Internal.Types plus (52569)
 2.4   2.4   1.6    Data.Attoparsec.Internal.Types plus.\ (121358)
  .8    .8    .6      Data.Attoparsec.Internal.Types plus.\.lose' (54707)
  .3    .3    .3        Text.Regex.KDE.Compile pRegexCharClass (0)

 1.9   1.9   1.9  Skylighting.Syntax.Php syntax (1)

 3.7   1.8   1.7  Text.Regex.KDE.Compile pRegexPart (2860)
 1.8    .2    .3    Text.Regex.KDE.Compile pRegexChar (2860)
 1.3     -     -      Data.Attoparsec.Internal.Types plus (11440)
  .3     -     -      Text.Regex.KDE.Compile pRegexAsciiChar (0)
  .3     -     -        Data.Attoparsec.Internal.Types fmap (0)
  .3     -     -    Text.Regex.KDE.Compile withGroupModifiers (0)
  .3     -     -      Text.Regex.KDE.Compile char (0)

 2.0   1.7     -  Data.Attoparsec.Internal.Types fmap (8365)
 2.0   1.7   1.1    Data.Attoparsec.Internal.Types fmap.\ (57527)
  .6    .6    .6      Data.Attoparsec.Internal.Types fmap.\.succ' (4518)
  .3     -     -      Text.Regex.KDE.Compile pRegexPart (0)
jgm commented 3 years ago

I'm releasing skylighting 0.9, with no C dependencies. Once pandoc depends on this, we'll have checked off the main three C dependencies listed above. It would be worth checking more carefully to see if there are others downstream.

ickc commented 3 years ago

Hi, just trying to clarify about the YAML situation in pandoc.

First a short summary,

So the question is which direction are we going? HsYAML or yaml? YAML 1.1 or 1.2?

Thanks.

Edit: add reply from below to summary, reformat summary a bit.

jgm commented 3 years ago

Answer: I don't know. I like avoiding the C dependency. I don't know how much the performance issue is affecting users. For now, we stay with HsYAML. But I'd be happier with that if upstream were more responsive in looking at the performance issue.

ickc commented 3 years ago

Running the command in #6084, pandoc 2.7.3 vs pandoc 2.11.3 differs from 1.42s to 16.84s on my computer. So the ratio is like a factor of dozen. So I don't know if the gap is big enough to make a difference—how close to C performance can we expect? Even if they can reduce it by a factor of 2, it would still be ~8s which to a human is not that discernible.

Also, I think it is still roughly linear time (https://github.com/jgm/pandoc/issues/6084#issuecomment-748017482), so is the problem is really that problematic?

Lastly the example over there probably is a practical upper bound (how big do we expect people injecting bibliography?) Less than a minute is still quite good in worst case scenario.

Another idea: could we add a JSON reader in citeproc and ask people for large bibliography to inject JSON instead? Then people can preprocess YAML to JSON, and inject the JSON to pandoc/citeproc. (With the assumption that reading JSON should be much faster.)

jgm commented 3 years ago

Yes, it's linear. But the difference between 25 seconds and 5 seconds is a significant one, probably enough to discourage anyone using a YAML bibliography that size. Maybe that's okay. They can always use CSL JSON (already possible), which is pretty fast. Bibtex will have performance intermediate between the two.

ickc commented 3 years ago

I’d say a factor of 5 between Haskell and C is pretty good. Pandoc vs the fastest markdown parser probably has a much greater ratio.

But on the other hand, the scenario to have a very big bibliography (where the user might only cite a few from it) disproportional to the length of what they may have in markdown seems more probable (comparing to having a huge markdown.)

tarleb commented 1 year ago

We should be much closer now that we've separated out Lua. Is there something left?

Edit: I forgot that we switched back from HsYAML to yaml, so I guess that's the last roadblock now.

ickc commented 1 year ago

On top of my head, probably yaml still is a C dependency because of performance issue.