Removing non-Haskell dependencies (C dependencies)

aolney commented 6 years ago

Following from this discussion there seems to be a value to locating non-Haskell dependencies and replacing them with Haskell.

@jgm has identified the following (comments his):

[x] skylighting - depends on pcre - not clear how we can eliminate that without a pure Haskell regex library that matches pcre's features. (Perhaps we could provide a compiler flag to build without highlighting support, however.)
[x] yaml - depends on libyaml - there are some proposals in Google Summer of Code for writing a better pure Haskell yaml parser, so this may be done in the relatively near future.
[x] commonmark/gfm - depends on libcmark - I have been working on a pure Haskell commonmark parser, which isn't yet published but which could substitute for libcmark (at the cost of reduced performance). https://github.com/jgm/commonmark-hs
[ ] confirm absence of other dependencies using HackageDB to chase down all of pandoc's transitive dependencies and exhaustively enumerate those that depend on foreign libraries.

jgm commented 6 years ago

The following cabal files in pandoc's transitive dependencies contain c-sources stanzas. In some cases they may only be used for testing or benchmarking.

time.cabal:72:    c-sources: lib/cbits/HsTime.c
time.cabal:179:    c-sources: test/unix/Test/Format/FormatStuff.c
aeson.cabal:156:    c-sources: cbits/unescape_string.c
aeson.cabal:166:  c-sources: cbits/unescape_string.c
cryptonite.cabal:240:  C-sources:         cbits/cryptonite_chacha.c
cryptonite.cabal:273:    C-sources:         cbits/decaf/p448/arch_ref64/f_impl.c
cryptonite.cabal:284:    C-sources:         cbits/decaf/p448/arch_32/f_impl.c
cryptonite.cabal:296:    C-sources: cbits/curve25519/curve25519-donna-c64.c
cryptonite.cabal:298:    C-sources: cbits/curve25519/curve25519-donna.c
cryptonite.cabal:315:    c-sources:      cbits/cryptonite_rdrand.c
cryptonite.cabal:321:    C-sources:       cbits/aes/x86ni.c
cryptonite.cabal:326:    C-sources:       cbits/aes/generic.c
cryptonite.cabal:331:    C-sources:      cbits/blake2/sse/blake2s.c
cryptonite.cabal:337:    C-sources:      cbits/blake2/ref/blake2s-ref.c
cryptonite.cabal:346:  C-sources:      cbits/argon2/argon2.c
basement.cabal:158:  c-sources:         cbits/foundation_mem.c
network.cabal:74:  c-sources: cbits/HsNet.c
zlib.cabal:89:      c-sources:   cbits/adler32.c cbits/compress.c cbits/crc32.c
integer-gmp.cabal:65:  c-sources:
hashable.cabal:64:  C-sources:
hashable.cabal:71:    c-sources:         cbits/getRandomBytes.c
hashable.cabal:141:  c-sources:
hashable.cabal:149:    c-sources:
hashable.cabal:154:      c-sources:
hashable.cabal:161:    c-sources:         cbits/getRandomBytes.c
primitive.cabal:59:  c-sources: cbits/primitive-memops.c
clock.cabal:74:      c-sources:         cbits/hs_clock_darwin.c
clock.cabal:76:      c-sources:         cbits/hs_clock_win32.c
cmark-gfm.cabal:82:    c-sources:         cbits/houdini_html_u.c
streaming-commons.cabal:61:  c-sources:           cbits/zlib-helper.c
text-short.cabal:51:     c-sources:        cbits/memcmp.c
text-short.cabal:65:  c-sources: cbits/cbits.c
ghc-prim.cabal:70:    c-sources:
process.cabal:66:    c-sources:
bytestring.cabal:129:  c-sources:         cbits/fpstring.c
bytestring.cabal:157:  c-sources:        cbits/fpstring.c
bytestring.cabal:178:  c-sources:        cbits/fpstring.c
bytestring.cabal:221:  c-sources:        cbits/fpstring.c
text.cabal:93:  c-sources:    cbits/cbits.c
text.cabal:173:  c-sources:      cbits/cbits.c
unix.cabal:133:    c-sources:
old-time.cabal:48:    c-sources:
hourglass.cabal:52:     c-sources:      cbits/unix.c
foundation.cabal:169:  c-sources:         cbits/foundation_random.c
criterion.cabal:77:  c-sources: cbits/cycles.c
criterion.cabal:79:    c-sources: cbits/time-osx.c
criterion.cabal:82:      c-sources: cbits/time-windows.c
criterion.cabal:84:      c-sources: cbits/time-posix.c
yaml.cabal:77:    c-sources:       c/helper.c
yaml.cabal:84:            c-sources:       libyaml/api.c,
binary.cabal:113:  c-sources: benchmarks/CBenchmark.c
hslua.cabal:103:    c-sources:          safer-api/safer-api.c
hslua.cabal:124:    c-sources:          lua-5.3.4/lapi.c
regex-pcre-builtin.cabal:73:  C-sources:
base.cabal:324:    c-sources:

ickc commented 5 years ago

Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.

jgm commented 5 years ago

It would not be an option, probably.

But, honestly, the places where we use C libraries are not the performance bottlenecks. You shouldn't see much difference, e.g., because of the switch-out of the YAML parser.

Kolen Cheung notifications@github.com writes:

Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/jgm/pandoc/issues/4535#issuecomment-430086856

ickc commented 5 years ago

If it is one way or the other, then I think pure Haskell with flexibility is better than a bit faster (honestly pandoc’s power is always its flexibility not performance, which is more important most of the time.)

jgm commented 3 years ago

I have to revise what I said about the YAML parser. See #6084 - we may need to go back to using yaml.

jgm commented 3 years ago

I've got a newregex branch of skylighting now that uses a Haskell implementation of KDE regexes. It's 2-4 times slower on benchmarks. (EDIT: added later, I've managed to improve the benchmarks quite a bit. Most are about 2x now.) Profiling suggests that the time is spent on compiling rather than matching regexes.

21.1  13.4     -  Text.Regex.KDE.Compile compileRegex (665)
21.1  13.4     -    Text.Regex.KDE.Compile compileRegex.parser (665)
21.1  13.4   1.1      Data.Attoparsec.Internal.Types >>= (150410)
20.0  12.3   3.5        Data.Attoparsec.Internal.Types >>=.\ (372896)
15.2   8.6   7.8          Data.Attoparsec.Internal.Types >>=.\.succ' (250817)
 2.4     -     -            Text.Regex.KDE.Compile pRegex (1631)
 1.3    .2     -            Data.Attoparsec.Internal.Types fail (98371)
 1.3    .2    .3              Data.Attoparsec.Internal.Types fail.\ (98384)
  .5     -     -                Text.Regex.KDE.Compile pRegex (0)
  .3     -     -                Text.Regex.KDE.Compile pSuffix (0)
  .3     -     -                Data.Attoparsec.Internal.Types plus (0)
 1.3     -     -            Data.Attoparsec.Internal.Types pure (70980)
  .5     -     -            Data.Attoparsec.Internal.Types plus (20678)
  .5     -     -            Text.Regex.KDE.Compile char (0)
  .5    .2    .3            Text.Regex.KDE.Compile pRegexCharClass.getC (0)
  .3     -     -              Text.Regex.KDE.Compile char (0)
  .5     -     -            Text.Regex.KDE.Compile pSuffix (6817)
  .3    .3     -            Data.Attoparsec.Internal.Types many (0)
  .3    .3    .3              Data.Attoparsec.Internal.Types many.some_v (0)
  .8     -     -          Text.Regex.KDE.Compile pRegex (1229)
  .5    .2     -          Data.Attoparsec.Internal demandInput (0)
  .5    .2     -            Data.Attoparsec.Internal demandInput.\ (8577)
  .5    .2    .3              Data.Attoparsec.Internal.Types >>=.\.succ' (3825)
  .3     -     -                Text.Regex.KDE.Compile pRegex (0)

 2.4   2.4     -  Data.Attoparsec.Internal.Types plus (52569)
 2.4   2.4   1.6    Data.Attoparsec.Internal.Types plus.\ (121358)
  .8    .8    .6      Data.Attoparsec.Internal.Types plus.\.lose' (54707)
  .3    .3    .3        Text.Regex.KDE.Compile pRegexCharClass (0)

 1.9   1.9   1.9  Skylighting.Syntax.Php syntax (1)

 3.7   1.8   1.7  Text.Regex.KDE.Compile pRegexPart (2860)
 1.8    .2    .3    Text.Regex.KDE.Compile pRegexChar (2860)
 1.3     -     -      Data.Attoparsec.Internal.Types plus (11440)
  .3     -     -      Text.Regex.KDE.Compile pRegexAsciiChar (0)
  .3     -     -        Data.Attoparsec.Internal.Types fmap (0)
  .3     -     -    Text.Regex.KDE.Compile withGroupModifiers (0)
  .3     -     -      Text.Regex.KDE.Compile char (0)

 2.0   1.7     -  Data.Attoparsec.Internal.Types fmap (8365)
 2.0   1.7   1.1    Data.Attoparsec.Internal.Types fmap.\ (57527)
  .6    .6    .6      Data.Attoparsec.Internal.Types fmap.\.succ' (4518)
  .3     -     -      Text.Regex.KDE.Compile pRegexPart (0)

jgm commented 3 years ago

I'm releasing skylighting 0.9, with no C dependencies. Once pandoc depends on this, we'll have checked off the main three C dependencies listed above. It would be worth checking more carefully to see if there are others downstream.

ickc commented 3 years ago

Hi, just trying to clarify about the YAML situation in pandoc.

First a short summary,

a commit that switch from yaml to HsYAML: Use HsYAML instead of yaml for translations, YAML metadata. · jgm/pandoc@e49b830
Haskell lib info:
- yaml: Support for parsing and rendering YAML documents.
- HsYAML: Pure Haskell YAML 1.2 processor
Note on YAML verison:
- yaml has YAML 1.1 spec: 2005-01-18: YAML Ain’t Markup Language (YAML™) Version 1.1
- HsYAML has YAML 1.2 spec: 2009-10-01: YAML Ain’t Markup Language (YAML™) Version 1.2
to yaml or HsYAML?
- 2020-2-13: https://github.com/jgm/pandoc/issues/6084#issuecomment-585838371 seem to indicate @jgm want to go back from HsYAML to yaml
- 2020-7-30: https://github.com/jgm/pandoc/issues/4535#issuecomment-666948352 seem to indicate @jgm want to stick to the plan of removing non-Haskell dep
- (edit: add from @jgm's reply below) 2020-12-18: https://github.com/jgm/pandoc/issues/4535#issuecomment-748194457 indicate @jgm is still undecided
Note that this choice has a few implications
- performance (but I'm sure if HsYAML is the cause?): https://github.com/jgm/pandoc/issues/6084#issuecomment-748017482
- it indirectly choose the YAML spec, which will affect downstream, e.g.
  - https://github.com/rstudio/rmarkdown/issues/1474
  - the yaml lib that we may want to adopt in Python working with pandoc

So the question is which direction are we going? HsYAML or yaml? YAML 1.1 or 1.2?

Thanks.

Edit: add reply from below to summary, reformat summary a bit.

jgm commented 3 years ago

Answer: I don't know. I like avoiding the C dependency. I don't know how much the performance issue is affecting users. For now, we stay with HsYAML. But I'd be happier with that if upstream were more responsive in looking at the performance issue.

ickc commented 3 years ago

Running the command in #6084, pandoc 2.7.3 vs pandoc 2.11.3 differs from 1.42s to 16.84s on my computer. So the ratio is like a factor of dozen. So I don't know if the gap is big enough to make a difference—how close to C performance can we expect? Even if they can reduce it by a factor of 2, it would still be ~8s which to a human is not that discernible.

Also, I think it is still roughly linear time (https://github.com/jgm/pandoc/issues/6084#issuecomment-748017482), so is the problem is really that problematic?

Lastly the example over there probably is a practical upper bound (how big do we expect people injecting bibliography?) Less than a minute is still quite good in worst case scenario.

Another idea: could we add a JSON reader in citeproc and ask people for large bibliography to inject JSON instead? Then people can preprocess YAML to JSON, and inject the JSON to pandoc/citeproc. (With the assumption that reading JSON should be much faster.)

jgm commented 3 years ago

Yes, it's linear. But the difference between 25 seconds and 5 seconds is a significant one, probably enough to discourage anyone using a YAML bibliography that size. Maybe that's okay. They can always use CSL JSON (already possible), which is pretty fast. Bibtex will have performance intermediate between the two.

ickc commented 3 years ago

I’d say a factor of 5 between Haskell and C is pretty good. Pandoc vs the fastest markdown parser probably has a much greater ratio.

But on the other hand, the scenario to have a very big bibliography (where the user might only cite a few from it) disproportional to the length of what they may have in markdown seems more probable (comparing to having a huge markdown.)

tarleb commented 1 year ago

We should be much closer now that we've separated out Lua. Is there something left?

Edit: I forgot that we switched back from HsYAML to yaml, so I guess that's the last roadblock now.

ickc commented 1 year ago

On top of my head, probably yaml still is a C dependency because of performance issue.

jgm / pandoc

Removing non-Haskell dependencies (C dependencies) #4535