jgm / pandoc

Universal markup converter
https://pandoc.org

Split off core package #6215

Open tarleb opened 4 years ago

tarleb commented 4 years ago

I'm wondering whether it would make sense to split off parts of pandoc into a separate pandoc-core package. This would make it easier to move other parts into separate packages as well.

My motivation here is the Lua system. It is growing quite large, but, with the exception of the pandoc.read function, is built only on a small part of pandoc the library. The pandoc core (i.e., T.P.Class etc) as well as the Lua system are relatively stable, so the overhead of having additional packages to maintain seems acceptable.

In a similar vein: while writing jira-wiki-markup, I would have liked to have a pandoc-parsing library. Depending on such a library would make it easy to ensure that the library uses the same parser as pandoc. It could include some of the fixes and convenience functions available in Text.Pandoc.Parsing.

jgm commented 4 years ago

I'm open to exploring this, but what exactly would you conceive as being the core modules?

tarleb commented 4 years ago

I'd consider everything required to define PandocMonad as "core", so the modules

plus the function uriPathToPath from T.P.Shared.

Additionally maybe Emoji, UUID, and XML, although those would increase the dependency footprint.

jgm commented 4 years ago

I'm still not sure I understand the motivation. This would allow creation of a package pandoc-lua with the lua system. This package would depend on pandoc-core. But what would depend on this package, besides pandoc itself? If the answer is nothing, then I'm not sure it's worth the hassle of splitting.

tarleb commented 4 years ago

A main motivation for me is compile time: e.g., switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules. Finding a way to reduce compile times would remove a huge bottleneck from my workflow. Splitting off smaller modules seems like a good option to achieve this, and should also reduce the frequency with which I'd have to switch branches.

jgm commented 4 years ago

switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules

One way to deal with this kind of thing is to clone the branch in a separate directory.

alerque commented 4 years ago

switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules

One way to deal with this kind of thing is to clone the branch in a separate directory.

Along these lines git worktree is quite useful for this too because it shares the object store with the main repo and so is fast and easy on the file system.

Also (and this is a bit for advanced foo) it is possible to stage and commit patches against branches that are not currently checked out at all. If you see little fixups that need committing somewhere other than the branch you are on, there are ways to make the changes but commit them to a different branch. A poor-man's way to do this is just to have fun with stashes, but there are also git tools to actually patch branches without checking them out.

Not directly related to the above but also very useful for keeping things like rebases from causing rebuilds, git revise is great for editing earlier commits without touching the file system and hence triggering rebuilds.

tarleb commented 4 years ago

There is also ghcid, which I found very convenient and easy to use in other projects, but so far I haven't really been able to use it with pandoc. This is mostly due to the size of the library and tests. E.g., I haven't yet figured out how to restrict the number of tests to run.

Thanks for the hints @alerque. I grew lazy and usually just use Emacs with magit for most git tasks, but I'll check out the things you mentioned.

jgm commented 4 years ago

See https://www.reddit.com/r/haskell/comments/fz3s2y/hakyll_status/ which notes that pandoc takes a lot of memory to compile. It's possible that splitting pandoc would help with this. On the other hand, this would make things less convenient for developers in many cases.

jgm commented 4 years ago

I guess part of the idea here would be to split off the lua system into a separate package, depending on pandoc-core (or whatever it is called)?

jgm commented 4 years ago

I'm warming to this proposal. I'm wondering whether pandoc-core is the right name, though. One might expect that to include things like Shared and Parsing -- things you need to write a reader or writer. Maybe everything except the readers and writers themselves, App, PDF, and SelfContained?

jgm commented 4 years ago

Maybe

pandoc

(Prelude) Text.Pandoc Text.Pandoc.App Text.Pandoc.App.CommandLineOptions Text.Pandoc.App.FormatHeuristics Text.Pandoc.App.Opt Text.Pandoc.App.OutputSettings Text.Pandoc.Highlighting Text.Pandoc.PDF Text.Pandoc.RoffChar Text.Pandoc.Readers Text.Pandoc.Readers.HTML Text.Pandoc.Readers.LaTeX Text.Pandoc.Readers.LaTeX.Types Text.Pandoc.Readers.Markdown Text.Pandoc.Readers.CommonMark Text.Pandoc.Readers.Creole Text.Pandoc.Readers.MediaWiki Text.Pandoc.Readers.Vimwiki Text.Pandoc.Readers.RST Text.Pandoc.Readers.Org Text.Pandoc.Readers.DocBook Text.Pandoc.Readers.JATS Text.Pandoc.Readers.Jira Text.Pandoc.Readers.OPML Text.Pandoc.Readers.Textile Text.Pandoc.Readers.Native Text.Pandoc.Readers.Haddock Text.Pandoc.Readers.TWiki Text.Pandoc.Readers.TikiWiki Text.Pandoc.Readers.Txt2Tags Text.Pandoc.Readers.Docx Text.Pandoc.Readers.Odt Text.Pandoc.Readers.EPUB Text.Pandoc.Readers.Muse Text.Pandoc.Readers.Man Text.Pandoc.Readers.FB2 Text.Pandoc.Readers.DokuWiki Text.Pandoc.Readers.Ipynb Text.Pandoc.Readers.CSV Text.Pandoc.Readers.Docx.Lists Text.Pandoc.Readers.Docx.Combine Text.Pandoc.Readers.Docx.Parse Text.Pandoc.Readers.Docx.Parse.Styles Text.Pandoc.Readers.Docx.Util Text.Pandoc.Readers.Docx.Fields Text.Pandoc.Readers.LaTeX.Parsing Text.Pandoc.Readers.LaTeX.Lang Text.Pandoc.Readers.Odt.Base Text.Pandoc.Readers.Odt.Namespaces Text.Pandoc.Readers.Odt.StyleReader Text.Pandoc.Readers.Odt.ContentReader Text.Pandoc.Readers.Odt.Generic.Fallible Text.Pandoc.Readers.Odt.Generic.SetMap Text.Pandoc.Readers.Odt.Generic.Utils Text.Pandoc.Readers.Odt.Generic.Namespaces Text.Pandoc.Readers.Odt.Generic.XMLConverter Text.Pandoc.Readers.Odt.Arrows.State Text.Pandoc.Readers.Odt.Arrows.Utils Text.Pandoc.Readers.Org.BlockStarts Text.Pandoc.Readers.Org.Blocks Text.Pandoc.Readers.Org.DocumentTree Text.Pandoc.Readers.Org.ExportSettings Text.Pandoc.Readers.Org.Inlines Text.Pandoc.Readers.Org.Meta Text.Pandoc.Readers.Org.ParserState Text.Pandoc.Readers.Org.Parsing Text.Pandoc.Readers.Org.Shared 
Text.Pandoc.Readers.Metadata Text.Pandoc.Readers.Roff Text.Pandoc.Writers.Docx.StyleMap Text.Pandoc.Writers.Roff Text.Pandoc.Writers.Powerpoint.Presentation Text.Pandoc.Writers.Powerpoint.Output Text.Pandoc.Writers Text.Pandoc.Writers.Native Text.Pandoc.Writers.Docbook Text.Pandoc.Writers.JATS Text.Pandoc.Writers.OPML Text.Pandoc.Writers.HTML Text.Pandoc.Writers.Ipynb Text.Pandoc.Writers.ICML Text.Pandoc.Writers.Jira Text.Pandoc.Writers.LaTeX Text.Pandoc.Writers.ConTeXt Text.Pandoc.Writers.OpenDocument Text.Pandoc.Writers.Texinfo Text.Pandoc.Writers.Man Text.Pandoc.Writers.Ms Text.Pandoc.Writers.Markdown Text.Pandoc.Writers.CommonMark Text.Pandoc.Writers.Haddock Text.Pandoc.Writers.RST Text.Pandoc.Writers.Org Text.Pandoc.Writers.AsciiDoc Text.Pandoc.Writers.Custom Text.Pandoc.Writers.Textile Text.Pandoc.Writers.MediaWiki Text.Pandoc.Writers.DokuWiki Text.Pandoc.Writers.XWiki Text.Pandoc.Writers.ZimWiki Text.Pandoc.Writers.RTF Text.Pandoc.Writers.ODT Text.Pandoc.Writers.Docx Text.Pandoc.Writers.Powerpoint Text.Pandoc.Writers.EPUB Text.Pandoc.Writers.FB2 Text.Pandoc.Writers.TEI Text.Pandoc.Writers.Muse Text.Pandoc.Writers.OOXML

pandoc-core

(Prelude) Text.Pandoc.Options Text.Pandoc.Extensions Text.Pandoc.Shared Text.Pandoc.MediaBag Text.Pandoc.Error Text.Pandoc.Filter Text.Pandoc.UTF8 Text.Pandoc.Templates Text.Pandoc.XML Text.Pandoc.SelfContained Text.Pandoc.Logging Text.Pandoc.Process Text.Pandoc.MIME Text.Pandoc.Parsing Text.Pandoc.Asciify Text.Pandoc.Emoji Text.Pandoc.ImageSize Text.Pandoc.BCP47 Text.Pandoc.Class Text.Pandoc.Class.CommonState Text.Pandoc.Class.PandocMonad Text.Pandoc.Class.PandocIO Text.Pandoc.Class.PandocPure Text.Pandoc.Filter.JSON Text.Pandoc.Filter.Lua Text.Pandoc.Filter.Path Text.Pandoc.CSS Text.Pandoc.CSV Text.Pandoc.UUID Text.Pandoc.Translations Text.Pandoc.Slides Text.Pandoc.Image Text.Pandoc.Writers.Math Text.Pandoc.Writers.Shared

pandoc-lua

(Prelude) Text.Pandoc.Lua Text.Pandoc.Lua.Filter Text.Pandoc.Lua.Global Text.Pandoc.Lua.Init Text.Pandoc.Lua.Marshaling Text.Pandoc.Lua.Marshaling.AST Text.Pandoc.Lua.Marshaling.AnyValue Text.Pandoc.Lua.Marshaling.CommonState Text.Pandoc.Lua.Marshaling.Context Text.Pandoc.Lua.Marshaling.List Text.Pandoc.Lua.Marshaling.MediaBag Text.Pandoc.Lua.Marshaling.ReaderOptions Text.Pandoc.Lua.Marshaling.Version Text.Pandoc.Lua.Module.MediaBag Text.Pandoc.Lua.Module.Pandoc Text.Pandoc.Lua.Module.System Text.Pandoc.Lua.Module.Types Text.Pandoc.Lua.Module.Utils Text.Pandoc.Lua.Packages Text.Pandoc.Lua.Util Text.Pandoc.Lua.Walk

jgm commented 4 years ago

I'd like to get the table changes merged first, though, before messing with this.

jgm commented 4 years ago

One complication: PandocMonad depends on Text.Pandoc.Data (dataFiles) when embed_data_files is turned on. That means that Data, and all the data files, would have to go in core. This seems conceptually wrong to me. The templates, for example, naturally go with pandoc, not pandoc-core. And some of the data files are things like the pandoc manual itself. I don't see a very clean solution to this.

jgm commented 4 years ago

Actually there is a clean solution. We could store a field for dataFiles in the CommonState of PandocMonad. (A bit tricky though because this means that anyone using pandoc as a library will have to remember to set this field in CommonState before running readers/writers....)

tarleb commented 4 years ago

This sounds really nice. We could keep the new packages in the same repo as the main app in the beginning, which should minimize friction (and preserve the git history).

Remaining problems: there probably needs to be a mechanism to decouple Text.Pandoc.Filter.Lua from T.P.Lua, or that module cannot be in pandoc-core. Also, the Lua module must be changed such that functions like getReaders can be injected, or we'd run into a dependency loop.
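
A minimal sketch of what injecting the reader lookup could look like. Everything here is a simplified, hypothetical stand-in (the `Reader` type, `LuaEnv`, `envGetReader`, `luaRead`, and the toy reader table are assumptions for illustration, not pandoc's actual API). The point is only that the Lua side takes the lookup function as a value, so it never has to import the package that defines the readers:

```haskell
import Data.Char (toUpper)

-- Hypothetical, simplified reader type: input text to a "document" or an error.
type Reader = String -> Either String String

-- Lua-side package: it only knows that *some* lookup function was handed in.
newtype LuaEnv = LuaEnv { envGetReader :: String -> Maybe Reader }

-- Stand-in for a function like pandoc.read; it uses the injected lookup
-- instead of importing the reader modules directly.
luaRead :: LuaEnv -> String -> String -> Either String String
luaRead env format input =
  case envGetReader env format of
    Nothing     -> Left ("unknown reader: " ++ format)
    Just reader -> reader input

-- Application-side package: owns the concrete readers (toy ones here).
readers :: [(String, Reader)]
readers =
  [ ("plain", Right)
  , ("shout", Right . map toUpper)
  ]

-- The application builds the environment and passes it to the Lua side.
defaultLuaEnv :: LuaEnv
defaultLuaEnv = LuaEnv { envGetReader = \name -> lookup name readers }
```

With this shape, the dependency points only one way: the package holding `readers` depends on the package holding `luaRead`, not the reverse.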

jgm commented 4 years ago

Can you look into those remaining issues to see if you can find a solution? I don't want to mess with these changes if it's not going to work in the end. Multiple packages in the same repo is the way to go, I think, now that the tooling supports this well -- we might even think about bringing in pandoc-types eventually.

jgm commented 4 years ago

Btw, it wouldn't be disastrous if Text.Pandoc.Filter had to go in pandoc rather than pandoc-core, because of the lua dependency. I'm more worried about potential circular dependencies in the lua stuff. E.g. I notice that Lua.Module.Utils imports T.P.Filter.JSON. I guess we could have T.P.Filter.JSON in core and the rest of the filter stuff in pandoc, though.

tarleb commented 4 years ago

Yes, I'll look into it.

I guess it should be ok to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua code via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.

jgm commented 4 years ago

I fooled around a bit with the idea mentioned above for data files. I made T.P.Data an exported module, exporting initializeDataFiles, which initializes stDataFiles in common state with the baked in data. Problem is, you need to remember to run this every time you run a PandocMonad, and that's fragile. Maybe we'll need to provide wrappers for runIOEither and runIOorExplode in the pandoc package, which ensure that this initialization step is always done?
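
To make the fragility concrete, here is a rough sketch under heavy simplification. The types below are toy stand-ins, not pandoc's real `CommonState` or `PandocMonad`; `Action`, `runCore`, `runWithData`, and the sample file table are hypothetical names, and `initializeDataFiles` is modelled as a pure function on the state rather than a monadic action:

```haskell
-- Hypothetical, simplified stand-in for the common state.
newtype CommonState = CommonState
  { stDataFiles :: [(FilePath, String)]  -- data files available to actions
  }

-- Toy replacement for a PandocMonad action: state in, result or error out.
newtype Action a = Action { runAction :: CommonState -> Either String a }

-- Core-side lookup: it can only see what was put into the state.
readDataFile :: FilePath -> Action String
readDataFile fp = Action $ \st ->
  maybe (Left ("data file not found: " ++ fp)) Right (lookup fp (stDataFiles st))

-- Core-side runner: starts from a state without any data files.
emptyState :: CommonState
emptyState = CommonState { stDataFiles = [] }

runCore :: Action a -> Either String a
runCore act = runAction act emptyState

-- Application-side: the baked-in files and the initialization step.
bakedInDataFiles :: [(FilePath, String)]
bakedInDataFiles = [("templates/default.plain", "$body$")]

initializeDataFiles :: CommonState -> CommonState
initializeDataFiles st = st { stDataFiles = bakedInDataFiles }

-- Wrapper runner: callers cannot forget the initialization step.
runWithData :: Action a -> Either String a
runWithData act = runAction act (initializeDataFiles emptyState)
```

In this toy model, `runCore (readDataFile "templates/default.plain")` fails with a "not found" error, while the same action under `runWithData` succeeds. That is the difference a wrapper in the pandoc package would be meant to hide from library users.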

jgm commented 4 years ago

I guess it should be ok to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua code via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.

Correct.

jgm commented 4 years ago

I just pushed an initialize-data-files branch which contains my idea for decoupling data files from pandoc-core. It's a bit awkward because you can't forget to add the initializeDataFiles when you run a PandocMonad instance. But it seems to work.

mb21 commented 4 years ago

I haven't been following this closely, so sorry if I'm missing something, but a few thoughts:

tarleb commented 4 years ago

  • so the main motivation is reducing compile-time? either when working on the lua subsystem, or when compiling normal pandoc? (using package-level build cache, or...?)

At least for me, that's the primary motivation. I also like the idea of having additional clear delimitations in the code-base, and it serves as a motivation to untangle the dependency graph (esp. with regard to T.P.Lua).

  • a potential downside would be that making refactorings that require changes to several packages becomes a lot harder, like we're already seeing with pandoc-types? Or is the idea to keep it in one git repository? a monorepo?

My understanding is that we are indeed aiming for a monorepo.

  • naming: I think pandoc should be the unambiguous name for the whole combined codebase... (well, we already have pandoc and pandoc-types, but whatever). But maybe pandoc-app, or pandoc-readers-writers instead?

This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.

jgm commented 4 years ago

Reducing the compile time for developers, and reducing the total memory required to compile pandoc (which is getting ever larger and has made it hard to build pandoc on some systems) are both motivations.

It's true that this would make development a bit more complicated, and that would have to be weighed heavily. I'm developing commonmark-hs this way, as four packages in one repository (also skylighting, skylighting-core), so I have some experience with it. It's not too bad, but you have to think about things like version numbers (if you follow the versioning policy for each package, then they will get out of sync, and this might lead to confusion; in skylighting we simply force them to be in sync but this isn't ideal).

I'm still not sure about the idea, in the end. I don't like the approach in my initialize-data-files branch and now I'm leaning towards thinking that maybe Text.Pandoc.Data and all the data files should, after all, go in core. This is ugly, because if you modify a writer and a template, for example, you'd have to modify two packages. But it's also ugly to require a special initializeDataFiles command every time you do runIO or runPure.

mb21 commented 4 years ago

Yeah, if it's done as a monorepo, I think that could work... feels like as a developer, when you build master, you would just want to use the code that's currently on master as well for the other packages. But then you lose the ability to use the cache?

About how to split it, definitely the case of using pandoc the library vs the application should be part of that decision I think...

jgm commented 4 years ago

This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.

Nothing stops you from generating your own freeze files and using them for the docker images.

tarleb commented 4 years ago

This proposal has more implications than I assumed. The potential benefits would likely not match the investments, so I'm closing this. It is probably a good idea to untangle Lua dependencies regardless.

Nothing stops you from generating your own freeze files and using them for the docker images.

That is true, I'll do that. Other projects are doing the same, Alpine for example. There might still be value in making it more likely for all binaries out there to use the same dependencies.

jgm commented 4 years ago

I wouldn't mind keeping this open; it still seems possibly worth doing, I just can't decide. (Unless you have new insights not mentioned above.)

tarleb commented 4 years ago

The only insight not mentioned is that I'll need to refactor HsLua as a prerequisite to untangle and improve T.P.Lua. Refactoring will probably take quite a while, and I didn't want to leave a stale issue hanging around. I'll happily bring the topic up again once I'm confident that it could be completed in a predictable time-frame.

jgm commented 4 years ago

Might as well leave this open, though, since it contains some useful notes on what would be required.

tarleb commented 3 years ago

I'm happy to report that we have funding which allows me to spend a few days on this, probably around mid March. Thanks @arfon!

Also, restructuring of HsLua is underway. So far I'm spinning off various smaller packages. This allows me to gain some experience with monorepos. More involved updates to the HsLua packages, which should help with all this, are in progress, too.

tarleb commented 2 years ago

Closing, as the Lua engine and CLI program are now separate packages. The pandoc-core idea appears too troublesome for the limited advantages we'd get from it; I'm not pursuing it any further (see also #8340).

jgm commented 2 years ago

Note: #8348 opens up a path to extracting pandoc-class or pandoc-monad (T.P.Class hierarchy) without including all the data files. Not sure whether there's a point to this.

I don't yet see how we can cleanly extract pandoc-parsing, because it depends on the latex reader's types (via HasMacros). (Well, maybe this wouldn't be so bad, as it's just the types. We could even envision bringing them outside of the T.P.Readers.LaTeX tree.)

I'm going to reopen this just so we can keep thinking about it.

sullyj3 commented 6 months ago

It's worth noting that cabal now supports multiple "public sub-libraries" in a single package. From the docs:

Being able to include more than one public library in a package allows the separation of the unit of distribution (the package) from the unit of buildable code (the library). This is useful for Haskell projects with many libraries that are distributed together as it avoids duplication and potential inconsistencies.
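
For illustration, a package description using this feature might look roughly like the fragment below. The package name, module names, and directory layout are made up for the example; the parts that come from cabal itself are the named `library` stanza, the `visibility: public` field, and the `package:sublibrary` form in `build-depends`, which need `cabal-version: 3.0` or later:

```cabal
cabal-version: 3.0
name:          example-doc
version:       0.1.0.0
build-type:    Simple

-- Main library: what users get when they depend on plain "example-doc".
library
  exposed-modules:  Example.Doc
  hs-source-dirs:   src
  build-depends:    base, example-doc:core
  default-language: Haskell2010

-- Public sub-library: other packages can depend on it as "example-doc:core"
-- without pulling in the main library.
library core
  visibility:       public
  exposed-modules:  Example.Doc.Class
  hs-source-dirs:   core
  build-depends:    base
  default-language: Haskell2010
```

Both libraries share one name and one version on Hackage, which would sidestep the question of keeping version numbers in sync across several packages.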

jgm commented 6 months ago

This is interesting. Are there any Hackage projects that define multiple (public) libraries in a single cabal file?

sullyj3 commented 6 months ago

I'm not aware of any. The feature's pretty new.