Open tarleb opened 4 years ago
I'm open to exploring this, but what exactly would you conceive as being the core modules?
I'd consider everything required to define PandocMonad as "core", so the modules plus the function uriPathToPath from T.P.Shared.
Additionally maybe Emoji, UUID, and XML, although those would increase the dependency footprint.
I'm still not sure I understand the motivation. This would allow creation of a package pandoc-lua with the lua system. This package would depend on pandoc-core. But what would depend on this package, besides pandoc itself? If the answer is nothing, then I'm not sure it's worth the hassle of splitting.
A main motivation for me is compile time: e.g., switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules. Finding a way to reduce compile times would remove a huge bottleneck from my workflow. Splitting off smaller packages seems like a good option to achieve this, and should also reduce the frequency with which I'd have to switch branches.
> switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules
One way to deal with this kind of thing is to clone the branch in a separate directory.
Along these lines, git worktree is quite useful for this too, because it shares the object store with the main repo and so is fast and easy on the file system.
Also (and this is a bit of advanced foo) it is possible to stage and commit patches against branches that are not currently checked out at all. If you see little fixups that need committing somewhere other than the branch you are on, there are ways to make the changes but commit them to a different branch. A poor man's way to do this is just to have fun with stashes, but there are also git tools to actually patch branches without checking them out.
Not directly related to the above, but also very useful for keeping things like rebases from causing rebuilds: git revise is great for editing earlier commits without touching the file system and hence without triggering rebuilds.
There is also ghcid, which I found very convenient and easy to use in other projects, but so far I haven't really been able to use it with pandoc. This is mostly due to the size of the library and tests. E.g., I haven't yet figured out how to restrict the number of tests to run.
Thanks for the hints @alerque. I grew lazy and usually just use Emacs with magit for most git tasks, but I'll check out the things you mentioned.
See https://www.reddit.com/r/haskell/comments/fz3s2y/hakyll_status/ which notes that pandoc takes a lot of memory to compile. It's possible that splitting pandoc would help with this. On the other hand, this would make things less convenient for developers in many cases.
I guess part of the idea here would be to split off the lua system into a separate package, depending on pandoc-core (or whatever it is called)?
I'm warming to this proposal. I'm wondering whether pandoc-core is the right name, though. One might expect that to include things like Shared and Parsing -- things you need to write a reader or writer. Maybe everything except the readers and writers themselves, App, PDF, and SelfContained?
Maybe
(Prelude) Text.Pandoc Text.Pandoc.App Text.Pandoc.App.CommandLineOptions Text.Pandoc.App.FormatHeuristics Text.Pandoc.App.Opt Text.Pandoc.App.OutputSettings Text.Pandoc.Highlighting Text.Pandoc.PDF Text.Pandoc.RoffChar Text.Pandoc.Readers Text.Pandoc.Readers.HTML Text.Pandoc.Readers.LaTeX Text.Pandoc.Readers.LaTeX.Types Text.Pandoc.Readers.Markdown Text.Pandoc.Readers.CommonMark Text.Pandoc.Readers.Creole Text.Pandoc.Readers.MediaWiki Text.Pandoc.Readers.Vimwiki Text.Pandoc.Readers.RST Text.Pandoc.Readers.Org Text.Pandoc.Readers.DocBook Text.Pandoc.Readers.JATS Text.Pandoc.Readers.Jira Text.Pandoc.Readers.OPML Text.Pandoc.Readers.Textile Text.Pandoc.Readers.Native Text.Pandoc.Readers.Haddock Text.Pandoc.Readers.TWiki Text.Pandoc.Readers.TikiWiki Text.Pandoc.Readers.Txt2Tags Text.Pandoc.Readers.Docx Text.Pandoc.Readers.Odt Text.Pandoc.Readers.EPUB Text.Pandoc.Readers.Muse Text.Pandoc.Readers.Man Text.Pandoc.Readers.FB2 Text.Pandoc.Readers.DokuWiki Text.Pandoc.Readers.Ipynb Text.Pandoc.Readers.CSV Text.Pandoc.Readers.Docx.Lists Text.Pandoc.Readers.Docx.Combine Text.Pandoc.Readers.Docx.Parse Text.Pandoc.Readers.Docx.Parse.Styles Text.Pandoc.Readers.Docx.Util Text.Pandoc.Readers.Docx.Fields Text.Pandoc.Readers.LaTeX.Parsing Text.Pandoc.Readers.LaTeX.Lang Text.Pandoc.Readers.Odt.Base Text.Pandoc.Readers.Odt.Namespaces Text.Pandoc.Readers.Odt.StyleReader Text.Pandoc.Readers.Odt.ContentReader Text.Pandoc.Readers.Odt.Generic.Fallible Text.Pandoc.Readers.Odt.Generic.SetMap Text.Pandoc.Readers.Odt.Generic.Utils Text.Pandoc.Readers.Odt.Generic.Namespaces Text.Pandoc.Readers.Odt.Generic.XMLConverter Text.Pandoc.Readers.Odt.Arrows.State Text.Pandoc.Readers.Odt.Arrows.Utils Text.Pandoc.Readers.Org.BlockStarts Text.Pandoc.Readers.Org.Blocks Text.Pandoc.Readers.Org.DocumentTree Text.Pandoc.Readers.Org.ExportSettings Text.Pandoc.Readers.Org.Inlines Text.Pandoc.Readers.Org.Meta Text.Pandoc.Readers.Org.ParserState Text.Pandoc.Readers.Org.Parsing Text.Pandoc.Readers.Org.Shared 
Text.Pandoc.Readers.Metadata Text.Pandoc.Readers.Roff Text.Pandoc.Writers.Docx.StyleMap Text.Pandoc.Writers.Roff Text.Pandoc.Writers.Powerpoint.Presentation Text.Pandoc.Writers.Powerpoint.Output Text.Pandoc.Writers Text.Pandoc.Writers.Native Text.Pandoc.Writers.Docbook Text.Pandoc.Writers.JATS Text.Pandoc.Writers.OPML Text.Pandoc.Writers.HTML Text.Pandoc.Writers.Ipynb Text.Pandoc.Writers.ICML Text.Pandoc.Writers.Jira Text.Pandoc.Writers.LaTeX Text.Pandoc.Writers.ConTeXt Text.Pandoc.Writers.OpenDocument Text.Pandoc.Writers.Texinfo Text.Pandoc.Writers.Man Text.Pandoc.Writers.Ms Text.Pandoc.Writers.Markdown Text.Pandoc.Writers.CommonMark Text.Pandoc.Writers.Haddock Text.Pandoc.Writers.RST Text.Pandoc.Writers.Org Text.Pandoc.Writers.AsciiDoc Text.Pandoc.Writers.Custom Text.Pandoc.Writers.Textile Text.Pandoc.Writers.MediaWiki Text.Pandoc.Writers.DokuWiki Text.Pandoc.Writers.XWiki Text.Pandoc.Writers.ZimWiki Text.Pandoc.Writers.RTF Text.Pandoc.Writers.ODT Text.Pandoc.Writers.Docx Text.Pandoc.Writers.Powerpoint Text.Pandoc.Writers.EPUB Text.Pandoc.Writers.FB2 Text.Pandoc.Writers.TEI Text.Pandoc.Writers.Muse Text.Pandoc.Writers.OOXML
(Prelude) Text.Pandoc.Options Text.Pandoc.Extensions Text.Pandoc.Shared Text.Pandoc.MediaBag Text.Pandoc.Error Text.Pandoc.Filter Text.Pandoc.UTF8 Text.Pandoc.Templates Text.Pandoc.XML Text.Pandoc.SelfContained Text.Pandoc.Logging Text.Pandoc.Process Text.Pandoc.MIME Text.Pandoc.Parsing Text.Pandoc.Asciify Text.Pandoc.Emoji Text.Pandoc.ImageSize Text.Pandoc.BCP47 Text.Pandoc.Class Text.Pandoc.Class.CommonState Text.Pandoc.Class.PandocMonad Text.Pandoc.Class.PandocIO Text.Pandoc.Class.PandocPure Text.Pandoc.Filter.JSON Text.Pandoc.Filter.Lua Text.Pandoc.Filter.Path Text.Pandoc.CSS Text.Pandoc.CSV Text.Pandoc.UUID Text.Pandoc.Translations Text.Pandoc.Slides Text.Pandoc.Image Text.Pandoc.Writers.Math Text.Pandoc.Writers.Shared
(Prelude) Text.Pandoc.Lua Text.Pandoc.Lua.Filter Text.Pandoc.Lua.Global Text.Pandoc.Lua.Init Text.Pandoc.Lua.Marshaling Text.Pandoc.Lua.Marshaling.AST Text.Pandoc.Lua.Marshaling.AnyValue Text.Pandoc.Lua.Marshaling.CommonState Text.Pandoc.Lua.Marshaling.Context Text.Pandoc.Lua.Marshaling.List Text.Pandoc.Lua.Marshaling.MediaBag Text.Pandoc.Lua.Marshaling.ReaderOptions Text.Pandoc.Lua.Marshaling.Version Text.Pandoc.Lua.Module.MediaBag Text.Pandoc.Lua.Module.Pandoc Text.Pandoc.Lua.Module.System Text.Pandoc.Lua.Module.Types Text.Pandoc.Lua.Module.Utils Text.Pandoc.Lua.Packages Text.Pandoc.Lua.Util Text.Pandoc.Lua.Walk
I'd like to get the table changes merged first, though, before messing with this.
One complication: PandocMonad depends on Text.Pandoc.Data (dataFiles) when embed_data_files is turned on. That means that Data, and all the data files, would have to go in core. This seems conceptually wrong to me. The templates, for example, naturally go with pandoc, not pandoc-core. And some of the data files are things like the pandoc manual itself. I don't see a very clean solution to this.
Actually there is a clean solution. We could store a field for dataFiles in the CommonState of PandocMonad. (A bit tricky though, because this means that anyone using pandoc as a library will have to remember to set this field in CommonState before running readers/writers....)
This sounds really nice. We could keep the new packages in the same repo as the main app in the beginning, which should minimize friction (and preserve the git history).
Remaining problems: there probably needs to be a mechanism to decouple Text.Pandoc.Filter.Lua from T.P.Lua, or that module cannot be in pandoc-core. Also, the Lua module must be changed such that functions such as getReaders can be injected, or we'd run into a dependency loop.
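To illustrate what "injecting" getReaders could look like, here is a minimal sketch. It is a toy model only: LuaEnv, envGetReader, luaRead, and the Format, Doc, and Reader types are hypothetical names made up for this example, not pandoc's or HsLua's actual API. The idea is that the Lua package receives a reader lookup function from its caller instead of importing the reader modules itself, so it never has to depend on the package that defines them.

```haskell
module Main where

-- Hypothetical stand-ins; none of these are pandoc's real types.
type Format = String
type Doc    = [String]                    -- toy document: a list of lines
type Reader = String -> Either String Doc -- toy reader: text in, Doc out

-- Environment handed to the Lua package by its caller. The Lua package
-- only knows this record, not the concrete reader modules.
newtype LuaEnv = LuaEnv
  { envGetReader :: Format -> Maybe Reader }

-- Would live in the (hypothetical) Lua package: no reader imports needed.
luaRead :: LuaEnv -> Format -> String -> Either String Doc
luaRead env fmt input =
  case envGetReader env fmt of
    Nothing     -> Left ("unknown format: " ++ fmt)
    Just reader -> reader input

-- Would live in the top-level package, which can see every reader.
readers :: [(Format, Reader)]
readers = [ ("lines", Right . lines) ]

-- The top-level package builds the environment and injects it.
defaultEnv :: LuaEnv
defaultEnv = LuaEnv { envGetReader = \fmt -> lookup fmt readers }

main :: IO ()
main = do
  print (luaRead defaultEnv "lines" "a\nb")  -- Right ["a","b"]
  print (luaRead defaultEnv "docx" "")       -- Left "unknown format: docx"
```

With this shape the dependency arrow points one way only: the top-level package depends on the Lua package and supplies the readers, while the Lua package depends on nothing but the record type.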
Can you look into those remaining issues to see if you can find a solution? I don't want to mess with these changes if it's not going to work in the end. Multiple packages in the same repo is the way to go, I think, now that the tooling supports this well -- we might even think about bringing in pandoc-types eventually.
Btw, it wouldn't be disastrous if Text.Pandoc.Filter had to go in pandoc rather than pandoc-core, because of the lua dependency. I'm more worried about potential circular dependencies in the lua stuff. E.g. I notice that Lua.Module.Utils imports T.P.Filter.JSON. I guess we could have T.P.Filter.JSON in core and the rest of the filter stuff in pandoc, though.
Yes, I'll look into it.
I guess it should be ok to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.
I fooled around a bit with the idea mentioned above for data files. I made T.P.Data an exported module, exporting initializeDataFiles, which initializes stDataFiles in common state with the baked-in data. Problem is, you need to remember to run this every time you run a PandocMonad, and that's fragile. Maybe we'll need to provide wrappers for runIOEither and runIOorExplode in the pandoc package, which ensure that this initialization step is always done?
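A rough sketch of the wrapper idea, as a toy model rather than the actual code on any branch. The names stDataFiles and initializeDataFiles are taken from the comment above; everything else (Action, runCore, runWithData, readDataFile, bakedInDataFiles, and the shape of CommonState) is hypothetical and greatly simplified. The point is only that if the top-level package exports the wrapper and not the raw runner, callers cannot forget the initialization step.

```haskell
module Main where

-- Toy model only: these types are simplified stand-ins, not pandoc's API.
type DataFiles = [(FilePath, String)]

newtype CommonState = CommonState { stDataFiles :: DataFiles }

-- A toy "PandocMonad" action: a function from the state to a result.
type Action a = CommonState -> Either String a

-- Core package: knows about the field but ships no data files.
emptyState :: CommonState
emptyState = CommonState { stDataFiles = [] }

runCore :: CommonState -> Action a -> Either String a
runCore st act = act st

readDataFile :: FilePath -> Action String
readDataFile fp st =
  maybe (Left ("missing data file: " ++ fp)) Right (lookup fp (stDataFiles st))

-- Top-level package: owns the baked-in data (a made-up entry here).
bakedInDataFiles :: DataFiles
bakedInDataFiles = [("templates/default.html", "<html>$body$</html>")]

initializeDataFiles :: CommonState -> CommonState
initializeDataFiles st = st { stDataFiles = bakedInDataFiles }

-- The wrapper: always initializes before running the action.
runWithData :: Action a -> Either String a
runWithData = runCore (initializeDataFiles emptyState)

main :: IO ()
main = do
  -- Raw runner: the data file is not there.
  print (runCore emptyState (readDataFile "templates/default.html"))
  -- Wrapper: initialization has happened behind the scenes.
  print (runWithData (readDataFile "templates/default.html"))
```

The first call prints a Left with the "missing data file" message and the second a Right with the template text. The fragility discussed above remains for anyone who calls the core runner directly.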
> I guess it should be ok to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.
Correct.
I just pushed an initialize-data-files branch which contains my idea for decoupling data files from pandoc-core. It's a bit awkward because you have to remember to add the initializeDataFiles call whenever you run a PandocMonad instance. But it seems to work.
I haven't been following this closely, so sorry if I'm missing something, but a few thoughts:
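- so the main motivation is reducing compile-time? either when working on the lua subsystem, or when compiling normal pandoc? (using package-level build cache, or...?)
- a potential downside would be that making refactorings that require changes to several packages becomes a lot harder, like we're already seeing with pandoc-types? Or is the idea to keep it in one git repository? a monorepo?
- naming: I think pandoc should be the unambiguous name for the whole combined codebase... (well, we already have pandoc and pandoc-types, but whatever). But maybe pandoc-app, or pandoc-readers-writers instead?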
> - so the main motivation is reducing compile-time? either when working on the lua subsystem, or when compiling normal pandoc? (using package-level build cache, or...?)
At least for me, that's the primary motivation. I also like the idea of having additional clear delimitations in the code base; it also serves as a motivation to untangle the dependency graph (esp. with regard to T.P.Lua).
> - a potential downside would be that making refactorings that require changes to several packages becomes a lot harder, like we're already seeing with pandoc-types? Or is the idea to keep it in one git repository? a monorepo?
My understanding is that we are indeed aiming for a monorepo.
> - naming: I think pandoc should be the unambiguous name for the whole combined codebase... (well, we already have pandoc and pandoc-types, but whatever). But maybe pandoc-app, or pandoc-readers-writers instead?
This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.
Reducing the compile time for developers, and reducing the total memory required to compile pandoc (which is getting ever larger and has made it hard to build pandoc on some systems) are both motivations.
It's true that this would make development a bit more complicated, and that would have to be weighed heavily. I'm developing commonmark-hs this way, as four packages in one repository (also skylighting, skylighting-core), so I have some experience with it. It's not too bad, but you have to think about things like version numbers (if you follow the versioning policy for each package, then they will get out of sync, and this might lead to confusion; in skylighting we simply force them to be in sync but this isn't ideal).
I'm still not sure about the idea, in the end. I don't like the approach in my initialize-data-files branch and now I'm leaning towards thinking that maybe Text.Pandoc.Data and all the data files should, after all, go in core. This is ugly, because if you modify a writer and a template, for example, you'd have to modify two packages. But it's also ugly to require a special initializeDataFiles command every time you do runIO or runPure.
Yeah, if it's done as a monorepo, I think that could work... feels like as a developer, when you build master, you would just want to use the code that's currently on master as well for the other packages. But then you lose the ability to use the cache?
About how to split it, definitely the case of using pandoc the library vs the application should be part of that decision I think...
> This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.
Nothing stops you from generating your own freeze files and using them for the docker images.
This proposal has more implications than I assumed. The potential benefits would likely not match the investments, so I'm closing this. It is probably a good idea to untangle Lua dependencies regardless.
> Nothing stops you from generating your own freeze files and using them for the docker images.
That is true, I'll do that. Other projects are doing the same, Alpine for example. There might still be value in making it more likely for all binaries out there to use the same dependencies.
I wouldn't mind keeping this open; it still seems possibly worth doing, I just can't decide. (Unless you have new insights not mentioned above.)
The only insight not mentioned is that I'll need to refactor HsLua as a prerequisite to untangle and improve T.P.Lua. Refactoring will probably take quite a while, and I didn't want to leave a stale issue hanging around. I'll happily bring the topic up again once I'm confident that it could be completed in a predictable time-frame.
Might as well leave this open, though, since it contains some useful notes on what would be required.
I'm happy to report that we have funding which allows me to spend a few days on this, probably around mid March. Thanks @arfon!
Also, restructuring of HsLua is underway. So far I'm spinning off various smaller packages. This allows me to get some experience with monorepos. More involved updates to the HsLua packages, which should help with all this, are in progress, too.
Closing, as the Lua engine and CLI program are now separate packages. The pandoc-core idea appears too troublesome for the limited advantages we'd get from it; I'm not pursuing it any further (see also #8340).
Note: #8348 opens up a path to extracting pandoc-class or pandoc-monad (T.P.Class hierarchy) without including all the data files. Not sure whether there's a point to this.
I don't yet see how we can cleanly extract pandoc-parsing, because it depends on the latex reader's types (via HasMacros). (Well, maybe this wouldn't be so bad, as it's just the types. We could even envision bringing them outside of the T.P.Readers.LaTeX tree.)
I'm going to open this just so we can keep thinking about it.
It's worth noting that cabal now supports multiple "public sub-libraries" in a single package. From the docs:
> Being able to include more than one public library in a package allows the separation of the unit of distribution (the package) from the unit of buildable code (the library). This is useful for Haskell projects with many libraries that are distributed together as it avoids duplication and potential inconsistencies.
This is interesting. Are there any Hackage projects that define multiple (public) libraries in a single cabal file?
I'm not aware of any. The feature's pretty new.
I'm wondering whether it would make sense to split off parts of pandoc into a separate pandoc-core package. This would make it easier to move other parts into separate packages as well.

My motivation here is the Lua system. It is growing quite large, but, with the exception of the pandoc.read function, is built only on a small part of pandoc the library. The pandoc core (i.e., T.P.Class etc.) as well as the Lua system are relatively stable, so the overhead of having additional packages to maintain seems acceptable.

In a similar vein: while writing jira-wiki-markup, I would have liked to have a pandoc-parsing library. Depending on such a library would make it easy to ensure that the library uses the same parser as pandoc. It could include some of the fixes and convenience functions available in Text.Pandoc.Parsing.