giuspen / cherrytree

cherrytree
https://www.giuspen.net/cherrytree/

Implement `pandoc` support, rather than a bunch of one-off format translators. #878

Open DiagonalArg opened 4 years ago

DiagonalArg commented 4 years ago

Discussion moved over from Issue #858

Pandoc is a "general markup converter" that covers dozens of formats and supports plugins for new formats, written in Lua. The subject came up in the above issue because of pandoc's extensive coverage of various markdown flavors, which can be mixed with LaTeX:

From the man page:

Pandoc's enhanced version of Markdown includes syntax for footnotes, tables, flexible ordered lists, definition lists, fenced code blocks, superscripts and subscripts, strikeout, metadata blocks, automatic tables of contents, embedded LaTeX math, citations, and Markdown inside HTML block elements. (These enhancements ... can be disabled using the markdown_strict input or output format.)

Pandoc runs on Windows and on any Unix-like system: it is in most Linux repositories, and for macOS it is available both as a package on the pandoc site and in Homebrew. Many projects supply interfaces between their project and pandoc, for example Sublime and Atom. There are many more, as this is a well-known and heavily used tool. It's the Swiss Army knife of document conversion.

Pandoc is a free and open-source document converter, widely used as a writing tool (especially by scholars) and as a basis for publishing workflows. It was created by John MacFarlane, a philosophy professor at the University of California, Berkeley. (From Wikipedia.)

If a Cherrytree to/from translator for pandoc were developed, then that Lua script could be included with Cherrytree. With pandoc as an optional dependency, it could become the one tool needed for anything from Markdown + LaTeX to rtf & html to pdf & epub, etc. etc. etc.

Right now would be the time to do it, as it would save the work of writing a markdown translator, while at the same time offering a much more advanced markdown dialect including, for free, LaTeX.

txe commented 4 years ago

I don't know, I would say no rather than yes.

I can see why some users want to convert notes to different formats, but it's still unreasonable to keep notes in Cherrytree only to export them again and again into style-rich formats like LaTeX, RTF and others. For conversion, a user can use the intermediate formats we already support (such as txt, html and, I hope, markdown) and then easily convert from that format to something else using pandoc.

We can try to cover process automation by adding execution of some user scripts after exporting.

If users want better pandoc support than an intermediate format gives, then adding a plugin to pandoc that supports cherrytree formats should fix this issue. But that is a question for pandoc, not for cherrytree.

I don't think a dependency on pandoc just to convert markdown is a good idea.

ForeverRainbow commented 4 years ago

I think the main thing pandoc support would add is a more comprehensive version of Markdown, which, as @txe already said, we should be able to implement mostly on our own. It would also not be useful for real-time formatting (since cherrytree stores everything in its own format, there isn't much point in a separate markdown preview pane or similar).

What I do think would be good, though, is a pandoc filter for converting to/from cherrytree XML so that conversion would be a lot easier. This can be implemented as a separate project, I think, and then added to the cherrytree source later or just left as an option.

datavectors commented 4 years ago

If we could have the xpath to an open CT node (in an environment variable) we can apply pandoc (and other tools) for conversion. Earlier I raised an issue about showing xpath but better to have the xpath in a variable. With that we can run scripts in codebox. I am experimenting now with current CT.

txe commented 4 years ago

What I do think would be good though is a pandoc filter for converting to/from cherrytree XML so that conversion would be a lot easier, this can be implemented as a separate project

I thought pandoc supports plugins, so there is no need for a separate project? Just one more plugin for cherrytree format as, for example, zim did for its own format. After that the next natural step will be adding 'Export to Pandoc' in cherrytree

ForeverRainbow commented 4 years ago

@txe The Zim writer is implemented in Haskell, which I might be able to do for cherrytree, but then it has to be merged with pandoc upstream. What I was talking about was a filter file, so you can run pandoc ... --filter ./ct_filter and it will be able to convert from/to a cherrytree xml file (at least I think). The reason I said separate is that it doesn't need to be part of cherrytree's source until/unless cherrytree supports it within the UI (while it's being developed it can just be something the user runs externally).

The filter files can be written in several different languages with the main ones being Haskell and Lua but python is also an option (and may be better for maintenance)
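For reference, a pandoc JSON filter is a program that reads pandoc's JSON AST on standard input and writes a (possibly modified) AST to standard output. Below is a minimal sketch in Python using only the standard library. The `shout` transformation and the function names are made up for illustration; a real cherrytree filter would do something else entirely.

```python
import json
import sys


def walk(node, transform):
    """Recursively apply `transform` to every AST element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, transform) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, transform) for key, value in node.items()}
        return transform(node) if "t" in node else node
    return node


def shout(element):
    """Placeholder transformation: upper-case every plain string."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


# pandoc passes the target format as the first argument when it runs a filter,
# so only read stdin when an argument is present.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(walk(json.load(sys.stdin), shout), sys.stdout)
```

Saved as an executable script, this would be run as pandoc ... --filter ./ct_filter. Note that a filter only transforms an AST that pandoc has already parsed; it does not add a new input or output format.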

txe commented 4 years ago

Ah, I see

datavectors commented 4 years ago

Has the point about xpath been missed? Anyway I'm looking back to earlier python experiments where I dump CT node and parse it with python. I can then run pandoc or whatever.
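A rough sketch of the "parse the CT document with python" approach, including how a path to a node could be computed without cherrytree exposing it. It assumes a .ctd file is XML with nested `<node name="...">` elements; `node_xpath` is a hypothetical helper, and the path it builds uses the limited XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET


def node_xpath(root, target):
    """Build a positional path such as ./node[2]/node[1] from root to target."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    current = target
    while current is not root:
        parent = parents[current]
        # ElementTree counts positions among siblings with the same tag, 1-based.
        same_tag = [elem for elem in parent if elem.tag == current.tag]
        steps.append(f"{current.tag}[{same_tag.index(current) + 1}]")
        current = parent
    return "/".join(["."] + list(reversed(steps)))


def find_node_by_name(root, name):
    """Return the first <node> with the given name attribute, or None."""
    for node in root.iter("node"):
        if node.get("name") == name:
            return node
    return None
```

With the path in hand, `root.find(path)` returns the same element again, so an external script could pick out one node and hand its text to pandoc or any other tool.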

txe commented 4 years ago

@datavectors , I wrote about it in the other issue?

ForeverRainbow commented 4 years ago

@datavectors that would only work for ctd documents and would also clutter the user's environment variables whenever cherrytree was merely running, which I personally do not like the sound of, but yes, in theory it could be done (I think?). Quite a niche usage though, I think.

datavectors commented 4 years ago

My thoughts are to deprecate the sqlite db option, concentrate on XML processing and use eXist-db as the backend. Using python I can post CherryTree documents as eXist-db collections. All XML. Queries use xpath .. hence my earlier interest. But I can just parse the CT document. I don't regard this as niche.

txe commented 4 years ago

@datavectors, as far as I understand, you already know how to parse the data and your only concern is how to conveniently get the xpath. If so, I wrote about it in #799 a few times; for some reason you did not answer there. If those solutions are not good, we can come up with something better.

DiagonalArg commented 4 years ago

@txe

LaTeX, RTF and others are used to create rather complicated and stylized texts. I can see why some users want to convert notes to different formats, but it's still unreasonable to keep notes in Cherrytree only to export them again and again into style-rich formats like LaTeX, RTF and others. For conversion, a user can use the intermediate formats we already support (such as txt, html and, I hope, markdown) and then easily convert from that format to something else using pandoc.

@ForeverRainbow

I think the main thing pandoc support would add is a more comprehensive version of Markdown, which, as @txe already said, we should be able to implement mostly on our own. It would also not be useful for real-time formatting (since cherrytree stores everything in its own format, there isn't much point in a separate markdown preview pane or similar).

It would not only support a more comprehensive version of Markdown, it would support it in combination with LaTeX. As do, for example, some of the StackExchange sites. Some of us use mathematics as we do English or other natural languages.

So, that is my main personal agenda here: a desire to take notes that include readable mathematics. I did point out in the previous thread that this could be coded directly by using MathJax or KaTeX, but I do think depending on pandoc would be easier.

datavectors commented 4 years ago

A companion editor I use is Atom. In there I have markdown-preview-enhanced package which previews markdown. For maths perhaps Rmarkdown is useful. Anyway I always have CherryTree and Atom running together in a toolchain. Perhaps there is some harmony there.

MJimitater commented 4 years ago

@DiagonalArg I have the same problem. I don't use Atom a lot, since I grew so accustomed to CT (I guess I could give it a try though). I always try to enter math formulas by copy-pasting Unicode symbols! But of course this is only a workaround, and a tedious one too. I also find an option for having LaTeX in CT intriguing.

txe commented 4 years ago

Just to make it clear for myself, so these statements mostly to @DiagonalArg:

Also, python cherrytree has more export formats than pandoc, and they still need to be implemented, otherwise it will be a drawback.

txe commented 4 years ago

Well, I begin to think maybe pandoc will be better than the custom markdown support.

ForeverRainbow commented 4 years ago

@txe I think supporting pandoc would be great, I had a crack at writing a custom writer in Lua earlier and ran into some issues. Pandoc seems to use bottom-up parsing so for example <tag <tag2/> /> is fed first as tag2 then as tag which is an issue because cherrytree does not support nested tags (which may be easier to fix than trying to work around the parser actually?).

Conceivably pandoc could also be used for the formatting of text in the editor (so write in markdown, latex, etc). Only problem with this I can see is deciding when to feed it, presumably the user would hit a key to transform the whole document but it might be a bit janky.

Also I was wrong about the filter by the way, if we want a custom format we have to use either Haskell (and build a custom version of pandoc or merge it upstream) or a Lua script, I think the lua script is the better option here

txe commented 4 years ago

Conceivably pandoc could also be used for the formatting of text in the editor (so write in markdown, latex, etc). Only problem with this I can see is deciding when to feed it, presumably the user would hit a key to transform the whole document but it might be a bit janky.

Yeah, it's not easy to achieve

Also I was wrong about the filter by the way, if we want a custom format we have to use either Haskell (and build a custom version of pandoc or merge it upstream) or a Lua script, I think the lua script is the better option here

Right, I read a bit about it, and filters are not what we need. We need a converter from xml to pandoc AST, and it can be written in any language, e.g. python; maybe python filters can give some ideas. Eventually, working code can be rewritten and moved to the pandoc source.

So, the converter takes a file or xml input and outputs pandoc AST, so it can look like

converter -f file.ctd | pandoc -f json

txe commented 4 years ago

I don't see much pandoc AST documentation, but it can be figured out from python filters or from pandoc output, e.g. pandoc -t json
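A minimal sketch of such a converter in Python, for illustration only. It assumes the .ctd layout is nested `<node name="...">` elements containing `<rich_text>` runs with attributes like weight="heavy" and style="italic", and it handles only those two attributes. The JSON shapes (Header, Para, Str, Space, Strong, Emph) follow what pandoc -t json prints; the `PANDOC_API_VERSION` value is an assumption and has to match the installed pandoc.

```python
import json
import re
import sys
import xml.etree.ElementTree as ET

# Assumption: copy the real value from the output of `pandoc -t json`.
PANDOC_API_VERSION = [1, 22]


def text_to_inlines(text):
    """Split plain text into Str, Space and LineBreak inlines."""
    inlines = []
    for piece in re.split(r"(\s+)", text):
        if not piece:
            continue
        if piece.isspace():
            inlines.append({"t": "LineBreak" if "\n" in piece else "Space"})
        else:
            inlines.append({"t": "Str", "c": piece})
    return inlines


def node_to_blocks(node, depth=1):
    """Turn one <node> into a Header, a Para, and the blocks of its children."""
    title = text_to_inlines(node.get("name", ""))
    blocks = [{"t": "Header", "c": [min(depth, 6), ["", [], []], title]}]
    inlines = []
    for run in node.findall("rich_text"):
        run_inlines = text_to_inlines(run.text or "")
        if not run_inlines:
            continue
        if run.get("style") == "italic":
            run_inlines = [{"t": "Emph", "c": run_inlines}]
        if run.get("weight") == "heavy":
            run_inlines = [{"t": "Strong", "c": run_inlines}]
        inlines.extend(run_inlines)
    if inlines:
        blocks.append({"t": "Para", "c": inlines})
    for child in node.findall("node"):
        blocks.extend(node_to_blocks(child, depth + 1))
    return blocks


def ctd_to_pandoc(xml_text):
    """Convert cherrytree XML text into a pandoc JSON document (as a dict)."""
    root = ET.fromstring(xml_text)
    blocks = []
    for node in root.findall("node"):
        blocks.extend(node_to_blocks(node))
    return {"pandoc-api-version": PANDOC_API_VERSION, "meta": {}, "blocks": blocks}


# Only runs when a file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        json.dump(ctd_to_pandoc(handle.read()), sys.stdout)
```

Saved under a hypothetical name like ctd2ast.py, it could be used as python ctd2ast.py file.ctd | pandoc -f json -t markdown. Links, colors, code boxes, tables and images are all ignored here.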

ForeverRainbow commented 4 years ago

What I was talking about was actually the option to specify a custom format file with -t file.lua, which has a bunch of functions like Strong, Header, etc., which are called with the input data and need to return the output format. Modifying the example, I got a mostly complete pandoc -> cherrytree xml formatter... except it cannot handle multiple tags, because it will just wrap the text in multiple xml tags instead of adding an attribute. I am actually not sure this can be fixed with the way the lua file gets fed, so we may have to convert to pandoc AST, but that will be a lot more work.

Edit: Just realised I'm being dumb. Yes, we are going to need to convert to pandoc AST afaik in order to export things; to import things we can hopefully use a Lua writer.
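To illustrate the multiple-tags problem: if the whole inline tree is available (as it is when working from the pandoc AST), nested formatting can be flattened top-down into attributes on a single tag. A small Python sketch, not the Lua writer itself; the attribute names weight="heavy" and style="italic" are assumed from the cherrytree XML format, and only Str, Space, Strong and Emph are handled.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed mapping from pandoc inline types to cherrytree rich_text attributes.
STYLE_ATTRS = {"Strong": ("weight", "heavy"), "Emph": ("style", "italic")}


def flatten(inlines, attrs=None):
    """Walk pandoc inlines top-down and return flat (attributes, text) runs."""
    attrs = attrs or {}
    runs = []
    for inline in inlines:
        kind = inline["t"]
        if kind == "Str":
            runs.append((attrs, inline["c"]))
        elif kind == "Space":
            runs.append((attrs, " "))
        elif kind in STYLE_ATTRS:
            name, value = STYLE_ATTRS[kind]
            # Carry the accumulated attributes down instead of nesting tags.
            runs.extend(flatten(inline["c"], {**attrs, name: value}))
        # Other inline types are ignored in this sketch.
    return runs


def to_rich_text(runs):
    """Serialize runs as sibling <rich_text> elements, one per run."""
    parts = []
    for attrs, text in runs:
        attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in sorted(attrs.items()))
        parts.append(f"<rich_text{attr_text}>{escape(text)}</rich_text>")
    return "".join(parts)
```

Bold text containing italic text then becomes one rich_text element carrying both attributes rather than two nested tags.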

steveno commented 4 years ago

My thoughts are to deprecate the sqlite db option, concentrate on XML processing and use eXist-db as the backend.

Please don't. SQLite is one of, if not the, most thoroughly tested, not to mention most widely used, database in the world. If moving the XML version of cherrytree to a document store is something you want to do, sure, but there's zero reason AFAICT to drop SQLite.

txe commented 4 years ago

@steveno, sqlite is not going to be dropped or become a second-class citizen. @datavectors was just describing his approach to parsing data.

txe commented 4 years ago

@ForeverRainbow, are you going to do it?

ForeverRainbow commented 4 years ago

I can try, but not 100% sure I can

DiagonalArg commented 4 years ago

@ForeverRainbow

Conceivably pandoc could also be used for the formatting of text in the editor (so write in markdown, latex, etc). Only problem with this I can see is deciding when to feed it, presumably the user would hit a key to transform the whole document but it might be a bit janky.

My hope was that its integration into CT would be as follows:

  1. Input markdown + latex into CT

  2. In the background have that converted to CT's XML via pandoc

  3. View that XML in either or both of:

    • a companion CT window
    • via a hot-key that allows us to switch back and forth, rendered/unrendered. (Perhaps the option for both.)

This, I understand, is how most markdown editors function. (As commented here by @dvdgsng .)

(It could of course also be used for import/export to various formats.)

datavectors commented 4 years ago

Very sorry if I caused a minor panic by suggesting use XML in lieu of SQLite. This is just my personal preference.

On the matter of markdown preview would it be feasible to integrate markdown-preview-enhanced .. like other editors Atom and VSCode? But note dependency mume.

ForeverRainbow commented 4 years ago

@txe What do you think of, instead of trying to create a custom writer for cherrytree files, just using html and sending it through the existing importer/exporter within cherrytree? If it's not for live formatting and just a preview pane, then I think it should work fine.
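The pandoc half of that idea could look roughly like this in Python: call the pandoc executable, feed it the source text on stdin, and collect HTML from stdout. This is only a sketch; the function names are made up, and the step that hands the HTML to cherrytree's importer is left out because that API is not described in this thread.

```python
import shutil
import subprocess


def pandoc_command(source_format, pandoc="pandoc"):
    """Build the command line that converts `source_format` on stdin to HTML on stdout."""
    return [pandoc, "--from", source_format, "--to", "html"]


def to_html(text, source_format="markdown"):
    """Convert text to HTML by running pandoc; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        pandoc_command(source_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because only the --from value changes, any input format pandoc can read would go through the same path.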

txe commented 4 years ago

That's actually a good idea)

DiagonalArg commented 4 years ago

Edit: Ah, so never mind. Reading the other thread, I'm getting that this is an alternative to the Lua-coded translator, which would make the below possible anyway. Still, I won't delete it, in order to retain the expression of enthusiasm.

If that's actually a good idea (it looks so to me, but I'm not coding anything, so can't be sure), then it appears to me that any format that can be converted to html by pandoc would be something I could then use in CT.

So beyond markdown/latex, I'd be able to use lightweight markup, textile, txt2tags, org-mode, or various wiki variants, and have them displayed. I might even write man pages; and while CT is no IDE, there'd nevertheless be the option to do restructuredtext for python documentation, and you could even produce haddock from your haskell source!

Not that I personally want to do all that, but nevertheless, I'm enthusiastic! :)

ForeverRainbow commented 4 years ago

@txe I was thinking that the easiest way to mark text to be sent through pandoc would be to have it as a rich text attribute, something like "format=pandoc". Do you know if this will break anything in the python version?

txe commented 4 years ago

I don't know, I've never done that before. On the other hand there are only few places where attributes are used, they are easy to find and check what trouble it can cause.

ForeverRainbow commented 4 years ago

Pandoc exporting is going to be pretty difficult, since there needs to be some way to detect which text to send through pandoc and which is internal ct formatting, and then perhaps even more complicated is joining them together.

The best and I think only sane way is to create a new node type (e.g. "markdown_text") which operates like a plain-text node in terms of cherrytree formatting but has the option to be exported to pandoc and later have a preview pane.

Actually I have just realised that cherrytree already has this in the automatic syntax highlighting option.

DiagonalArg commented 4 years ago

@ForeverRainbow - So just out of curiosity, the Lua CT plugin is not a viable approach?

ForeverRainbow commented 4 years ago

@DiagonalArg I do not think so no. The approach I am currently running with is using html as the interface language since it is far easier to implement and maintain. A custom Lua writer is possible (I think) but due to some assumptions made and the way that the cherrytree xml is structured, it will be quite difficult

ForeverRainbow commented 4 years ago

Now that pandoc importing and pandoc html exporting are done, the question is what form this "preview" of formatted text would take. I think a preview pane would be the most useful option, and I actually tested a multi-pane system a few days ago. I think if a preview pane is going to be implemented then #868 should be done first and then a preview pane should be a subset of whatever we come up with to represent each pane in the editor

ajaxStardust commented 1 year ago

I know very little about Pandoc, but I recall it being used in probably more than one software app that you'd probably respect. There must be something about it that people find appealing, be it implementation, quality, or whatever. I used to like Flashpaper. Haha. It would still be good for local files! I think it was PDF quality. Sadly, I don't think you can even use it anymore. I have several .swf files that I'll probably never read again.

You raise a good discussion.