spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Cross Compile with wtf_wikipedia.js to other Document formats #80

Closed niebert closed 6 years ago

niebert commented 6 years ago

Parsing wiki markup generates a syntax tree. Is there a recommended way to create an output format other than plain text? I want to use wtf_wikipedia.js and the generated syntax tree for document conversion, e.g. to create LaTeX and other output formats, similar to Pandoc (https://pandoc.org/try/). For example, convert "===My Header===" into the LaTeX syntax "\subsection{My Header}". Thank you very much for developing wtf_wikipedia.js and sharing the code.
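The header conversion asked about here can be sketched with a small regular expression. This is only an illustration: `headingToLatex` and `LATEX_COMMANDS` are hypothetical names, not part of wtf_wikipedia, and the mapping from the number of `=` signs to a LaTeX command is an assumption based on the "===My Header===" example above.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Assumed mapping: the number of '=' signs picks the LaTeX sectioning command.
const LATEX_COMMANDS = {
  2: 'section',
  3: 'subsection',
  4: 'subsubsection'
};

function headingToLatex(line) {
  // Match e.g. "===My Header===" with the same run of '=' on both sides.
  const match = /^(={2,4})\s*(.+?)\s*\1\s*$/.exec(line);
  if (!match) {
    return line; // not a heading: leave the line untouched
  }
  const command = LATEX_COMMANDS[match[1].length];
  // Note: the title is not escaped for LaTeX special characters in this sketch.
  return '\\' + command + '{' + match[2] + '}';
}

console.log(headingToLatex('===My Header===')); // \subsection{My Header}
```

A line-by-line approach like this ignores nesting and inline formatting, which is why a tree-based conversion is the more robust route for whole documents.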

spencermountain commented 6 years ago

hey Engelbert! whoa, I've never seen pandoc. Thank you for sharing this.

yeah, this is a really neat idea. I'd thought about outputting the parsed data back to html or markdown, but I'll admit it is a little funny, when html output is the thing the wikimedia parser does well. You may wanna go that route, if you require cosmetic things like sortable tables and stuff, which this library ignores.

But yeah, I can definitely see a use for outputting cleaned-up markdown/html from the arrays of sentences and stuff. That would be a fun thing, and I'd be happy to do it.

how would something like this be?

wtf(myWikiText).toHtml({links:true, tables:true, formatting:true, infoboxes:true})

cheers

niebert commented 6 years ago

thank you for your reply. I would recommend parsing into a syntax tree, similar to the DOM tree in the browser (a root node, and nodes that have children, e.g. subsections are children of sections, ...). This abstract syntax tree (AST) could be generated by your wtf_wikipedia.js. A tree visitor then runs over the nodes of the AST and exports to a specific output format.

 var ast = wtf(myWikiText).toAST({links:true, tables:true, formatting:true, infoboxes:true});
 var ast2tex = new AST2Latex(); // AST visitor to create LaTeX 
 var latex_out = ast2tex.convert(ast);

It could also be possible to init wtf with a visitor:

 var ast2tex = new AST2Latex(); // AST visitor to create LaTeX 
 wtf.initVisitor(ast2tex);
 var latex_out = wtf(myWikiText).compile({links:true, tables:true, formatting:true, infoboxes:true});
 var ast2reveal = new AST2Reveal(); // AST visitor to create RevealJS presentation 
 wtf.initVisitor(ast2reveal);
 var reveal_out = wtf(myWikiText).compile({links:true, tables:true, formatting:true, infoboxes:true});
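To make the visitor idea concrete, here is a minimal sketch of what such a tree and visitor might look like. Everything in it is an assumption for illustration: wtf_wikipedia does not provide `toAST()` or `initVisitor()`, the node shape (`type`, `title`, `depth`, `children`, `text`) is invented, and `AST2Latex` simply reuses the name from the snippets above.

```javascript
// Hypothetical AST shape; wtf_wikipedia does not produce this structure.
const ast = {
  type: 'document',
  children: [
    {
      type: 'section',
      title: 'Topology',
      depth: 1,
      children: [
        { type: 'paragraph', text: 'A topology is a family of open sets.' },
        {
          type: 'section',
          title: 'Examples',
          depth: 2,
          children: [{ type: 'paragraph', text: 'The discrete topology.' }]
        }
      ]
    }
  ]
};

// Hypothetical visitor: one handler per node type, dispatched recursively.
class AST2Latex {
  constructor() {
    const commands = ['section', 'subsection', 'subsubsection'];
    this.handlers = new Map([
      ['document', (node) => this.visitChildren(node)],
      ['section', (node) => {
        // Clamp the depth so deeper sections fall back to \subsubsection.
        const index = Math.min(Math.max(node.depth, 1), commands.length) - 1;
        return '\\' + commands[index] + '{' + node.title + '}\n' + this.visitChildren(node);
      }],
      ['paragraph', (node) => node.text + '\n']
    ]);
  }

  convert(node) {
    const handler = this.handlers.get(node.type);
    if (!handler) {
      throw new Error('No handler for node type: ' + node.type);
    }
    return handler(node);
  }

  visitChildren(node) {
    return (node.children || []).map((child) => this.convert(child)).join('\n');
  }
}

console.log(new AST2Latex().convert(ast));
```

A second visitor (for RevealJS, say) would only need a different handler table; the tree walk stays the same.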

See PanDocElectron in Wikiversity. A simple Wikiversity article, e.g. a math lecture about topology, can be converted into a RevealJS presentation directly from the Wikiversity source. I created PanDocElectron as a multiplatform Electron application for Linux, Windows and MacOSX, but the installation is too complicated. wtf_wikipedia.js will allow performing a wiki source conversion directly in a browser without any installation. Browserify the whole project in NodeJS together with

cheers

niebert commented 6 years ago

Dear Spencer, thank you very much for the solution

wtf(myWikiText).toHtml({links:true, tables:true, formatting:true, infoboxes:true,math:true})

will be a great feature. Parsing the output HTML source into a DOM tree will be easy, because the browser does it; jQuery can be used, or even innerHTML. That is a good starting point for running the cross-compilation to other formats.

The AST example in my comment above was not meant to be implemented by you. I just wanted to explain how I want to use the HTML output for cross-compiling, with a plugin concept for output formats. I plan to use Handlebars compile functions to extend the methods of DOM nodes. Applying the compile method to the DOM root node of the HTML document creates the output format by calling the compile functions for all children.

Thank you very much for sharing and developing __wtf_wikipedia.js__.

spencermountain commented 6 years ago

hey Engelbert, I've gotten markdown and html outputs working in the 2.6.1 version. This is how it works:

var wtf = require('wtf_wikipedia')
wtf.from_api('Aldous Huxley', 'en', function(wiki) {
  var md = wtf.markdown(wiki);
  console.log(md) // view the rendered markdown at https://stackedit.io/app

  var html = wtf.html(wiki)
  console.log(html) // regular old html output
});

i'm happy to do a proper AST. I've never done that before. lemme know if this works ok for you https://runkit.com/spencermountain/5a90bff3fb73ad0012f5f476 cheers

niebert commented 6 years ago

You are great, thank you so much.

Did a minor feature analysis for WebODF http://www.webodf.org/demos/

I am just starting to realize how powerful your library can be: fully web-based office document generation.

cheers, Bert

niebert commented 6 years ago

The following code can create a DOM tree from the generated HTML code (see also https://gojs.net/latest/samples/DOMTree.html ).

var dom_tree = document.createElement("body");
var html_code = "<b>hello</b> World!";
dom_tree.innerHTML = html_code;
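Once the HTML is in a DOM tree, the plugin concept described earlier in this thread (one compile function per node, applied from the root down through all children) could look roughly like the sketch below. It is an illustration under assumptions: `latexPlugin` and `compileNode` are hypothetical names, plain functions stand in for Handlebars templates, and the tag-to-LaTeX mapping is only an example. The function reads just `nodeType`, `nodeName`, `nodeValue` and `childNodes`, so it accepts real DOM nodes as well as the plain-object stand-in used here.

```javascript
// Hypothetical plugin table for one output format (LaTeX): tag name -> compile function.
// Plain functions stand in for the Handlebars templates mentioned in this thread.
const latexPlugin = {
  B: (inner) => '\\textbf{' + inner + '}',
  I: (inner) => '\\textit{' + inner + '}',
  H2: (inner) => '\\section{' + inner + '}\n',
  P: (inner) => inner + '\n\n'
};

const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the browser

function compileNode(node, plugin) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  // Compile the children first, then wrap the result for this element.
  const inner = Array.from(node.childNodes || [])
    .map((child) => compileNode(child, plugin))
    .join('');
  const hasCompiler = Object.prototype.hasOwnProperty.call(plugin, node.nodeName);
  // Tags without a compile function pass their content through unchanged.
  return hasCompiler ? plugin[node.nodeName](inner) : inner;
}

// Plain-object stand-in for the DOM tree of "<b>hello</b> World!" inside a body element.
const mockBody = {
  nodeType: 1,
  nodeName: 'BODY',
  childNodes: [
    { nodeType: 1, nodeName: 'B', childNodes: [{ nodeType: TEXT_NODE, nodeValue: 'hello' }] },
    { nodeType: TEXT_NODE, nodeValue: ' World!' }
  ]
};

console.log(compileNode(mockBody, latexPlugin)); // \textbf{hello} World!
```

In a browser, passing `dom_tree` from the snippet above instead of `mockBody` should give the same result, since HTML element names are reported in upper case by `nodeName`.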

spencermountain commented 6 years ago

thanks! yeah, just a heads-up - I'm gonna change the api a bit in the next version, so that if you want to get the html, and a list of categories, you won't have to parse the document twice. It'll be something like

var doc = wtf(wiki)
var html= doc.toHtml()
var cats= doc.categories() 
//...and so on

cheers!

niebert commented 6 years ago

Great, perfect. One thing I haven't understood properly: how do you want developers to extend your library?

spencermountain commented 6 years ago

hey, yeah this is a great question, and you have good timing. This should definitely be part of the redesign - a way to influence the parser, and also a way to easily extend the functionality. hmm. Gimme a couple days to put the pieces into place, then I'll get your help with this. Things are currently pretty messy, but doing things like toLatex() should become substantially easier in a few days.

I'd be happy to include latex as an output format. Great idea

niebert commented 6 years ago

If you want me as a collaborator, I am willing to help. First I would support you with documentation in README.md; later I would support you with additional output formats.

See https://niebert.github.io/Wiki2Reveal//wtf_wiki2html.html for how I converted the markdown to HTML quick and dirty with https://github.com/niebert/Wiki2Reveal/blob/master/docs/js/wiki2html.js

spencermountain commented 6 years ago

hey, very cool. Yeah, of course, that would be good. I'm re-organizing the library a great deal in the dev branch which you can check out. The basic idea is that doc = wtf(string) will do basic parsing, but things only get fully parsed when they're needed, and we can continue operating on the doc as a class, with all sorts of helper-functions.

The branch is moving around a lot still, but you're welcome to join in. Let's do some documentation once it's stable. You can start on a latex output, if you wanted, with our other outputs, in src/output.

You'll notice that instead of creating a gigantic json file, it's starting to create a document with methods like .sentences() .sections() and things - it's a lot cleaner, as the page gets bigger and weirder
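The "only parse when needed" idea can be sketched as a class that caches each result the first time its method is called. This is not the library's actual implementation: `LazyDoc`, its `lazy` helper and the two regular expressions are hypothetical and far simpler than real wikitext parsing.

```javascript
// Hypothetical sketch of lazy parsing; not the actual wtf_wikipedia internals.
class LazyDoc {
  constructor(wikiText) {
    this.wikiText = wikiText;
    this.cache = new Map(); // results of parsers that have already run
  }

  // Run a parser the first time its result is requested, then reuse the result.
  lazy(key, parse) {
    if (!this.cache.has(key)) {
      this.cache.set(key, parse(this.wikiText));
    }
    return this.cache.get(key);
  }

  categories() {
    // Simplified: picks up [[Category:Name]] and [[Category:Name|sortkey]].
    const pattern = /\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]/g;
    return this.lazy('categories', (text) =>
      Array.from(text.matchAll(pattern), (m) => m[1].trim())
    );
  }

  sections() {
    // Simplified: a heading is a line wrapped in the same run of '=' signs.
    const pattern = /^(={2,})\s*(.+?)\s*\1\s*$/gm;
    return this.lazy('sections', (text) =>
      Array.from(text.matchAll(pattern), (m) => ({ title: m[2], depth: m[1].length - 2 }))
    );
  }
}

const doc = new LazyDoc(
  '==Life==\nHe wrote novels.\n===Early years===\n[[Category:English novelists]]'
);
console.log(doc.sections());
console.log(doc.categories());
```

Calling `doc.sections()` a second time returns the cached array, so asking for several views of the same page does not repeat the work.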

spencermountain commented 6 years ago

feel free to make a pr with some latex output. The next few days are busy for me, so I won't conflict with anything. You can see some of the tests are already passing on the dev branch. It's in reasonable shape ;/

spencermountain commented 6 years ago

hey Engelbert, i'm just heading out tomorrow on a 2-week vacation (to japan!). I didn't get around to releasing the dev branch, but will do so when i return at the start of April. sorry about that cheers

niebert commented 6 years ago

Have a great time in Japan.

spencermountain commented 6 years ago

thanks @niebert !