pragdave / earmark

Markdown parser for Elixir
Other
869 stars 136 forks source link

Feature idea: Output manipulation #145

Closed pareeohnos closed 5 years ago

pareeohnos commented 7 years ago

Apologies if this is something earmark can already do, or there is a way around it but I couldn't find anything on it.

I'm woking on an application that is using earmark, and I want to be able to add links to a page that are anchors to the headers produced by earmark. As far as I can tell, as of right now I would basically need to produce the HTML, and then edit the result manually which is fine but seems a bit long winded.

Unless I'm mistaken, using the plugin system wouldn't really work, as there are multiple ways to define a header, either via the # syntax, or underlining the word with = symbols, and in the latter case, registering a plugin for = would simply result in a line

Proposal

Introduce a new configuration option that works in a similar way to the plugin system, but instead of giving you the unprocessed content, gives you the already processed output that can be manipulated. These could be registered on a per tag basis, so I could in this instance register the processor for h1 and h2 tags for example.

Obviously this makes it sound a lot simpler than I imagine it is but you get the idea.

Edit: I realise there is the IAL extension for adding HTML attributes, however this puts the responsibility of doing so on the author of the markdown. The project I'm working on needs to do this automatically without any input from the markdown being processed.

RobertDober commented 7 years ago

Do I understand correctly that what you eventually want is this?

# Hi
  <h1 name="hi">Hi</...>

I like the 💡 👍

IIUC you would like to achieve this with Renderer Plugins instead of a fixed option?

Edit: ;) You are completely right IALs are not the tool for the job.

Redit:

The plugin system would work in the sense that you could write Markdown like this


$$ # Hello

with %Earmark.Options{plugins: %{"" => YourAnchorPlugin}} which is even more intrusive than IAL.

pragdave commented 7 years ago

Can't you do this with an inline ial?

On Sat, May 27, 2017 at 5:37 AM, Robert Dober notifications@github.com wrote:

Do I understand correctly that what you eventually want is this?

Hi

Hi I like the 💡 👍 IIUC you would like to achieve this with *Renderer* *Plugins* instead of a fixed option? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or mute the thread .

pareeohnos commented 7 years ago

@RobertDober Yeah that's exactly it. Some way I could add a plugin which would simply let you manipulate the already generated html (or a pre HTML structure). I guess it could provide two options. One whereby headers are automatically given an id that is auto-generated based on its content which would cover my requirement personally, and a second that provides a more comprehensive way of manipulating things.

The plugin system would work in that example, however that again would require the author of the markdown to know about this and to put it in, which in the case of the system I am working on is not something I really want to do if I can avoid it.

@pragdave Whilst I could use the IAL this makes it the responsibility of the author of the markdown. The system I'm writing allows users to upload their own markdown files, and the system itself will generate a page where it will have links to various sections in the markdown file. Ideally I want this to be automated so users aren't required to make any alterations to their potentially already existing markdown files.

Sorry if I misunderstood what you mean when you say inline IAL, is this something different?

RobertDober commented 7 years ago

@pareeohnos I was studying the code (bad memory) and it might be nice to add something like Earmark.as_tree(..., %Options{}) as some transformation of the internal block structure. There are many possible variations about the degree of output format agnosticism, it is such a pitty, IIANM, that there is no standardized (or even close) Markdown AST. @pragdave I'd love to play around with this, but without interfering with the current code that produces Earmark.as_html. @pareeohnos if Dave agrees and I find time we might have some experimental features as Earmark.as_tree/ast/uname_it;), trivial to transform to hmtl and/or with a plugin enabled transformer available.

But it will take time I am afraid.

pareeohnos commented 7 years ago

Ah that might be a better approach! I'm currently parsing the resulting HTML to determine the headers that are there, but if given an AST type structure before the HTML then I can manipulate the headers to add id's, AND get a list of those headers in one go, that would be much better!

I wouldn't expect a very quick turn around don't worry :) would be happy to help out with this as well. Could always do some of it along side the project I'm working on

RobertDober commented 7 years ago

I have created a new branch as-ast-implementation that shall serve to try to implement this.

I have generated, failing, tests in test/ast that are generated from the acceptance tests (with Floki, hence the somehow strange attribute syntax but we can easily convert to Keyword later.

@pareeohnos if you want to participate in this, PRs vs. this branch would be very welcome of course.

pareeohnos commented 7 years ago

@RobertDober awesome thanks, I'll take a look and start trying to integrate it with my project :)

RobertDober commented 7 years ago

@pareeohnos not sure I was clear enough, there is only the spec right now.

pareeohnos commented 7 years ago

@RobertDober yeah I got that don't worry. Sorry I meant more I'll change my dependency, and start filling in the holes as I go :) too tired to speak clearly at the moment 😄

pareeohnos commented 7 years ago

@RobertDober just had a thought after looking through the code. Since the Earmark.parse/2 function is already exposed, would it be easier to simply create an alternative implementation of to_html that accepts the block structure and context rather than a string?

Obviously this isn't strictly an AST, however for the AST to be useful, it would need to also have an input that accepts the AST instead of a markdown string. In order for this to work with the existing code, it would need to be able to both convert the internal block structure into the AST, and also be able to convert back from AST to the block structure.

Since the block structure is being made public, is it strictly "internal" to the point where it warrants the overhead involved in two transformations of the data? I've tested it out, and essentially introducing one more function

markdown = """
# Hello world
- test
"""

{ blocks, context } = Earmark.parse(markdown)
blocks =
  blocks 
  |> Enum.map(fn(b) ->
         case b do
           %Earmark.Block.Heading{ attrs: attrs } -> %{ b | attrs: %{ "class" => ["heading"] } }
           _ -> b
         end
       end)

Earmark.to_html(blocks, context)
# For now, this works
# context.options.renderer.render(blocks, context)

Not the cleanest code, but it does work. Probably quite a bit easier than implementing a full blown AST, or would that still be preferable?

RobertDober commented 7 years ago

Interesting thoughts

of course exposing an AST is not much more than fixing the internal representation. However I do feel very uneasy about the low access point. This shows in the fact that your code is really inside Earmark. I find it much more conforting to expose a much more Renderer, in our case HTML, related AST where all the conversion has been done.

Then again Earmark should expose maybe the final renderer for your transformed AST.

The downside would be that you cannot do a simple map but that is also it's strength, you could make context informed choises in your converter.

May I selfpromote my Traverse library in that context.

There are really quite some possibilities - although maybe not an infinite combination of infinite numbers ;).

Thanx for your input in any case.

pareeohnos commented 7 years ago

Yeah I'm not keen on the idea of using the internal block structure, as that would leave my code fragile to any changes in that structure, whereas a public AST structure would make that substantially safer.

I made a start by effectively replicating the HtmlRenderer module but producing the AST structure instead which should work, it just needs something to reverse the transformation later. Do you think this would be better as a different version of to_html where the AST is supplied, or a new function, something like from_ast or similar?

RobertDober commented 7 years ago

I was rewriting lots of code this WE to fix #147. It stroke me that rendering an HTML AST really would not be much more work than rendering to text.

If I can do this, BTW sorry if you worked on the Renderer recently the upstream just changed substantially and it is not over yet.

Once it is I really like the idea of rendering to a HTML Datastructure like Floki's and then transforming that to text would be little work.

For you that would mean

    {status, ast, messages} = Earmark.Interface.html(input, options)
    ast' = do_something(ast)
    Earmark.Renderer.to_html(ast')

Stop Press: I have merged the branch for this if you want to pull from upstream

pareeohnos commented 7 years ago

That would be perfect, exactly the sort of thing I've been looking at implementing. Is the Earmark.Interface.html replacing Earmark.to_html then?

RobertDober commented 7 years ago

I suppose you mean as_html, but no.

However if now it invokes HtmlRenderer.render eventually it would be the code above with do_something being a NOP.

pareeohnos commented 7 years ago

ah sorry, as_html is what I meant. ok so I'll wait until you're done with all the changes before carrying on with things. No point doing any more if the code is changing massively. Any ETA on when the changes would be ready?

RobertDober commented 7 years ago

Unfortunately no, however what you could do, if you want, is to duplicate test/acceptance into test/ast_acceptance(or so) and rewrite all acceptance tests by changing the calls to Earmark.as_html to Earmark.Interface.html(easy) and the expected string to an AST (hard).

I guess using Floki to automate the task is a good idea.

Once that in place I might have simplified HtmlRenderer a little, which is, as you can see quite complicated (at least for my brain).

Now this is much work and I wouldn't dare asking you to do this, but well if you could help :blush:

pareeohnos commented 7 years ago

Isn't that what you've already done on the as-ast-implementation branch? All the tests already in test/ast are the same as those in test/acceptance? The only difference being that they currently call Earmark.as_ast and not Earmark.Interface.html so I can change those easily enough.

RobertDober commented 7 years ago

OMG senility lurks only two blocks away from here.... 😨 If you might change these, that would be great.

pareeohnos commented 7 years ago

ok, all AST acceptance tests have been updated on my fork :)

RobertDober commented 7 years ago

Great. Whenever you want I will merge PRs against the as-ast-implementation branch.

Is there a way to give commit rights for branches only? That would make our life easier.

pareeohnos commented 7 years ago

Great, will setup a PR now. I don't think you can grant access to a single branch, but you can do the opposite I think. You can protect branches, and only allow certain collaborators to make changes and it prevents some other stuff https://help.github.com/articles/about-protected-branches/

RobertDober commented 7 years ago

Very interesting indeed. In our case that would not work, because @pragdave is the admin and if we protect master and maybe a couple of other branches even I could not push to them anymore 😢 .

So I will be greatfull for PRs. Actually pulling in changes from origin or from upstream if you name this repro in that way does not make much difference I guess.

pareeohnos commented 7 years ago

ah yeah if it's a personal repo then you don't seem to have control. Organisation repo's let you protect it, and then whitelist users. But yeah, PRs work just fine I think :)

Anything else that needs doing now?

RobertDober commented 7 years ago

@pareeohnos I was starting to sketch out the code, would you be interested in going on from here? If so I leave it alone, if not I will progress but very slowly I guess.

pareeohnos commented 7 years ago

Yeah would be happy to work on this. Have you got a high level overview of what you had in mind and I can try and implement it in that way? Don't wanna go and write it in a completely different way to what you had planned

RobertDober commented 7 years ago

The basic idea is to mimic what HtmlRenderer.render does it is quite complicated with inline rendering. I was thinking a lot about simplifying but could not really wrap my head around it.

I am still dreaming of a v2 in which all lines are tokenized, but that is beyond the scope of this as we shall definitely start by modifiying the blocks as delivered by the parser in this version.

I appreciate your spirit to keep it in the same mindset but if you see simplifications that eluded me I would be happy and glad to discuss them with you.

pareeohnos commented 7 years ago

sounds good, that's pretty much what I'd started doing initially. Do you think this should be something that is configurable in the same way the renderer can be set in the Config struct, or just keep it as something internal?

ha yeah definitely something for V2 :P

I'll have another run through the code and try and get my head around things then get stuck in :)

RobertDober commented 7 years ago

If you want to advance as fast as possible you probably should stick with that.

OTOH If you want to write plain defs for each renderer and just call them that would be replacing context.options.renderer.link by e.g. Earmark.Renderer.Html.render_link (of course importing it, but just to remind myself of the module layout) it is something I was thinking about for quite some time.

That said Dave's idea was great to carry functions around and allow them to be replaced the question is however: Did anyone ever use that? Or is that something someone hopes to use?

I believe that it is good to experiment and in the end if Dave prefers his way we will humbly keep it, and get stuck within it ;).

I am looking forward to your PRs.

pareeohnos commented 7 years ago

I'll stick with that for now, could always refactor later if we feel it needs changing or if Dave doesn't like it :)

pareeohnos commented 7 years ago

hmm taking a bit longer than I'd hoped. Got the bulk of it done, but the inline module assumes (understandably) that the result of the renderer is HTML, and as such has a bunch of code that manipulates that HTML which breaks the AST implementation so have to get around that

RobertDober commented 7 years ago

Yeah the recursive depth and width of the code base is complciated to handle. But I am afraid we are dealing with domain induced complexity. Even Dave's original code was quite complex and I added by introducing pure error handling and some features...

Not sure I will have much time but I'd gladly look at your code...

pareeohnos commented 7 years ago

It's not so much the complexity of it but rather the assumptions it makes (or rather, the assumptions I made). I had assumed that the HTMLRenderer module was the only thing responsible for rendering HTML, which is mostly true, however the Inline module seems to do a little of this as well, and because of it assuming that the renderer will always be returning HTML and not anything else, it is breaking.

I've covered a few things by simply checking the structure of the response from the renderer and it's mostly working, but it's starting to feel a bit hacky at the moment :(

pareeohnos commented 7 years ago

Question in regards to the expected output. In some of the tests, the expected result has all HTML converted to AST, even if it was in raw HTML to start with. However, in one of the tests (footnotes_test) the result it is expecting contains the raw &nbsp; value. Is this correct? Given that the AST should contain unconverted HTML, I'm not sure if this is correct?

Note that this applies to any HTML entity starting with &, such as "&#x21A9; again in the footnotes_test module. The HTML renderer escapes all of these so that they persist in the final output, however to convert them into AST that wasn't really possible (although I could easily enough change the escape function to only escape the entities)? Not sure what your preference is here on what to do.

I'm inclined to escape the HTML entities so they remain intact, as once in AST, it would be impossible to know if a space should be converted to an &nbsp; entity or not, and given it's not an HTML element I don't think it should really be returned as an AST tuple either.

RobertDober commented 7 years ago

Sorry for the delay, quite busy now, i'll take a look at this anomaly to eradicate it if necessary. However my first prio right now is to check if we are green with Elixir 1.5.

pareeohnos commented 7 years ago

no problem :) yeah that's fine, I've mostly sorted it now. Down to 6 failing tests :)

RobertDober commented 7 years ago

Well, naïvely, I would expect the AST to have legal HTML nodes. That implies of course that the text is escaped before creating the AST.

Furthermore, IIRC we do not provide an AST --> HTML converter for now, all we will be doing is to expose the AST in a way that makes it easy to convert it to text, or did you plan to include such a converter in your PR?

Unless the latter is the case I strongly advocate for escaping before AST exposure.

Does that make sense?

pareeohnos commented 7 years ago

Yeah that was my thought, and that's what I went with. All entities are being esaped, all normal HTML is being left as is, and then parsed into AST using Floki.

I was going to give 2 PR's I think. The first being the generation of the AST, and the second being one that converts that to HTML.

pareeohnos commented 6 years ago

So REALLY long time of silence (sorry!) and I'm having another look at this. I actually got about 95% implemented but had an issue where a couple of tests were always failing, because the changes I was making would always be wrong for either HTML or AST and I never found a nice solution.

Having briefly thought about this, I think a slightly larger refactor might be needed for this to work without it being a bit of a hack. Right now modules like the Earmark.Inline module assume that the output is going to be HTML, and are therefore geared to outputting only HTML. Right now I worked around this by adding extra arguments that set whether or not it is AST, but this to me feels really quite hacky, and also prevents any future renderers being added easily as you'd be in the exact same situation.

So I propose a larger change to implement the AST renderer - allow custom renderers to be defined. Similar to how Earmark allows plugins to be defined by the user, this could be extended to allow custom renderers. Earmark could either ship with HTML/AST built in, or as separate repo's that are dependencies, either way would work. But instead of Earmark assuming that HTML is the output, it instead just hands it to a renderer of whatever type the developer likes.

Obviously 99% of the time this will probably just be HTML, but you never know. Now I want AST, later someone might want to use it to render something else - perhaps have a markdown to PDF converter for instance (god knows why) but you get my point. It would effectively turn Earmark into a markdown-to-whatever converter rather than just HTML.

Thoughts?

RobertDober commented 6 years ago

Sounds all quite reasonable to me, it does not really matter that 99% of the renderes will be rendering HTML, what I really like is to expose the AST so that custom renderes can be easily be written and modified too.

Just recently @claytongentry exposed a way too close dependency between the built in HtmlRenderer and the code it uses, e.g. Earmark.Inline in #188. In order to remove this connascence an exposed AST would just be great.

I would love to see all the work you have already done, I just created an issue branch, in which I would love to test/help with your ideas.

I therefore encourage you to PR against this branch

pareeohnos commented 6 years ago

I’ll setup a PR shortly. Although if the main goal initially is to expose an AST then it might be more straight forward to instead expose the internal parsed structure. Whilst it’s not strictly an AST it does describe the structure that needs to be output and could be used in exactly the same was as an AST and would also remove the need for a step to turn this structure into AST. Any parser could then just be given this to do whatever it likes.

Theoretically the main change that would be required as well would be to change the Inline module and others to return new structs instead of HTML.

Sent with GitHawk

pareeohnos commented 6 years ago

ok @RobertDober I THINK I'm basically done with this. All green with AST and HTML rendering. Will setup a new PR for you to look at

RobertDober commented 5 years ago

It is a little bit sad, that it took me more than 2 years, but eventually it is done.

Earmark.as_ast has been merged into master, 1.4 release upcoming this week, I hope.

pareeohnos commented 5 years ago

ah awesome! Nice one