pragdave / earmark

Markdown parser for Elixir
Other
860 stars 135 forks source link

Thoughts into the Earmark.as_ast #300

Closed josevalim closed 4 years ago

josevalim commented 4 years ago

So it turns out that, in an interesting plot of destiny, ExDoc may end-up using the as_ast API. So thanks for working on that @RobertDober!

The reason is because we are looking into rendering documentation from other languages and they likely won't provide Markdown but likely an HTML AST.

We are not sure how their AST will look like but I wanted to start this conversation before. In any case, it is most likely:

{:a, [href: "..."], "text"}

Therefore, in order to standardize, it would be nice to have an option to return atoms for tags and attributes. Note I wouldn't change the default, since returning binaries is the safest option for Earmark. But if you do add said option, then {:comment, [], []} as today is ambiguous, so we would probably need to stick with Floki's API.

However, none of this is confirmed, so please hold on working on it. I just wanted to get the conversation started and maybe ask if you can hold until the final version of the as_ast API is released until we confirm the format of our choice.

Thank you for maintaining this very important lib!

RobertDober commented 4 years ago

Great, that is why I have introduced the AST slowly and as experimental. My roadmap is to have a stable AST with 1.5 or 1.6. (1) and there is no hurry.

I love to have that kind of feedback, and I am looking forward to more exchanges on this. I could definitely render tags as atoms , I would have done it by default, had it not been for Floki.

I also have some ideas about resolving the ambiguity with {:comment, [], []} e.g. using {:comment, "string"} or {nil, ...} for comments, although it is ugly somehow.

(1) I am thinking about using odd minor versions for unstable releases and even for stable ones in the future, à la Node.js, what do you think about that?

josevalim commented 4 years ago

Great!

I would have done it by default, had it not been for Floki.

And generating atoms from user input is not the safe too. So even if someone does HTML sanitization, if you generated atoms by default, they would be vulnerable to atom leakage attacks.

(1) I am thinking about using odd minor versions for unstable releases and even for stable ones in the future, à la Node.js, what do you think about that?(1) I am thinking about using odd minor versions for unstable releases and even for stable ones in the future, à la Node.js, what do you think about that?

This is not supported by SemVer and it is nothing something that ~> 1.2 in Elixir would know about, so I don't think it would be a good fit.

pragdave commented 4 years ago

Although how likely is an atomic attack? (Sorry)

In order to do damage, they'd have to feed megabytes of stuff into the code. That would either be from local files (in which case they already have access) or (if remotely) over HTTP. So simply limiting upload sizes would ameliorate that vector.

If instead they went to botnet approach, then I'm thinking they'd kill the app from CPU usage well before we ran out of atom storage.

I agree it's important to consider these things. But if atoms are actually a better solution in this case (and I don't know if they are), then I wouldn't let a hypothetical DOS attack get in the way.

Dave

On Thu, Oct 31, 2019 at 9:03 AM José Valim notifications@github.com wrote:

Great!

I would have done it by default, had it not been for Floki.

And generating atoms from user input is not the safe too. So even if someone does HTML sanitization, if you generated atoms, they would be vulnerable to atom leakage attacks.

(1) I am thinking about using odd minor versions for unstable releases and even for stable ones in the future, à la Node.js, what do you think about that?(1) I am thinking about using odd minor versions for unstable releases and even for stable ones in the future, à la Node.js, what do you think about that?

This is not supported by SemVer and it is nothing something that ~> 1.2 in Elixir would know about, so I don't think it would be a good fit.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pragdave/earmark/issues/300?email_source=notifications&email_token=AAACTGGR7CCTFMEMYS5FIKLQRLQUNA5CNFSM4JHJW2ZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECX4N3A#issuecomment-548390636, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACTGD3TD56EIMHJAXYLXTQRLQUNANCNFSM4JHJW2ZA .

OvermindDL1 commented 4 years ago

For note, I use earmark to handle markdown for a messaging system, so I definitely want it user-markdown-safe by default and anything that breaks that to be super well documented as such. ^.^

RobertDober commented 4 years ago

nice pun Dave

I agree it's important to consider these things. But if atoms are actually a better solution in this case (and I don't know if they are), then I wouldn't let a hypothetical DOS attack get in the way.

we could even close the vector completely when parsing raw HTML by just checking for valid html tags or even simpler, counting different tags and stopping to parse when we hit, say, 4242 ;)

RobertDober commented 4 years ago

@OvermindDL1 I am afraid user provided Markdown will never be safe :cry:

As far as I know, Earmark does a good job as much as most markdown converters to not to do anything stupid, but beyond that...

josevalim commented 4 years ago

Hi @pragdave! I have decided to do some math:

The default atom limit is 1048576.

A combination of one letter plus letter+digits four times is enough to generate:

26 * 36 * 36 * 36 = 1213056 combinations

Assuming each combination is four bytes plus </> to terminate the tag, we need 7 bytes * 1048576 to exhaust the atom list, which is equivalent to a 7MB markdown document filled with HTML tags.

So even if you limit the upload to 100Kb (low number), you can still kill a server in roughly 70 requests. :D

RobertDober commented 4 years ago

@josevalim I can make atoms safe if you prefer them, and if not we will keep strings, let's see how it develops on ex_doc's side but just to take your mind off the safety concern.

josevalim commented 4 years ago

we could even close the vector completely when parsing raw HTML by just checking for valid html tags or even simpler, counting different tags

We could definitely check tags but no such luck when it comes to attributes. In my mind, the choice you did right now is the best. The atom should be gated behind a flag, as it is unsafe (or we do a pass on ExDoc side to turn everything from strings to atoms, that's not a problem either)!

pragdave commented 4 years ago

The default atom limit is 1048576.

A combination of one letter plus letter+digits four times is enough to generate:

26 36 36 * 36 = 1213056 combinations

I hate it when you're logical... :)

As an aside, why can't atoms be weak references? If an atom was an MD5 of its name (we don't need to be secure), and the atom table cached the last (say) 100,000 name/hash pairs, with some simple LRU purging, then we could just regenerate the hash if the previously cached value had been flushed.

josevalim commented 4 years ago

I think it is a solvable problem (atoms being weak references) so it is a matter of having someone do or prioritize the work (it is doable but not trivial).

RobertDober commented 4 years ago

@josevalim as a matter of fact attributes are a real problem now, because they are stored as Keyword lists :fearful: .

So we definitely have to address the problem before making as_ast the default.

I knew how stupid I am, but I wanted to hide it from you :sob:

iex(13)> x = ~s{<alpha x="b">}
"<alpha x=\"b\">"
iex(14)> Earmark.as_ast(x)
{:error, [{"alpha", [{"x", "b"}], []}],...
josevalim commented 4 years ago

Hi everyone, the AST format that Erlang choses contains atoms, but after looking at their AST, I am certain we will have to pre-process it, so it makes sense we have to pre-process Earmark AST too. Therefore, I am officially withdrawing this request. I actually like that, by simply returning the AST, Earmark can stay thin and just let people do whatever they want.

In any case, I would still suggest to make the comment node look like {:comment, node}, to make it easier to distinguish from regular tags. I will open up an issue for it. Otherwise implementations like this, will end-up outputting comments, and they most likely need to discard it instead.

Thanks!

RobertDober commented 4 years ago

Makes perfect sense to distinguish by shape instead of type only.

My idea was that processing will become easier by using the same shape, but @josevalim is right (as so often) that instead of facilitating processing it facilitates errors.

I will open the issue myself @josevalim thank you for your input.