Closed GoogleCodeExporter closed 9 years ago
I think there are some misunderstandings here..
> [0-9a-f-A-F] is equivalent to \w+
no, because \w matches _ and A-Za-z
> the group cannot be captured anyways due to RegexOptions.ExplicitCapture
The match still occurs, you just can't get to it by saying $1 or $2 or $3 etc
> w is not a valid html escape sequence
have you tried it? produces "w" in Firefox 3.5, IE8, Chrome3..
I don't think there's any evidence this regex is wrong, as it is part of the
core
Markdown.pl library. Can you provide actual test cases where it produces
incorrect
results?
Original comment by wump...@gmail.com
on 31 Dec 2009 at 8:04
I may be stupid but I do not find the line
> [0-9a-f-A-F] is equivalent to \w+
in the issue at all... Rather I am saying that [0-9a-fA-F] is INCLUDED in \w,
therefore [0-9a-fA-F]+ is included in \w+ as well.
Assuming this is correct, the union of those sets can be computed without even
knowing the sets, since the union of a set s with a subset of s is
always s. This allows us the following transform:
([0-9a-fA-F]+|\w+) => (\w+)
According to Microsoft ( http://msdn.microsoft.com/en-us/library/20bw873z.aspx
) the \w character class contains "Letter, Uppercase", "Letter,
Lowercase" and "Number, Decimal Digit" - so I am quite certain that the
transform is in deed equivalent to the original regular expression.
Concerning the group: I never said anything about the match not occuring, I
said "the group cannot be captured anyways", which I would consider
being quite true for a non-capturing group... RegexOptions.ExplicitCapture
allows the usage of unnamed parentheses as non-capturing groups, see
there:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexopti
ons.aspx
Since we are dealing with a unnamed group, it is non-capturing under this
option. It also does not do any "grouping" with respect to operators in
this case, since it only contains \w+, so it can simply be left out, dealing
the transform:
(\w+) => \w+
About the escape sequences:
After looking through the html spec, it seems I was indeed mistaken and
remembered it exactly the wrong way around:
The HTML 4.01 spec says in section 5.3 (
http://www.w3.org/TR/1999/REC-html401-19991224/charset.html ) allows the
following two representations in
addition to named entities:
"The syntax "&#D;", where D is a decimal number, refers to the ISO 10646
decimal character number D.
The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the
ISO 10646 hexadecimal character number H. Hexadecimal numbers in
numeric character references are case-insensitive."
XHTML 1.0 ( http://www.w3.org/TR/2002/REC-xhtml1-20020801/ ) notes in section
4.12:
"SGML and XML both permit references to characters by using hexadecimal values.
In SGML these references could be made using either &#Xnn; or
&#xnn;. In XML documents, you must use the lower-case version (i.e. &#xnn;)"
Therefore, in my opinion, the "correct" way to formulate the inner part of the
regex is:
(#[xX]?)?\w+;
However, it does not really matter anyway, since this essentially means:
{all strings consisting only of word characters} \ {"#;", "#x;", "#X;"}
more interesting is the following version, that follows the specs rather
closely:
(#[0-9]+|([xX][a-fA-F0-9]))|([a-zA-Z0-9]+)
Named character entities in HTML 4.01 (
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/entities.html ) and XHTML 1
(
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#h-A2 ) specs consist of
english letters (the first is oftentimes uppercase) and in some
cases numbers (e.g. <!ENTITY frac34 "¾"> <!-- vulgar fraction three quarters =
fraction three quarters, U+00BE ISOnum --> )
"I don't think there's any evidence this regex is wrong, as it is part of the
core Markdown.pl library."
The first part is not about it being "wrong" in any kind, just about doing
equivalent transforms that do not change the meaning, while making it
more easy to understand just what it does.
The second part however, is about what the regex is supposed to do: Find
ampersands that are not part of an html escape sequence. The idea seems to
be to escape any ampersand that is supposed to be one, while still allowing the
use of escape sequences. The number of false positives will never
be zero due to the simple fact that the different specs for (x)html differ on
this point, but it CAN be reduced. If the design intention is to not
escape all ampersands that are part of something that might be mistaken for an
escape sequence: well, that is a design decision...
To ensure that I am not mistaken about html escape sequences again, I have
tested &x77; in all browsers on my system (chrome, firefox, safari, ie
and opera) and I DO get &x77; if I use that "escape sequence" - and not a "w".
Also, with the original regex &x77; is not escaped...
Original comment by Shio...@gmail.com
on 31 Dec 2009 at 9:28
I see your point on ([0-9a-fA-F]+|\w+) and I agree the other form is more
accurate.
But there is a documentation reason it is written that way..
Two clarifications:
1) You know this is a *negative* lookahead, yes?
&(?!)
meaning, match '&' but ONLY if the rest isn't matched ahead of it.
2) it can also negatively match … not just w and m
I think the |\w+ was an attempt to document the the … case. Though as you
pointed out I doubt any named entities have _ in them, or numbers for that
matter.
As I've said in other requests to change the regexes, there are only two
reasons to
do so:
a) performance improves
b) fixes a bug
change for the sake of change is bad because
c) it makes us deviate from the Perl standard reference implementation regexes
which would be OK if a) or b) were true.. but they are not in this case.
Original comment by wump...@gmail.com
on 1 Jan 2010 at 12:48
Original comment by wump...@gmail.com
on 1 Jan 2010 at 12:52
1) Of course.
2) The point being that it DOES negatively match &x77; - although it should
not.
Named entities do have numbers, (sadly) there is i.e. frac34...
The amount of false negatives you get it humungous, but I can understand
implementing
a regex like (nbsp|cent|pound|copy|brvbar|......) would be most annoying (and
slow,
too). However, there are no entities starting with a number, nor are there any
starting with a hash (except for the two types of numeric ones of course).
Therefore
@"&(?!(#[0-9]+)|(#[xX][a-fA-F0-9])|([a-zA-Z][a-zA-Z0-9]*);)"
drastically reduces the number of false negatives (being false positives of the
lookahead). Maybe most importantly it catches &x77; or &172; ones, which might
happen.
Therefor I would indeed consider this to be of kind b) - however, if being
close to
the perl implementation is more important than catching these kind of errors,
well,
that is a design decision ;)
Original comment by Shio...@gmail.com
on 1 Jan 2010 at 1:27
I appreciate the contribution.
I will definitely keep this in mind -- the goal at the moment is to match
markdown.pl
1.0.2b8 very closely. If in the future we start deviating (and we might, since
John
Gruber seems utterly uninterested in any public work on this since 2004), then I
agree this is a better and more accurate regex.
Still kind of a narrow and extremely minor "bug", though. I'd prefer some help
with a
bigger and much more severe bug -- the HTML block matching algorithms in
1.0.2b8 :)
Original comment by wump...@gmail.com
on 1 Jan 2010 at 2:01
ok, now that we have been forced to absorb parts of PHP Markdown (because it is
better maintained), and thus deviating from the "canon" of the 1.0.2b8 Perl
version
.. I am open to making this small change. reopening
Original comment by wump...@gmail.com
on 6 Jan 2010 at 11:47
checked into r81
Original comment by wump...@gmail.com
on 7 Jan 2010 at 12:13
Original issue reported on code.google.com by
Shio...@gmail.com
on 31 Dec 2009 at 6:32