Support symbol '-' in standard_ref()

tocic commented 1 month ago

We should support the symbol - in our references. For example,

§[dcl.spec.auto]¶general-example-4

should not be rendered as

§[dcl.spec.auto]¶general-example-4

tocic commented 1 month ago

I tried to make a solution for the gerenal case and came up with the patch below. But it turned out to be more difficult than I expected :).

I downloaded this page and used the following grep to get a list of possible sections:

grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/[^#]+#\K[^"]+' 14882_\ Index.html > sections.txt

As you can see, there are many interesting examples that we don't support yet, especially the ones that end with ) and -, because we can't easily disambiguate them from the surrounding punctuation. For example,

... (§[defns.block]¶def:block_(execution)).

should be displayed as

... (§[defns.block]¶def:block_(execution)).

I don't think we can solve that with a single regex (but maybe we can postprocess the match).

But there are only 60 \W$ hits out of 5238 lines in generalindex, 0/3502 in grammarindex, 0/492 in headerindex, 678/9520 in libraryindex, 0/1940 in conceptindex, and 4/337 in impldefindex. So what do you think if we support only those sections that end with a word character (\w)? Maybe allowing the chars +-=*~ is also safe (leaving only 39 examples in generalindex and 90 examples in libraryindex unsupported).

click here to see the patch

```diff diff --git a/quiz/templatetags/quiz_extras.py b/quiz/templatetags/quiz_extras.py index 4b637ab..0295d9f 100644 --- a/quiz/templatetags/quiz_extras.py +++ b/quiz/templatetags/quiz_extras.py @@ -18,21 +18,36 @@ def format_reference(match): full_link = "https://timsong-cpp.github.io/cppwp/n4659/" + section_name if paragraph_number: full_link = full_link + "#" + paragraph_number - return "" + full_reference + "" + return f"*[{full_reference}]({full_link})*" def standard_ref(text): - section_name = u'(\[(?P[\w:]+(\.[\w:]+)*)\])' - possible_paragraph = u'(¶(?P\d+(\.\d+)*))*' - regex = re.compile('§(' + section_name + possible_paragraph + ')') - return re.sub(regex, format_reference, text) + regex = re.compile( + r''' + §\[ # every ref starts its section part with §[ + (?P # + [^]]+ # at least one character other than ] + ) # + \] # every ref ends its section part with ] + (?:¶ # ref may start an optional paragraph part with ¶ + (?P # + \S* # any number of non-whitespace chars + \w # supported sections must end with a word char + ) # + )? # + ''', + re.VERBOSE, + ) + return regex.sub(format_reference, text) @register.filter(needs_autoescape=True) def to_html(text, autoescape=None): return mark_safe( - standard_ref( - markdown.markdown(text, extensions=['nl2br', 'pymdownx.superfences']))) + markdown.markdown( + standard_ref(text), extensions=['nl2br', 'pymdownx.superfences'] + ) + ) ```

knatten commented 1 month ago

Thanks for investigating all of this! I'm very busy this week so don't have time to dig into it. Something that supports most cases and isn't too complicated sounds good to me. But I'd like to see some examples of the types of links we match, and some things we shouldn't match.

However, instead of regular unit tests, it can often be nice to just have a comment in the code with a link to regex101, like this https://regex101.com/r/Wxld8e/2

tocic commented 1 month ago

I believe we should parse the refs first and only then convert markdown to html. In that case, this is the list of possible refs we might encounter.

tmp=''
for index in 'general' 'grammar' 'header' 'library' 'concept' 'impldef'; do
  tmp+=$(grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/\K[^"\s]+(?=")' ${index}.html | python3 -c "import sys,html,urllib.parse; print(urllib.parse.unquote(html.unescape(sys.stdin.read())))")
  tmp+=$'\n'
done

echo "${tmp%?}" | sed -e 's/^/§[/' | sed -e 's/#/]¶/' | sort | uniq | head -c -1 > possible_refs.txt

possible_refs.txt

And this is how the suggested regex performs on the dataset (1711/20945 are unsupported):

https://regex101.com/r/hf2P3X/1

tocic commented 1 month ago

And here are the characters I think we can safely support:

§\[[^]]+\](?:¶\S*[\w&></^|+%=~*-])?

, leaving 158/20945 unsupported (https://regex101.com/r/hf2P3X/2).

Alternatively, we can try to enumerate the postfixes that the valid reference can't have like this:

§\[[^]]+\](?:¶\S*[^\s:,;.?)](?<!\?!|""|\*\*|!!))?

, leaving 131/20945 unsupported (https://regex101.com/r/hf2P3X/3).

knatten commented 3 weeks ago

I prefer the first one, the second one matches more things but is just so hard to read. Want to make a PR?

knatten / cppquiz

Support symbol '-' in standard_ref() #344