Open tocic opened 1 month ago
I tried to make a solution for the general case and came up with the patch below. But it turned out to be more difficult than I expected :).
I downloaded this page and used the following grep to get a list of possible sections:
grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/[^#]+#\K[^"]+' 14882_\ Index.html > sections.txt
As you can see, there are many interesting examples that we don't support yet, especially the ones that end with `)` and `-`, because we can't easily disambiguate them from the surrounding punctuation. For example,
... (§[defns.block]¶def:block_(execution)).
should be displayed as
I don't think we can solve that with a single regex (but maybe we can postprocess the match).
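To make the postprocessing idea concrete, here is a minimal Python sketch. It is only an illustration, not the patch: `CANDIDATE` and `trim_reference` are hypothetical names, and the trimming rule (drop trailing sentence punctuation and any unbalanced closing parenthesis) is my assumption about what "postprocess" could mean here.

```python
import re

# Greedy candidate: take everything up to the next whitespace after the pilcrow.
CANDIDATE = re.compile(r"§\[[^]]+\](?:¶\S*)?")

def trim_reference(raw: str) -> str:
    """Hypothetical helper: strip trailing punctuation and unbalanced ')'."""
    while raw:
        last = raw[-1]
        if last in ".,;:?!":
            # Assume trailing sentence punctuation belongs to the prose.
            raw = raw[:-1]
        elif last == ")" and raw.count(")") > raw.count("("):
            # A ')' with no matching '(' inside the reference is surrounding text.
            raw = raw[:-1]
        else:
            break
    return raw

text = "... (§[defns.block]¶def:block_(execution))."
raw = CANDIDATE.search(text).group(0)
print(raw)                  # the over-greedy match, including the trailing ")."
print(trim_reference(raw))  # the trimmed reference
```

The obvious weakness is that a section whose name really ends in `.` or `:` would be trimmed too far, so this trades one class of unsupported references for another.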
But there are only 60 `\W$` hits out of 5238 lines in generalindex, 0/3502 in grammarindex, 0/492 in headerindex, 678/9520 in libraryindex, 0/1940 in conceptindex, and 4/337 in impldefindex. So what do you think about supporting only those sections that end with a word character (`\w`)? Maybe allowing the chars `+-=*~` is also safe (leaving only 39 examples in generalindex and 90 examples in libraryindex unsupported).
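The counting behind those numbers can be sketched in Python roughly as follows. The function names and the sample list are hypothetical (apart from `def:block_(execution)`, the entries are placeholders rather than real lines from sections.txt), and Python's `\w` is Unicode-aware, so the counts may differ slightly from `grep -P`.

```python
import re

def is_supported(section: str, allowed_extra: str = "") -> bool:
    """True if the section ends with a word character or an explicitly allowed one."""
    last = section[-1:]
    return bool(last) and (re.fullmatch(r"\w", last) is not None or last in allowed_extra)

def count_unsupported(sections, allowed_extra: str = "") -> int:
    """Count the sections that the 'must end with a safe character' rule would reject."""
    return sum(not is_supported(s, allowed_extra) for s in sections)

# Placeholder data standing in for the contents of sections.txt.
sample = [
    "def:block",
    "def:block_(execution)",
    "placeholder-",
    "placeholder+",
    "placeholder()",
]

print(count_unsupported(sample))           # only word-character endings allowed
print(count_unsupported(sample, "+-=*~"))  # additionally allow + - = * ~
```

With the real file, `sample` would be replaced by the lines read from sections.txt.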
Thanks for investigating all of this! I'm very busy this week so don't have time to dig into it. Something that supports most cases and isn't too complicated sounds good to me. But I'd like to see some examples of the types of links we match, and some things we shouldn't match.
However, instead of regular unit tests, it can often be nice to just have a comment in the code with a link to regex101, like this https://regex101.com/r/Wxld8e/2
I believe we should parse the refs first and only then convert markdown to html. In that case, this is the list of possible refs we might encounter.
# Collect every link target from the six index pages, decoding HTML entities and percent-escapes.
tmp=''
for index in general grammar header library concept impldef; do
  tmp+=$(grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/\K[^"\s]+(?=")' "${index}.html" | python3 -c "import sys,html,urllib.parse; print(urllib.parse.unquote(html.unescape(sys.stdin.read())))")
  tmp+=$'\n'
done
# Turn "section#anchor" into "§[section]¶anchor", deduplicate, and drop the final newline (GNU head).
echo "${tmp%?}" | sed -e 's/^/§[/' -e 's/#/]¶/' | sort -u | head -c -1 > possible_refs.txt
And this is how the suggested regex performs on the dataset (1711/20945 are unsupported):
And here are the characters I think we can safely support:
§\[[^]]+\](?:¶\S*[\w&></^|+%=~*-])?
This leaves 158/20945 unsupported (https://regex101.com/r/hf2P3X/2).
Alternatively, we can try to enumerate the postfixes that a valid reference can't have, like this:
§\[[^]]+\](?:¶\S*[^\s:,;.?)](?<!\?!|""|\*\*|!!))?
This leaves 131/20945 unsupported (https://regex101.com/r/hf2P3X/3).
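For a quick look at what the two patterns do and don't match, here is a small Python sketch. Both regexes are copied from above and happen to be valid for Python's `re` module too (all lookbehind alternatives are two characters wide). The helper `first_ref` and the `§[x.y]` inputs are made up for illustration; they are not real sections.

```python
import re

# Allow-list variant: the reference must end with one of the listed characters.
ALLOW = re.compile(r'§\[[^]]+\](?:¶\S*[\w&></^|+%=~*-])?')
# Deny-list variant: the reference must not end with the listed punctuation.
DENY = re.compile(r'§\[[^]]+\](?:¶\S*[^\s:,;.?)](?<!\?!|""|\*\*|!!))?')

def first_ref(pattern: re.Pattern, text: str) -> str:
    """Hypothetical helper: return the first reference found in text, or ''."""
    m = pattern.search(text)
    return m.group(0) if m else ""

cases = [
    "see §[basic.life]¶1.",                         # trailing full stop is prose
    "use §[x.y]¶foo-, then continue",               # made-up ref ending in '-'
    "really §[x.y]¶foo!",                           # made-up ref ending in '!'
    "... (§[defns.block]¶def:block_(execution)).",  # ends with ')'
]

for text in cases:
    print(repr(first_ref(ALLOW, text)), repr(first_ref(DENY, text)))
```

Both variants drop the trailing `.` and keep the trailing `-`. Only the deny-list variant keeps the `!`, and both still cut `def:block_(execution)` short before the closing parenthesis, which is the kind of reference that stays unsupported.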
I prefer the first one. The second one matches more things but is just so hard to read. Want to make a PR?
We should support the symbol `-` in our references. For example, should not be rendered as