Open mpacer opened 8 years ago
From this discussion re: LaTeX labels, if we were to be the most conservative, we should avoid `&`, `_`, `^`, `%`, `~`, `$`, `#`, `\`, `{`, and `}`.
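For concreteness, here is a minimal sketch of that most-conservative rule. The names `LATEX_SPECIAL_CHARS` and `strip_latex_special` are hypothetical and not part of nbconvert; the sketch simply deletes the ten characters listed above and leaves everything else alone.

```python
# Hypothetical helper, not an nbconvert API: drops the ten LaTeX
# special characters listed above and leaves everything else as-is.
LATEX_SPECIAL_CHARS = "&_^%~$#\\{}"

# str.maketrans with a third argument builds a table that deletes those characters.
_STRIP_TABLE = str.maketrans("", "", LATEX_SPECIAL_CHARS)


def strip_latex_special(text):
    """Return text with every character in LATEX_SPECIAL_CHARS removed."""
    return text.translate(_STRIP_TABLE)
```

Note that this only covers the LaTeX side; other output formats may impose further constraints.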
Based on pandoc's solution, we should just remove all punctuation except for `_`, `-`, and `.`.
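A rough sketch of just that punctuation step, using only the standard library. The name `remove_punctuation` is hypothetical, and this is not pandoc's implementation: it keeps alphanumerics, whitespace, and the three allowed punctuation characters, and it does not attempt the rest of identifier generation (lowercasing, replacing spaces, and so on).

```python
# Hypothetical helper illustrating only the punctuation rule described above.
KEPT_PUNCTUATION = "_-."


def remove_punctuation(text):
    """Drop every character that is not alphanumeric, whitespace, or one of _ - ."""
    kept = []
    for ch in text:
        # str.isalnum() is Unicode-aware in Python 3, so non-ASCII letters survive.
        if ch.isalnum() or ch.isspace() or ch in KEPT_PUNCTUATION:
            kept.append(ch)
    return "".join(kept)
```

For example, `remove_punctuation("What's new in v2.0?")` gives `"Whats new in v2.0"`.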
So the open question is whether we should keep `_`, since that can cause problems in the context of some LaTeX packages (especially, apparently, `\usepackage{underscore}`).
My default would be to include `_`, just to keep as close to pandoc's standard as possible.
After taking time to dive into the pandoc codebase and thinking about these regex problems more generally, I'm going to suggest using the `regex` package (mostly to future-proof for non-Latin scripts). It has access to Unicode character property classes, which means we can reach parity with pandoc's Unicode compatibility without hard-coding code point ranges, which seems like the other viable solution.
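To illustrate what the property classes buy us, here is a hedged sketch. It assumes the third-party `regex` package is installed (`pip install regex`); if it is not, the sketch falls back to `unicodedata` general categories from the standard library so that it still runs. The name `clean_label` is hypothetical, and the exact set of classes to keep is an assumption, not what the linked branch does.

```python
import unicodedata

try:
    import regex  # third-party package; supports \p{...} Unicode property classes
except ImportError:
    regex = None  # fall back to unicodedata below

KEPT_PUNCTUATION = "_-."

if regex is not None:
    # Match anything that is NOT a letter (\p{L}), a number (\p{N}),
    # whitespace, or one of the kept punctuation characters.
    # Combining marks (\p{M}) are not kept in this sketch.
    _DISALLOWED = regex.compile(r"[^\p{L}\p{N}\s_.\-]+")


def clean_label(text):
    """Hypothetical helper: strip punctuation while keeping non-Latin letters."""
    if regex is not None:
        return _DISALLOWED.sub("", text)
    # Fallback: general categories starting with "L" (letters) or "N" (numbers).
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch)[0] in "LN"
        or ch.isspace()
        or ch in KEPT_PUNCTUATION
    )
```

The point of the property classes is that Cyrillic, CJK, and other scripts are handled by the same short pattern, with no code point ranges written out by hand.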
This addresses the punctuation issue: https://github.com/michaelpacer/nbconvert/tree/anchors_away
We need to refine our autogeneration of ids/labels for headers. This is one issue related to this problem, focusing on punctuation.
Presumably the goal should be to enforce a set of constraints that allow a single tag to work across all of our basic output formats.
This means our ids/labels will need to adhere to the union of all the constraints… I'm going to use this issue to begin gathering those together.