Punctuation handling in header ids and references: toward a standard and implementation

jupyter / nbconvert

Jupyter Notebook Conversion

https://nbconvert.readthedocs.io/

BSD 3-Clause "New" or "Revised" License


Punctuation handling in header ids and references: toward a standard and implementation #471

Open mpacer opened 8 years ago

mpacer commented 8 years ago

We need to refine our autogeneration of ids/labels for headers. This is one issue related to this problem, focusing on punctuation.

Presumably the goal should be to enforce a set of constraints that lets a single id/label work across all of our basic output formats.

This means our ids/labels will need to adhere to the union of all the constraints… I'm going to use this issue to begin gathering those together.

mpacer commented 7 years ago

From this discussion re: LaTeX labels, if we want to be as conservative as possible, we should avoid &, _, ^, %, ~, $, #, \, {, and }.
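To make the conservative LaTeX rule concrete, here is a minimal sketch that flags or strips the characters listed above. The helper names (`latex_unsafe_chars`, `strip_latex_specials`) are hypothetical and are not part of nbconvert.

```python
# Characters to avoid in LaTeX labels, taken from the list above.
LATEX_SPECIALS = set("&_^%~$#\\{}")


def latex_unsafe_chars(label):
    """Return the sorted list of LaTeX-special characters found in label."""
    return sorted(set(label) & LATEX_SPECIALS)


def strip_latex_specials(label):
    """Return label with every LaTeX-special character removed."""
    return "".join(ch for ch in label if ch not in LATEX_SPECIALS)
```

For example, `latex_unsafe_chars("a_b&c")` reports both `&` and `_`, which is why the underscore question matters.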

Based on pandoc's solution, we should just remove all punctuation except for underscores (_), hyphens (-), and periods (.).

So the open question is whether we should keep _, since it can cause problems with some LaTeX packages (especially, it seems, \usepackage{underscore}).

My default would be to include _ just to keep as close to pandoc's standard as possible.
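A rough sketch of the pandoc-style rule with _ kept, using only the standard library `re` module. The function name `slugify_header` is hypothetical. Collapsing runs of whitespace into a single hyphen and lowercasing are assumptions added for illustration, so this may not match pandoc's output exactly.

```python
import re


def slugify_header(text):
    """Build a header id: drop punctuation except _, - and ., then hyphenate."""
    # \w matches letters, digits and underscore (Unicode-aware for str in
    # Python 3), so this removes all other punctuation but keeps "." and "-".
    cleaned = re.sub(r"[^\w\s.-]", "", text)
    # Assumption: collapse each run of whitespace into a single hyphen.
    cleaned = re.sub(r"\s+", "-", cleaned.strip())
    # Assumption: lowercase the result.
    return cleaned.lower()
```

With this sketch, `slugify_header("What's new in v1.2?")` keeps the period in the version number while dropping the apostrophe and question mark.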

mpacer commented 7 years ago

After taking time to dive into the pandoc codebase and thinking about these regex problems more generally, I'm going to suggest using the regex package (mostly for future-proofing for non-Latin scripts). It has access to Unicode character property classes, so we can reach parity with pandoc's Unicode handling without hard-coding code point ranges, which seems to be the only other viable solution.
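As an illustration of what property classes buy us, here is a sketch that keeps letters, combining marks, numbers, whitespace, and the three allowed punctuation characters. The function name `strip_punctuation` is hypothetical, and the choice of categories to keep (L, M, N) is an assumption rather than a statement of what pandoc does. The fallback branch uses the standard library `unicodedata` module so the sketch still runs when the third-party regex package is not installed.

```python
import unicodedata

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None

KEEP = "_-."


def strip_punctuation(text):
    """Remove everything except letters, marks, numbers, whitespace, _ - and ."""
    if regex is not None:
        # \p{L}, \p{M}, \p{N} are Unicode property classes (letter, mark,
        # number); no code point ranges need to be hard-coded.
        return regex.sub(r"[^\p{L}\p{M}\p{N}\s_.\-]", "", text)
    # Fallback: check the first letter of each character's general category.
    return "".join(
        ch
        for ch in text
        if ch in KEEP or ch.isspace() or unicodedata.category(ch)[0] in "LMN"
    )
```

Either branch treats Cyrillic, CJK, or accented Latin headers the same way as ASCII ones, which is the future-proofing in question.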

mpacer commented 7 years ago

This addresses the punctuation issue: https://github.com/michaelpacer/nbconvert/tree/anchors_away