tscanlin / tocbot

Build a table of contents from headings in an HTML document.
MIT License

Using `headingObjectCallback` doesn't modify ID with `createLink`? #314

Closed E-VANCE closed 8 months ago

E-VANCE commented 8 months ago

Hi there,

thanks for providing this useful and lean tool to automatically build a TOC from our HTML content – very handy and much appreciated!! 👏

I just had the case of needing to rewrite the underlying object IDs in order to clean them up / sanitize them, because the default URL syntax that would be produced includes a number of 'illegal' characters. On top of that, we wanted to make sure the anchor links aren't too long, so I am also capping them after a certain number of chars.

While this generally works fine using the headingObjectCallback I am facing an issue with this procedure as it doesn't modify the actual heading IDs and thus breaks the anchor linking...

Looking into the createLink-function (LINK) it seems that there currently is no way of passing and using the newly modified object with the sanitized and shortened IDs?

Do you see a way of unifying the modification behaviour so that it also adjusts the links / IDs, or is there maybe another approach to make this happen?

This is the TOCBot config we're using including the custom callback:

tocbot.init({
  // See https://github.com/tscanlin/tocbot#options
  tocSelector: '.js-toc',
  contentSelector: '.js-toc__content',
  headingSelector: 'h2, h3, h4, h5',
  collapseDepth: 6,
  orderedList: false,
  headingObjectCallback: function(object, HTMLElement) {
    // Sanitize heading by mimicking WP's sanitize_title()
    var sanitizedTitle = object.textContent.toLowerCase().trim();
    sanitizedTitle = sanitizedTitle
      .replace(/ö/g, 'oe')
      .replace(/ä/g, 'ae')
      .replace(/ü/g, 'ue')
      .replace(/ß/g, 'ss');
    sanitizedTitle = sanitizedTitle.replace(/\s+/g, '-');
    sanitizedTitle = sanitizedTitle.replace(/[^a-z0-9-]/g, '');

    // Assign sanitized title to object as new ID
    object.id = sanitizedTitle;

    // Cap the title at 50 chars
    if(sanitizedTitle.length > 50) {
      var truncatedString = sanitizedTitle.substring(0, 50);
      var lastHyphenIndex = truncatedString.lastIndexOf('-');

      if(lastHyphenIndex !== -1) {
        truncatedString = truncatedString.substring(0, lastHyphenIndex);
      }

      // Assign truncated title to object as new ID
      object.id = truncatedString;
    }

    return object;
  },
});

Thanks & regards!

tscanlin commented 8 months ago

Hey!

Happy to try and help with this. In this case I would say the headingObjectCallback isn't really meant for that: as you say, it doesn't modify the source, and in general that is something tocbot stays away from, since that content is out of our control and may be managed by other things.

Anyway, one option we do provide is the make-ids.js util, which people can use to add ids to elements if they are missing. In this case you would probably need to adjust the logic where it reuses the heading id so that it cleans the id up. Maybe something like below, though it might need more sanitization depending on your use case. Let me know if you are able to figure something out, I would love to make this script better for others too. If you have time for a PR that would be awesome; if not, just share whatever code helps to sanitize.

var id = heading.id
  ? heading.id
  : heading.textContent
id = id.trim().toLowerCase()
  .split(' ').join('-')
  .replace(/[!@#$%^&*():]/ig, '')
  .replace(/\//ig, '-')

https://github.com/tscanlin/tocbot/blob/master/src/utils/make-ids.js

I am also open to helping if you can give me a sample page/site to check out or can share some of the content / examples you are dealing with.

Also, if you do use that script be sure to call it before tocbot.init. Anyway, hope that helps!
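A minimal sketch of the "rewrite the ids on the headings first, then call tocbot.init" approach described above. The helper names `capSlug` and `assignHeadingIds` are hypothetical and not part of tocbot or make-ids.js. The sketch assumes each id is derived from the heading text, as in the original callback, rather than reusing an existing id, and it leaves out the umlaut handling. It operates on any iterable of objects with `textContent` and `id`, so the same function should work on a NodeList in the browser.

```javascript
// Hypothetical helper: cap a slug at maxLength characters, cutting back
// to the last hyphen so the id does not end in the middle of a word.
function capSlug(slug, maxLength) {
  if (slug.length <= maxLength) return slug;
  const truncated = slug.substring(0, maxLength);
  const lastHyphen = truncated.lastIndexOf('-');
  return lastHyphen !== -1 ? truncated.substring(0, lastHyphen) : truncated;
}

// Hypothetical helper: write a sanitized, capped id onto each heading.
// Assumption: the id is always rebuilt from the heading text.
function assignHeadingIds(headings, maxLength) {
  for (const heading of headings) {
    const slug = heading.textContent
      .trim()
      .toLowerCase()
      .replace(/\s+/g, '-')          // runs of whitespace become one hyphen
      .replace(/[^a-z0-9-]/g, '');   // drop everything outside a-z, 0-9, '-'
    heading.id = capSlug(slug, maxLength);
  }
  return headings;
}

// Plain objects stand in for heading elements here.
const demoHeadings = [
  { id: '', textContent: '  Hello   World!  ' },
  { id: 'old-id', textContent: 'one two three four' },
];
assignHeadingIds(demoHeadings, 10);
console.log(demoHeadings.map((h) => h.id));

// In the browser this would run before tocbot.init, for example:
// assignHeadingIds(document.querySelectorAll('.js-toc__content h2'), 50);
// tocbot.init({ tocSelector: '.js-toc', contentSelector: '.js-toc__content' });
```

Because the ids are changed on the headings themselves, the TOC links and the anchors come from the same value. Note that capping can make two long headings collapse to the same id, so real content may need a uniqueness check on top of this.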

E-VANCE commented 8 months ago

Hi Tim,

many thanks for getting back and pointing me towards the solution – I was indeed already using the make-ids.js util before initializing tocbot and should have realised where to look in the first place 🙄

But since we're already at it, I'd propose a slight change / enhancement when sanitizing the headings:

Whitespace: .replace(/\s+/g, '-') instead of .split(' ').join('-')

This collapses any run of whitespace characters into a single hyphen. Before, a string such as 'some popular string   with   multiple whitespaces' would be rendered as some-popular-string---with---multiple-whitespaces.

Special chars: .replace(/[^a-z0-9-]/g, '') instead of .replace(/[!@#$%^&*():]/ig, '')

The current pattern misses numerous chars that IMHO produce invalid URLs / anchors, such as " or , which are pretty common, at least within our headings.

Umlauts: This admittedly is a special case and probably shouldn't be part of the default sanitization, but just to mention it for anyone else with a German background:

.replace(/ö/g, 'oe')
.replace(/ä/g, 'ae')
.replace(/ü/g, 'ue')
.replace(/ß/g, 'ss')

That takes care of the umlauts and renders them in a more conformant fashion.
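To make the three points concrete, here is a small side-by-side sketch. `oldSlug` reproduces the chain quoted earlier in this thread and `newSlug` the proposed one; both function names and the `umlauts` option are made up for this illustration and are not the make-ids.js API. One detail worth spelling out: the umlaut replacements have to run before the a-z/0-9 allowlist, otherwise the allowlist simply deletes those characters.

```javascript
// Chain as quoted earlier in the thread (denylist of a few special chars).
function oldSlug(text) {
  return text.trim().toLowerCase()
    .split(' ').join('-')
    .replace(/[!@#$%^&*():]/ig, '')
    .replace(/\//ig, '-');
}

// Proposed chain; the `umlauts` option is hypothetical, for comparison only.
function newSlug(text, { umlauts = false } = {}) {
  let slug = text.trim().toLowerCase();
  if (umlauts) {
    // Transliterate first: the allowlist below would otherwise drop these.
    slug = slug
      .replace(/ö/g, 'oe')
      .replace(/ä/g, 'ae')
      .replace(/ü/g, 'ue')
      .replace(/ß/g, 'ss');
  }
  return slug
    .replace(/\s+/g, '-')          // any whitespace run becomes one hyphen
    .replace(/[^a-z0-9-]/g, '');   // keep only a-z, 0-9 and '-'
}

const spaced = 'some popular string   with   multiple whitespaces';
const quoted = 'Tips, "tricks" and more?';
const german = 'Größe ändern';

console.log(oldSlug(spaced), '|', newSlug(spaced));
console.log(oldSlug(quoted), '|', newSlug(quoted));
console.log(newSlug(german), '|', newSlug(german, { umlauts: true }));
```

Without the transliteration step the German heading loses its umlauts entirely, which is why the replacements sit ahead of the allowlist in the chain.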

I have opened a PR with the proposed changes.

Many thanks again for this utility and for all the work that you've put into this over all the years! 🙏

tscanlin commented 8 months ago

Awesome, thanks for the explanation and the PR!! Many thanks to you! This is published in tocbot@4.21.3. I appreciate you making tocbot better for everyone, the logic you switched to is much cleaner :)