Possibility to enable passing unicode-range instead of glyphs?

bjrn commented 3 years ago

subset-font allows passing in a string of glyphs to subset, but would it be interesting to also include an option to pass a unicode-range, as is possible with pyftsubset?

I'm aware that subfont provides a conversion utility to convert a string to a unicode-range (for CSS output) but do you know if there's a standard-ish way of doing the opposite? I'm suspecting there might be a few weird exceptions to take into account?
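For context, the string-to-unicode-range direction mentioned above could look roughly like the sketch below. It is not subfont's actual utility; `textToUnicodeRange` and `toHex` are hypothetical names, and the output formatting (no zero padding, comma plus space as separator) is an assumption.

```javascript
// Format a code point as upper-case hex, e.g. 0x61 -> "61".
const toHex = (cp) => cp.toString(16).toUpperCase();

// Hypothetical helper: collapse the distinct code points of a string into
// a CSS unicode-range value, merging consecutive code points into ranges.
function textToUnicodeRange(text) {
  const codePoints = [...new Set([...text].map((ch) => ch.codePointAt(0)))]
    .sort((a, b) => a - b);
  const parts = [];
  let i = 0;
  while (i < codePoints.length) {
    let j = i;
    // Extend the run while the next code point is exactly one higher.
    while (j + 1 < codePoints.length && codePoints[j + 1] === codePoints[j] + 1) {
      j += 1;
    }
    const start = toHex(codePoints[i]);
    parts.push(j > i ? `U+${start}-${toHex(codePoints[j])}` : `U+${start}`);
    i = j + 1;
  }
  return parts.join(', ');
}

console.log(textToUnicodeRange('abcx')); // U+61-63, U+78
```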

papandreou commented 3 years ago

Good idea. Should be fairly straightforward.

I also wanted to look into whether it'd be possible to only include ligatures that might actually be exercised by the text provided. E.g. there's no reason to preserve an "ff" ligature if the text is "foof". Maybe it'd make sense to tackle those two ideas together.
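To make the "foof" example concrete, here is a rough sketch of the check at the text level. It assumes the ligature component sequences are already known as plain strings (the `candidateLigatures` list is hypothetical); a real font expresses ligatures as GSUB substitutions on glyphs, so this only illustrates the idea and says nothing about what harfbuzz supports.

```javascript
// Hypothetical list of ligature component sequences a font might define.
const candidateLigatures = ['ff', 'fi', 'fl', 'ffi', 'ffl'];

// Keep only the sequences that actually occur as adjacent characters in the text.
function usedLigatures(text, sequences) {
  return sequences.filter((seq) => text.includes(seq));
}

console.log(usedLigatures('foof', candidateLigatures)); // []
console.log(usedLigatures('office', candidateLigatures)); // [ 'ff', 'fi', 'ffi' ]
```

Note that this needs the original text with its character order intact; a deduplicated set of characters cannot answer the question.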

papandreou commented 3 years ago

> do you know if there's a standard-ish way of doing the opposite? I'm suspecting there might be a few weird exceptions to take into account?

Hmm, yeah, the U+4?? syntax looks like fun: https://developer.mozilla.org/en-US/docs/Web/CSS/@font-face/unicode-range

In terms of subsetting I guess it's fine to just expand that to all the possible values, whether or not those codepoints actually exist in the font (or in the Unicode repertoire 😅 ). The subsetting code should just ignore the codepoints that don't exist in the original font.
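A minimal sketch of that expansion, assuming well-formed input: each `?` stands for any hex digit, so `U+4??` becomes the interval U+400 to U+4FF. The function names are hypothetical, there is no validation of malformed tokens, and whether a code point exists in the font is left to the subsetter.

```javascript
// Turn one unicode-range token ("U+26", "U+0025-00FF", "U+4??") into an
// inclusive [start, end] pair of code points.
function parseRangeToken(token) {
  const body = token.trim().replace(/^u\+/i, '');
  if (body.includes('?')) {
    // Wildcards: lowest value replaces "?" with 0, highest with F.
    return [
      parseInt(body.replace(/\?/g, '0'), 16),
      parseInt(body.replace(/\?/g, 'F'), 16),
    ];
  }
  const [start, end = start] = body.split('-');
  return [parseInt(start, 16), parseInt(end, 16)];
}

// Expand a comma-separated unicode-range string into a sorted array of
// distinct code points, capped at the last valid code point U+10FFFF.
function expandUnicodeRange(range) {
  const codePoints = new Set();
  for (const token of range.split(',')) {
    if (token.trim() === '') continue;
    const [start, end] = parseRangeToken(token);
    for (let cp = start; cp <= Math.min(end, 0x10ffff); cp += 1) {
      codePoints.add(cp);
    }
  }
  return [...codePoints].sort((a, b) => a - b);
}

// Build the string of characters to hand to a subsetter.
const text = expandUnicodeRange('U+4??, U+26')
  .map((cp) => String.fromCodePoint(cp))
  .join('');
console.log(text.length); // 257
```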

papandreou commented 3 years ago

This module looks like it's up to the task: https://github.com/Japont/unicode-range

bjrn commented 3 years ago

Good find! Yes, I made a super-naïve test script before I stumbled upon that U+4?? syntax 😬, will take a look at that one! I completely understand if you want to keep this library small and focused, and that specifying a unicode-range might be an edge case that is better solved by providing an example in the readme where the conversion takes place prior to calling subset-font. I'll play around a bit with it and get back.

> no reason to preserve an "ff" ligature if the text is "foof"

true … but isn't the text converted to a Set (of sorts, I'm not familiar with harfbuzz) and sorted?

papandreou commented 3 years ago

> I'll play around a bit with it and get back.

Great! Good luck! 🍀

> > no reason to preserve an "ff" ligature if the text is "foof"
>
> true … but isn't the text converted to a Set (of sorts, I'm not familiar with harfbuzz) and sorted?

Yes, I think we'll have to go even more low level when instructing harfbuzz about which glyphs to include -- if that's even supported 😬

bjrn commented 3 years ago

const path = require('path');
const { readFile, writeFile } = require('fs').promises;
const subsetFont = require('subset-font');
const { UnicodeRange } = require('@japont/unicode-range');

const latinRange = 'U+0000-00FF, U+0131, U+0152-0153, U+02BB-02BC, U+02C6, U+02DA, U+02DC, U+2000-206F, U+2074, U+20AC, U+2122, U+2191, U+2193, U+2212, U+2215, U+FEFF, U+FFFD';

// util to handle passing unicode-range as a string
function formatRange(range) {
  if (typeof range === 'string') {
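    // strip whitespace, split on commas, and drop empty entries (e.g. from a trailing comma)
    return range.replace(/\s+/g, '').split(',').filter(Boolean);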
  }
  return range;
}

function getGlyphsFromUnicodeRange(range) {
  // UnicodeRange currently requires an array of ranges …
  const rangeArray = formatRange(range);

  // subset-font takes the text to keep as a string, so join the characters
  const glyphs = UnicodeRange.parse(rangeArray)
    .map((cp) => String.fromCodePoint(cp))
    .join('');

  return glyphs;
}

async function generateFont() {
  const font = await readFile(
    path.resolve(__dirname, 'woff2', 'SomeFontFile.woff2')
  );

  const glyphs = getGlyphsFromUnicodeRange(latinRange);

  const result = await subsetFont(font, glyphs, {
    targetFormat: 'woff2',
  });

  // ... and so on
}

Did a quick try, and from what I can tell so far, that library does the trick 👍🏼 .

I don't know how you feel, but figuring out which glyphs to subset might seem a bit out of scope for subset-font after all (in the same way as subfont handles parsing of content etc.). Let me know if you want me to make a PR with an example, or anything regarding this.

Skipping unused ligatures is an interesting one; depending on the language group there might be some savings. I have mostly thought about it as an on/off thing (i.e. liga is either enabled or disabled for the font). In my current use case, there's a mix of static and dynamic content, hence the need to subset fonts based on unicode-range rather than individual codepoints … I would love to dive deeper into it, though.

papandreou commented 3 years ago

Great that you got it to work! Thanks for sharing your solution. I agree with your scope concern. Let's leave it here for now and see if it comes up as a common request. Maybe we can even add a link to this issue to the README.

> Skipping unused ligatures is an interesting one, depending on language group there might be some savings. I have mostly thought about it as a on/off thing, (ie. liga is either enabled or disabled for the font).

I'll probably explore it one day when I have time. I'm not sure that the savings will be big either, it's mostly from a perfectionist angle. Spending years hunting down these kilobyte savings does that to you 🙈

> In my current use-case, there's a mix of static and dynamic content, hence the need to subset fonts based on unicode-range rather than individual codepoints … I would love to dive deeper into it though

Ah yes, that makes sense! Btw. subfont has an experimental --dynamic switch that renders the pages in a headless browser and does additional tracing inside it. But it might not work for you, depends on exactly how dynamic the content is :)

papandreou commented 3 years ago

I'd also be happy to entertain the idea of configuring subfont to include a given unicode-range of characters in the subsets, regardless of what the tracing step says. It wouldn't really be hard to do; I think the main challenge would be to come up with a way to configure it if it has to be configurable per @font-face declaration.
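The merging step itself could be as simple as a set union. The sketch below assumes the configured range has already been expanded into an array of code points; `mergeSubsetText` is a hypothetical name and not part of subfont.

```javascript
// Hypothetical helper: combine the characters found by tracing with extra
// code points from configuration, returning one deduplicated, sorted string.
function mergeSubsetText(tracedText, extraCodePoints) {
  const all = new Set([...tracedText].map((ch) => ch.codePointAt(0)));
  for (const cp of extraCodePoints) {
    all.add(cp);
  }
  return [...all]
    .sort((a, b) => a - b)
    .map((cp) => String.fromCodePoint(cp))
    .join('');
}

// Traced text "ba" plus U+0061 (already present) and U+0131 (dotless i).
console.log(mergeSubsetText('ba', [0x61, 0x131])); // abı
```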

bjrn commented 3 years ago

Yes, it could be a good fit within subfont's scope actually. Much of the tooling around generating @font-face declarations would be useful, just that instead of deriving unicode-range from parsed content, it would be provided by the configuration.

Regarding the per @font-face declaration configuration, that is a tricky one, since I guess much of the idea behind subfont is to enable it as a drop-in addition to static site generators.

papandreou commented 3 years ago

Yeah, that is the core use case, but I'm not opposed to exposing more controls like that. We could even do it as a custom CSS property in the @font-face rule, eg.:

@font-face {
  font-family: foo;
  src: ...;
  font-weight: 700;
  -subfont-unicode-range: U+0131, U+0152-0153, U+02BB-02BC;
}
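As a rough illustration of how such a property could be read back out, the sketch below scans a stylesheet string for @font-face rules and collects the proposed `-subfont-unicode-range` value. The property is only the suggestion from this comment, and `readSubfontRanges` is a hypothetical name. Splitting declarations on `;` is a simplification that breaks on values such as data URIs, so a real implementation would presumably use a proper CSS parser.

```javascript
// Hypothetical helper: find @font-face rules that carry the proposed
// -subfont-unicode-range property and return their family, weight and range.
function readSubfontRanges(css) {
  const results = [];
  const ruleRegex = /@font-face\s*\{([^}]*)\}/g;
  for (const [, body] of css.matchAll(ruleRegex)) {
    // Naive declaration parsing: split on ";" and then on the first ":".
    const declarations = Object.fromEntries(
      body
        .split(';')
        .map((d) => d.trim())
        .filter((d) => d.includes(':'))
        .map((d) => {
          const idx = d.indexOf(':');
          return [d.slice(0, idx).trim(), d.slice(idx + 1).trim()];
        })
    );
    const unicodeRange = declarations['-subfont-unicode-range'];
    if (unicodeRange) {
      results.push({
        fontFamily: declarations['font-family'],
        fontWeight: declarations['font-weight'],
        unicodeRange,
      });
    }
  }
  return results;
}

const exampleCss = `
@font-face {
  font-family: foo;
  src: url(foo.woff2);
  font-weight: 700;
  -subfont-unicode-range: U+0131, U+0152-0153, U+02BB-02BC;
}`;
console.log(readSubfontRanges(exampleCss));
```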

papandreou commented 3 months ago

For what it's worth, Munter/subfont#161 implemented the ability to specify text to include in the subset via -subfont-text.

I'll close this for now.
