Reorganise repository, add preprocessor

googlefonts / nam-files

Unicode ranges used to subset fonts in the Google Fonts CSS API

Apache License 2.0

Reorganise repository, add preprocessor #3

Closed. simoncozens closed this 10 months ago

simoncozens commented 10 months ago

Precursor to #2. This simply:

Moves the existing nam files to a directory, subsets-input.
Adds a Python preprocessing script. Because the nam files have no magic commands, it does nothing other than add a "generated" header to the input.
Checks in the output of the preprocessor to Lib/gfsubsets/data.
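
As a rough illustration of the second step, a header-only preprocessor could look like the sketch below. This is not the script from this PR: the function name `preprocess`, the constant `GENERATED_HEADER` and the header wording are all assumptions, and magic commands are not handled because their syntax is not described here.

```python
# Hypothetical header text; the real preprocessor's wording is not shown in this thread.
GENERATED_HEADER = "# This file is generated; edit the source in subsets-input instead.\n"


def preprocess(source_text: str) -> str:
    """Return the nam file contents with a "generated" header prepended.

    The body is passed through unchanged, which matches the described
    behaviour for files that contain no magic commands.
    """
    if source_text.startswith(GENERATED_HEADER):
        # Already processed: avoid stacking a second header (a design choice of this sketch).
        return source_text
    return GENERATED_HEADER + source_text
```

Because the header is a `#` comment line, a comparison that ignores comment lines should see no difference between input and output in this case.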

garretrieger commented 10 months ago

I wrote a script to compare the inputs and the outputs, and found that several of the output nam files are missing large chunks of codepoints compared to the input files. For example, all of the CJK subsets seem to be missing a large number of codepoints. Could you take a look?

garretrieger commented 10 months ago

Here's the script I used:

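#!/bin/bash
# Compare the first field of each input nam file with its preprocessed output.
for f in subsets-input/*.nam; do
  in=$(mktemp)
  out=$(mktemp)
  # Drop comment lines, then keep the first field of every non-empty line.
  grep -v -E "^#" "$f" | awk '$1 != "" { print $1 }' > "$in"
  grep -v -E "^#" "./Lib/gfsubsets/data/$(basename "$f")" | awk '$1 != "" { print $1 }' > "$out"
  echo "$f"
  diff -u "$in" "$out"
  rm -f "$in" "$out"
done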

simoncozens commented 10 months ago

I think there may be something up with your script. For example, it reports:

subsets-input/arabic_unique-glyphs.nam
--- /var/folders/jp/4p0m9zvx2l739tdpflm38kv40000gn/T/tmp.PibM1g7lPY     2024-01-18 09:52:52
+++ /var/folders/jp/4p0m9zvx2l739tdpflm38kv40000gn/T/tmp.JfJSnO4q1Q     2024-01-18 09:52:52
@@ -1,3 +1,7 @@
+0x0000
+0x000D
+0x0020
+0x00A0
 0x0600
 0x0601
 0x0602
@@ -254,13 +258,6 @@
 0x06FD
 0x06FE
 0x06FF
-0x200C
-0x200D
-0x200E
-0x2010

i.e. that 200C, 200D, 200E and 2010 are missing from Lib/gfsubsets/data/arabic_unique-glyphs.nam. But they aren't:

$ grep 0x20 subsets-input/arabic_unique-glyphs.nam
0x200C # ZERO WIDTH NON-JOINER
0x200D # ZERO WIDTH JOINER
0x200E # LEFT-TO-RIGHT MARK
0x2010 # HYPHEN
0x2011 # NON-BREAKING HYPHEN
0x204F # REVERSED SEMICOLON
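
A diff like the one above is what a line-by-line comparison reports when two files list the same codepoints in a different order, although the thread does not confirm that this is the cause here. A minimal sketch of an order-insensitive check compares sets instead. The names `codepoints` and `compare` are hypothetical, and the sketch assumes every data line starts with a `0x`-prefixed hexadecimal codepoint, as in the grep output above; any other first field raises `ValueError`.

```python
def codepoints(nam_text: str) -> set:
    """Collect the codepoints found in the first field of each data line."""
    found = set()
    for line in nam_text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        # int(..., 16) accepts the "0x" prefix.
        found.add(int(fields[0], 16))
    return found


def compare(input_text: str, output_text: str):
    """Return (missing_from_output, added_in_output) as sorted lists."""
    before = codepoints(input_text)
    after = codepoints(output_text)
    return sorted(before - after), sorted(after - before)
```

With this approach, two files that differ only in line order, comments or blank lines compare as equal, so anything reported is a real difference in the set of codepoints.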

simoncozens commented 10 months ago

Hmmm, although there are some differences in Chinese. Investigating...

simoncozens commented 10 months ago

Fixed it; the way I was detecting unassigned codepoints was unreliable. Now the only difference is that some genuinely unassigned codepoints have been removed. Here's my comparison script:

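#!/bin/bash
# Compare the sorted codepoint column of each input file with its preprocessed output.
for input in subsets-input/*.nam; do
  output="./Lib/gfsubsets/data/$(basename "$input")"
  echo "$input"
  # Keep the first field, sort, then drop empty lines and comment lines.
  diff -u <(awk '{print $1}' "$input" | sort | grep '.' | grep -v '^#') \
          <(awk '{print $1}' "$output" | sort | grep '.' | grep -v '^#')
done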
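
On the unassigned codepoints: the thread does not say how the preprocessor detects them. One common approach, shown here only as a sketch and not as the method used in this PR, is to check for the Unicode general category `Cn` with Python's standard `unicodedata` module. The function names are hypothetical. The result depends on the Unicode version bundled with the running Python (see `unicodedata.unidata_version`), so an older interpreter may treat recently assigned codepoints as unassigned.

```python
import unicodedata


def is_unassigned(codepoint: int) -> bool:
    """True if the bundled Unicode database gives the codepoint category Cn."""
    return unicodedata.category(chr(codepoint)) == "Cn"


def drop_unassigned(codepoints):
    """Return the codepoints that are assigned, in sorted order."""
    return sorted(cp for cp in codepoints if not is_unassigned(cp))
```

Note that `Cn` also covers noncharacters, while format characters such as U+200C (category `Cf`) count as assigned and are kept.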

garretrieger commented 10 months ago

Ok great, the diffs look good to me now.