Closed simoncozens closed 10 months ago
I wrote a script to compare the inputs and the outputs, and found that several of the output nam files are missing large chunks of codepoints compared to the input files. For example, all of the CJK subsets seem to be missing a large number of codepoints. Could you take a look?
Here's the script I used:
#!/bin/bash
# For each input .nam file, diff its first column against the generated file.
for f in subsets-input/*.nam; do
  in=$(mktemp)
  out=$(mktemp)
  grep -v -E "^[#]" "$f" | awk '{if ($1 != "") { print $1 };}' > "$in"
  grep -v -E "^[#]" "./Lib/gfsubsets/data/$(basename "$f")" | awk '{ if ($1 != "") {print $1};}' > "$out"
  echo "$f"
  diff -u "$in" "$out"
  rm -f "$in" "$out"
done
I think there may be something up with your script. For example, it reports:
subsets-input/arabic_unique-glyphs.nam
--- /var/folders/jp/4p0m9zvx2l739tdpflm38kv40000gn/T/tmp.PibM1g7lPY 2024-01-18 09:52:52
+++ /var/folders/jp/4p0m9zvx2l739tdpflm38kv40000gn/T/tmp.JfJSnO4q1Q 2024-01-18 09:52:52
@@ -1,3 +1,7 @@
+0x0000
+0x000D
+0x0020
+0x00A0
0x0600
0x0601
0x0602
@@ -254,13 +258,6 @@
0x06FD
0x06FE
0x06FF
-0x200C
-0x200D
-0x200E
-0x2010
i.e. that 200C, 200D, 200E and 2010 are missing from Lib/gfsubsets/data/arabic_unique-glyphs.nam. But they aren't:
$ grep 0x20 subsets-input/arabic_unique-glyphs.nam
0x200C # ZERO WIDTH NON-JOINER
0x200D # ZERO WIDTH JOINER
0x200E # LEFT-TO-RIGHT MARK
0x2010 # HYPHEN
0x2011 # NON-BREAKING HYPHEN
0x204F # REVERSED SEMICOLON
Hmmm, although there are some differences in Chinese. Investigating...
Fixed it; the way I was detecting unassigned codepoints was unreliable. Now the only difference is that some genuinely unassigned codepoints have been removed. Here's my comparison script:
#!/bin/bash
# Compare the sorted first column of each input file with its generated counterpart.
for input in subsets-input/*.nam; do
  output="./Lib/gfsubsets/data/$(basename "$input")"
  echo "$input"
  diff -u <(awk '{print $1}' "$input" | sort | grep '.' | grep -v '^#') <(awk '{print $1}' "$output" | sort | grep '.' | grep -v '^#')
done
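One visible difference between the two scripts is that the second one sorts both sides before diffing. The thread does not say this is why the first script over-reported, but the sketch below shows the general effect: `diff` is order-sensitive, so two files with the same codepoints in a different order still produce a diff unless both are sorted first. The file names and codepoints here are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical demo files: the same three codepoints in a different order.
tmpdir=$(mktemp -d)
printf '0x0600\n0x0601\n0x200C\n' > "$tmpdir/a.nam"
printf '0x200C\n0x0600\n0x0601\n' > "$tmpdir/b.nam"

# Order-sensitive comparison: diff exits non-zero because the lines are reordered.
unsorted_status=0
diff -u "$tmpdir/a.nam" "$tmpdir/b.nam" > /dev/null || unsorted_status=$?

# Order-insensitive comparison: sort both sides so only real set differences remain.
sort "$tmpdir/a.nam" > "$tmpdir/a.sorted"
sort "$tmpdir/b.nam" > "$tmpdir/b.sorted"
sorted_status=0
diff -u "$tmpdir/a.sorted" "$tmpdir/b.sorted" > /dev/null || sorted_status=$?

echo "unsorted: $unsorted_status, sorted: $sorted_status"
rm -rf "$tmpdir"
```

With these demo files the unsorted comparison reports a difference and the sorted one does not.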
Ok great, the diffs look good to me now.
Precursor to #2. This simply:
subset-inputs
.Lib/gfsubsets/data
.