Closed madig closed 1 year ago
Oops, need to rebase.
The serialization code path is ugly though. Know of something better?
Question: out of interest, why do we need to preserve order in the codepoints?
Also, it should be trivial to add a borrowing iterator for Codepoints:

```rust
fn iter(&self) -> impl Iterator<Item = char> + '_ {
    self.codepoints.iter().copied()
}
```
Also, I don't know if itertools is already pulled in as a transitive dependency, but if so you could use its .unique() filter for this (link)
Codepoints are ordered lists in the UFO spec. In defcon and ufoLib2, a glyph has the unicode and unicodes properties, where the former returns the first item of the latter.
I think the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing, and itertools uses a HashSet :)
> itertools uses a HashSet
I know, and that's no different to what you've done here, just would be cleaner / more readable
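For the record, an order-preserving unique filter over an iterator, which is essentially what itertools' .unique() adaptor does with a HashSet under the hood, can be hand-rolled in a few lines with just the stdlib (sketch):

```rust
use std::collections::HashSet;

// Keep only the first occurrence of each item, preserving order --
// the same behavior as itertools' `.unique()` adaptor.
fn unique_chars(input: impl IntoIterator<Item = char>) -> Vec<char> {
    let mut seen = HashSet::new();
    // `HashSet::insert` returns false for items already seen,
    // so the filter drops later duplicates.
    input.into_iter().filter(|c| seen.insert(*c)).collect()
}

fn main() {
    assert_eq!(unique_chars(['a', 'b', 'a', 'c', 'b']), ['a', 'b', 'c']);
    println!("ok");
}
```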
> the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing
Yeah, so you don't deserialize into Codepoints directly, but can transform a list if you need to? Why is that preferable? I guess in cases where the user doesn't do anything involving the unicode(s)?
I did a list transform before, but @cmyr said he'd prefer a type 😬 I mean, I haven't settled on anything and would gladly take the simplest solution (custom containers are a pain to bring up to the API they're wrapping).
I understand that the order of the codepoints is important insofar as the first codepoint is special, but does the order of subsequent codepoints also matter?
I don't know.
> does the order of subsequent codepoints also matter?
It doesn't. It probably only matters if this order is carried over when round-tripping from XML to XML, e.g. if one wishes to reduce diff noise.
By the way, as you all well know, the whole notion of a "primary unicode codepoint" is flawed: it doesn't exist in OpenType cmap tables, which map codepoints to glyphs and not the other way around, as font editors make it seem.
We could go full-on set then, and sort on serialization, if we didn't care about git noise with other libraries.
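A BTreeSet would give exactly that for free: uniqueness plus deterministic, sorted iteration at serialization time. A minimal sketch:

```rust
use std::collections::BTreeSet;

fn main() {
    // A BTreeSet deduplicates and always iterates in sorted order,
    // so serialization output is deterministic regardless of the
    // order codepoints were inserted in.
    let codepoints: BTreeSet<char> = ['b', 'a', 'b', 'c'].into_iter().collect();
    let serialized: Vec<char> = codepoints.into_iter().collect();
    assert_eq!(serialized, ['a', 'b', 'c']);
    println!("{:?}", serialized);
}
```

The trade-off, as noted above, is that the original first-item position is lost, which is what the git-noise concern is about.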
Alternatively, to reduce noise, just retain a sorted, deduplicated list: deterministic order, without duplicates. With this approach you could also preserve the first codepoint if you wanted to.
so my original thinking with the Codepoints struct is that we would separate out the 'primary' codepoint, like:

```rust
Codepoints {
    primary: char,
    others: Vec<char>,
}
```
but it might be nice if internally we do use a single Vec, like in this patch, because then we can have AsRef<[char]> and get all the iterators etc. for free.
In this case, I do think that we should preserve the position of the first item in the array, but otherwise ensure that it is always deduplicated. To do this at construction time, we could have something like:
```rust
use std::cmp::Ordering;

struct Codepoints(Vec<char>);

impl Codepoints {
    fn new(mut raw: Vec<char>) -> Self {
        if raw.len() <= 1 {
            return Self(raw);
        }
        let first = *raw.first().unwrap();
        // custom sorting that always ranks the first item lowest
        raw.sort_unstable_by(|a, b| match (a, b) {
            (a, b) if *a == first && *b == first => Ordering::Equal,
            (a, _) if *a == first => Ordering::Less,
            (_, b) if *b == first => Ordering::Greater,
            (a, b) => a.cmp(b),
        });
        raw.dedup();
        Self(raw)
    }

    fn insert(&mut self, codepoint: char) {
        if !self.0.contains(&codepoint) {
            self.0.push(codepoint);
            // don't include first codepoint in the sort
            self.0[1..].sort_unstable();
        }
    }
}
```
And then we would want AsRef and Deref (but not AsMut or DerefMut) impls, and maybe a few more methods? For instance, do we want a method for changing the primary codepoint? What other tasks are common?
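The Deref/AsRef part could look roughly like this (a sketch; only read-only access is exposed, so callers can't break the ordering/dedup invariants):

```rust
use std::ops::Deref;

struct Codepoints(Vec<char>);

// Deref (but not DerefMut) exposes all of `[char]`'s read-only API --
// iter(), first(), contains(), len(), etc. -- without letting callers
// mutate the inner Vec and break the invariants.
impl Deref for Codepoints {
    type Target = [char];
    fn deref(&self) -> &[char] {
        &self.0
    }
}

impl AsRef<[char]> for Codepoints {
    fn as_ref(&self) -> &[char] {
        &self.0
    }
}

fn main() {
    let cps = Codepoints(vec!['A', 'a']);
    assert_eq!(cps.first(), Some(&'A')); // slice method via Deref
    assert!(cps.contains(&'a'));
    println!("ok");
}
```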
> do we want a method for changing the primary codepoint?
No! We should instead deprecate that in existing APIs like defcon or ufoLib2 and nudge users to only ever use the unicodes list, in the plural. There's no use case for a "primary unicode".
Just a note on implementation details: we will still need a newtype over the Vec<char>, otherwise a user would be able to modify it without upholding its invariants (being ordered except for the first element, and having no duplicates).
I can get behind using a set (and sorting on serialization) and declaring that people should use unicodes in the Py libs.
That's what it really is: an unsorted set of unique codepoints that map to a given glyph, with no intrinsic priority over one another, except for the (semantically meaningless) order in which the elements appear in the XML document. E.g. in a unicase font, if both 0x0041 and 0x0061 map to the same glyph (named "A" or "a" or "Colin", it doesn't really matter), that doesn't make either of them more "primary".
Okay, given all this input I vote for just using indexset. This will ensure that we maintain the original order, but also ensure that we do not contain duplicates.
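For illustration, the behavior IndexSet (from the indexmap crate) gives us — insertion order preserved, duplicates rejected — sketched here with the stdlib only, so it runs without the dependency:

```rust
// Stdlib sketch of what `indexmap::IndexSet` provides out of the box:
// insertion order is preserved and duplicate inserts are no-ops.
// (IndexSet does this in O(1) per insert via a hash table; the linear
// `contains` scan here is just to keep the sketch dependency-free.)
fn push_unique(list: &mut Vec<char>, c: char) -> bool {
    if list.contains(&c) {
        false // duplicate: rejected, order untouched
    } else {
        list.push(c); // new item: appended, preserving insertion order
        true
    }
}

fn main() {
    let mut cps = Vec::new();
    for c in ['b', 'a', 'b', 'c'] {
        push_unique(&mut cps, c);
    }
    assert_eq!(cps, ['b', 'a', 'c']);
    println!("{:?}", cps);
}
```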
Closes #263