linebender / norad

Rust crate for working with Unified Font Object files
Apache License 2.0

Deduplicate codepoints on read and write #268

Closed madig closed 1 year ago

madig commented 2 years ago

Closes #263

madig commented 1 year ago

Oops, need to rebase.

madig commented 1 year ago

The serialization code path is ugly though. Know of something better?

RickyDaMa commented 1 year ago

Question: out of interest, why do we need to preserve order in the codepoints?

Also, it should be trivial to add a borrowing iterator for Codepoints:

fn iter(&self) -> impl Iterator<Item = char> + '_ {
    self.codepoints.iter().copied()
}

Also, I don't know if itertools is already pulled in as a transitive dependency, but if so you could use its .unique() filter (link)
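For reference, `.unique()` keeps the first occurrence of each element and drops later repeats. A minimal std-only sketch with the same behavior (no itertools dependency assumed; `unique_codepoints` is a hypothetical helper name):

```rust
use std::collections::HashSet;

// Order-preserving dedup, equivalent in spirit to itertools' `.unique()`:
// keep the first occurrence of each codepoint, drop later repeats.
fn unique_codepoints(codepoints: &[char]) -> Vec<char> {
    let mut seen = HashSet::new();
    codepoints
        .iter()
        .copied()
        // `HashSet::insert` returns false for values already present
        .filter(|c| seen.insert(*c))
        .collect()
}
```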

madig commented 1 year ago

Codepoints are ordered lists in the UFO spec. In defcon and ufoLib2, a glyph has the unicode and unicodes properties, where the former returns the first item of the latter.

I think the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing, and itertools uses a HashSet :)

RickyDaMa commented 1 year ago

itertools uses a HashSet

I know, and that's no different from what you've done here; it would just be cleaner / more readable

the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing

Yeah, so you don't deserialize into Codepoints directly, but transform a list when you need to? Why is that preferable? I guess in cases where the user doesn't do anything involving the unicode(s)?

madig commented 1 year ago

I did a list transform before, but @cmyr said he'd prefer a type 😬 I mean, I haven't settled on anything and would gladly take the simplest solution (custom containers are a pain to bring up to the API they're wrapping).

cmyr commented 1 year ago

I understand that the order of the codepoints is important insofar as the first codepoint is special, but does the order of subsequent codepoints also matter?

madig commented 1 year ago

I don't know.

anthrotype commented 1 year ago

does the order of subsequent codepoints also matter?

it doesn't. It probably only matters if this order is carried over when roundtripping from xml to xml, e.g. if one wishes to reduce diff noise

by the way, as you all well know, the whole notion of "primary unicode codepoint" is flawed because it doesn't exist in opentype cmap tables, which is a map from codepoints to glyphs and not the other way around, as font editors make it seem.

madig commented 1 year ago

We could go full-on set then, and sort on serialization, if we didn't care about git noise with other libraries.

RickyDaMa commented 1 year ago

Alternatively, to reduce noise, just retain a sorted, deduplicated list. Deterministic order, without duplicates. With this approach you could also preserve the first codepoint if you wanted to
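The fully-sorted variant needs nothing beyond std; a sketch (hypothetical `normalize` helper, not code from this patch):

```rust
// Sketch of the "sorted, deduplicated list" idea: deterministic order,
// no duplicates, at the cost of discarding the original xml order entirely.
fn normalize(mut codepoints: Vec<char>) -> Vec<char> {
    codepoints.sort_unstable();
    // `dedup` only removes *adjacent* duplicates, hence the sort first
    codepoints.dedup();
    codepoints
}
```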

cmyr commented 1 year ago

so my original thinking with the Codepoints struct is that we would separate out the 'primary' codepoint, like:

Codepoints {
    primary: char,
    others: Vec<char>,
}

but it might be nice if internally we do use a single Vec, like in this patch, because then we can have AsRef<[char]> and get all the iterators etc for free.

In this case, I do think that we should preserve the position of the first item in the array, but otherwise ensure that it is always deduplicated. To do this at construction time, we could have something like:

use std::cmp::Ordering;

struct Codepoints(Vec<char>);

impl Codepoints {
    fn new(mut raw: Vec<char>) -> Self {
        if raw.len() <= 1 {
            return Self(raw);
        }
        let first = *raw.first().unwrap();

        // custom sorting that always ranks first item lowest
        raw.sort_unstable_by(|a, b| match (a, b) {
            (a, b) if *a == first && *b == first => Ordering::Equal,
            (a, _) if *a == first => Ordering::Less,
            (_, b) if *b == first => Ordering::Greater,
            (a, b) => a.cmp(b),
        });
        raw.dedup();
        Self(raw)
    }

    fn insert(&mut self, codepoint: char) {
        if !self.0.contains(&codepoint) {
            self.0.push(codepoint);
            // don't include first codepoint in sort
            self.0[1..].sort_unstable();
        }
    }
}

And then we would want AsRef and Deref (but not AsMut or DerefMut) impls, and maybe a few more methods? For instance, do we want a method for changing the primary codepoint? What other tasks are common?
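A sketch of the read-only impls mentioned above, assuming the `Codepoints(Vec<char>)` newtype (repeated here for self-containment); exposing only shared references means callers get slice methods and iterators without being able to break the ordering/dedup invariants:

```rust
struct Codepoints(Vec<char>);

// Deref (but not DerefMut) gives read-only access to all slice methods.
impl std::ops::Deref for Codepoints {
    type Target = [char];

    fn deref(&self) -> &[char] {
        &self.0
    }
}

impl AsRef<[char]> for Codepoints {
    fn as_ref(&self) -> &[char] {
        &self.0
    }
}
```

With these, `codepoints.first()`, `codepoints.iter()`, `codepoints.len()` etc. all work for free via deref coercion.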

anthrotype commented 1 year ago

do we want a method for changing the primary codepoint?

No! We should instead deprecate that in existing APIs like defcon or ufoLib2 and nudge users to only ever use the unicodes list, in the plural. There's no use case for a "primary unicode".

RickyDaMa commented 1 year ago

Just a note for implementation details: we will still need to use a newtype over the Vec<char> otherwise a user would be able to modify it without upholding its invariants (being ordered except the first element and having no duplicates)

madig commented 1 year ago

I can get behind using a set (and sorting on serialization) and declaring that people should use unicodes in Py libs.

anthrotype commented 1 year ago

that's what it really is, an unsorted set of unique codepoints that map to a given glyph, with no intrinsic priority over one another except for the (semantically meaningless) order in which the elements appear in the xml document. E.g. in a unicase font, if both 0x0041 and 0x0061 characters map to the same glyph (named "A" or "a" or "Colin", doesn't really matter), that doesn't make either 0x0041 or 0x0061 more "primary".
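To illustrate the direction of the mapping with a toy example (a plain `HashMap` standing in for a cmap subtable; the glyph name and helper are made up for illustration):

```rust
use std::collections::HashMap;

// A cmap-style mapping goes codepoint -> glyph, not the other way around.
fn unicase_cmap() -> HashMap<char, &'static str> {
    let mut cmap = HashMap::new();
    // In a unicase font both the upper- and lowercase codepoints map to
    // the same glyph; neither entry is any more "primary" than the other.
    cmap.insert('\u{0041}', "a"); // 'A'
    cmap.insert('\u{0061}', "a"); // 'a'
    cmap
}
```

Inverting such a map gives a glyph-to-codepoints multimap, which is where the set-of-codepoints view comes from.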

cmyr commented 1 year ago

Okay, given all this input I vote for just using indexset. This will ensure that we maintain the original order, but also ensure that we do not contain duplicates.
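Presumably this refers to the `indexmap` crate's `IndexSet`. To show the two properties that matter here without pulling in the dependency, a minimal std-only stand-in (illustrative only, not the proposed implementation):

```rust
// Tiny stand-in for indexmap's IndexSet, for illustration only:
// preserves first-insertion order and silently ignores duplicates.
struct TinyIndexSet(Vec<char>);

impl TinyIndexSet {
    fn new() -> Self {
        TinyIndexSet(Vec::new())
    }

    // Returns true if the codepoint was newly inserted (IndexSet-style API).
    fn insert(&mut self, c: char) -> bool {
        if self.0.contains(&c) {
            false
        } else {
            self.0.push(c);
            true
        }
    }
}
```

The linear `contains` scan is fine for codepoint lists, which are nearly always one or two elements long; the real `IndexSet` uses a hash table for lookup.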