open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic
Other
234 stars 24 forks source link

Indirect table slices #207

Closed CasualX closed 6 years ago

CasualX commented 6 years ago

Hello,

I apologize in advance for this unsolicited PR, I am using the PR functionality to demonstrate a potential improvement to the codebase but its contents do not adhere to the contributing guidelines.

CAD1997 posted this reddit thread asking for help with some problems.

While I cannot directly help fix the problem I did notice some potential for improvement:

CharDataTable is rather inefficient when the mapped to value V has more than one value. Typically V then becomes for some property T: &'static [T] incurring quite a hefty overhead of 16 bytes per char for the slice reference as well as creating 10,000s of static slices that the compiler needs to deal with.

To improve this situation you can create a second table indexed by V being a Range<u32> or even Range<u16>. On x64 this means V is now 8 or 4 bytes respectively (compared to 16 bytes) and instead of 10,000s of small static slices the compiler just deals with one huge static slice.

In cases where V ends up being &'static [&'static str] and there are many string literal repetitions it can be further improved by substituting an enum for the &'static str and leaving conversion from the enum to string to users of the crate. However this has impact on user facing APIs (you cannot return the slice of string slices directly any more).

That is what I attempt to implement in this PR, but I didn't get very far.

  1. I introduced a trait to replace the CharDataTable enum which will allow me to create dedicated structs to the different ways the tables can be stored. Now that I think about it this probably isn't too essential. I rename the old methods so I can easily find the places I need to use the new trait.

  2. I create dedicated structs for the two variants for the CharDataTable enum and implement the previous trait.

  3. I selectively replace some of the CharDataTable instances to the new struct, to get an idea of how feasible this approach is. Hacked together by editing the generated code because I don't know how the code generation step works.

  4. I want to show how this indirect table can work. Actually updating the generated data was too daunting so I stopped there.

Anyway that's where I got after a few hours. If the compiler's problem is that literal 10,000s of small static slices are created then instead concatenating them in one big table from which they're indexed may help. This indexing approach should still be able to reduce memory overhead of all the slice references.

CAD97 commented 6 years ago

Sorry this took so long to get to; it looks interesting! I think the idea of an indirect slice will definitely benefit us in some cases.

I've been playing with the tables off-and-on ever since the original Reddit post, and came across some additional info that I'm trying to integrate.

I think the eventual target currently looks something like this:

// we want inherent impls but traits for this pseudo-rust example
trait CharSet {
    fn contains(&self, needle: char) -> bool;
}
trait CharMap {
    type Item: Copy;
    fn find(&self, needle: char) -> Option<Self::Item>;
}

impl CharSet for CharTrieSet {} // Something like stdlib's BoolTrie
impl CharSet for T: CharMap<Item=()> {
    fn contains(&self, needle: char) -> bool { self.find(needle).is_some() }
}

impl CharMap<Item=T> for CharRangeMap<T>         {} // eytzinger-ordered `&[(CharRange ,   T)]`
impl CharMap<Item=T> for CharMap<T>              {} // eytzinger-ordered `&[(char      ,   T)]`
impl CharMap<Item=T> for IndirectCharRangeMap<T> {} // eytzinger-ordered `(&[CharRange], &[T])`
impl CharMap<Item=T> for IndirectCharMap<T>      {} // eytzinger-ordered `(&[char]     , &[T])`
impl CharMap<Item=T> for CharPhfMap<T>           {} // Perfect Hash Function (maybe) (+indirect?)
impl CharMap<Item=T> for CharNameMap             {} // Uses unique compression techniques

// And FST for Name->char

Each individual table would then get to choose which format suits its individual needs best.

CasualX commented 6 years ago

Those names sound far better than I could ever come up with :P

That's the general idea yes, with some comments:

Using traits instead of an enum generic over V will be necessary for indirect data tables since you'll need to differentiate between the returned value (eg. &'static [V]) and the backing store array type (eg. [V; X]). This cannot be done with the enum as it exists now.

Further I'm not sure why you'd want to split it into two traits unless it's absolutely necessary. I'm not a fan of generalizing a solution before I understand and implemented a specialized version for my own use case. So I'd keep just one trait CharMap, contains can always just have a default implementation.

Unfortunately I don't understand the code generation aspect, with some help I'll see if I can make this PR work.

behnam commented 6 years ago

Thanks, @CasualX, for submitting this work. We love to get new collaborators or random submissions!

I finally found the time to take a look here and all sound good. Now, since @CAD97 is already implementing these ideas, along other things, in GH-218, I'm going to close this PR and we can keep the rest of the conversation there.

I mentioned some rationals about the table generation process in GH-218 already, but there's one more thing to add here. We didn't spend much on optimizing these steps so far, mainly because we were expanding the libraries to help us figure out what are the abstractions we want. Now that we have many different data types already in place, I think it's a very good time to work on it.