open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic
Other
234 stars 24 forks source link

Split tables #218

Open CAD97 opened 6 years ago

CAD97 commented 6 years ago

Supersedes #207

Step one of refactoring the table structure for better future growth. I'll fix the merge conflict soon, but wanted to get this out.

I've been playing around with the tables ever since the thread prompted by miri. This is an incremental move towards the table design that I think is what we want.

The enum has been slit into two distinct types. This skips the branch required before access.

Both of the tables are now indirect -- this means having a separate &[char] and &[T], so everything should pack better. As slice::binary_search returns an index anyway, this seems like an all-around better choice for our associative slices, since space compression is what we're mostly concerned with, and the cyclomatic cost remains the same. I suspect any speed lost by not having the value next to the key should be gained in the better packing of data for the binary search. See the first commit for the full argument.

After this, the next step will be to add a table specifically for the char->Name translation. See the multiple places I've marked TODO(CAD97) for the upcoming work.

If we ever get fields with a type of impl Trait, it would probably be prudent to add a trait for the CharMap shape, so that the table structure can be changed in the future without affecting the public API surface. (Along those lines, I think there might be a public dependency in canonical_composition_mapping that we'd want to get rid of before bringing unic-ucd to 1.0, based on what I had to modify.)

r? @behnam ping @CasualX; this sort of integrates some of the path you took.


This change is Reviewable

CasualX commented 6 years ago

The PR looks great so far!

Separating the keys and values into separate tables isn't just good for packing, it's also great for cache pressure since the values of all the keys not being searched for don't pollute the cache. The savings in memory being touched for searching should far outweigh the single cache miss looking up the final value.

Furthermore LLVM should be able to figure out when the tables have the same length to eliminate some bound checks.

Have you considered generating these rsv (heh, rust separated values) files in a custom build step instead of having them checked into source control? (I can imagine generating them on build every time slows down compile times too much...)

CAD97 commented 6 years ago

Surprisingly, the rsv generation is rather quick in release mode. Computers are fast, even with the simple > efficient mindset of the generation code's ownership model.

However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.

I agree that it'd be better to get the downloaded/generated files out of the git repository, but that's future work and it's just easier to commit for now.

Additionally, the large number of subcrates makes it even more interesting. The end goal might be unic-gen as a library and used in individual build scripts, but for now a single manual script is easier. It's already much better than the collection of Python that most repos like this run off of, as it's actually decently documented (and if it's not, that's mostly my fault).

behnam commented 6 years ago

Haven't looked at the diff yet, but just a quick reply to the conversion here...

rsv (heh, rust separated values)

The idea was just "Rust Value", since they can be included where a value is expected.

And, I believe we have rsd, short for "Rust Definition", which contains items, to be included inside a module, trait or type definition.

However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.

One main reason we have decided to commit generated tables into the repo is that we want to make sure no-std components stay no-std, even in their build step. Generating these steps in no-std would add extra complexity, which IMHO is not necessary. Eventually, if we conclude that it's not a useful feature to have, we can move the generation code into build scripts.

On the source data side, we definitely don't want any build process/system to pull data over the web. It's bad for our reliability, and it's bad for unicode.org, etc servers. I'm working on setting up official git mirrors for Unicode data files. When those are available, we can make data/* directories to be just git submodules, mirroring the official repos.

Overall, IMHO we have more important features to focus on right now, and the data/ and gen/ parts of the code-base are stable enough and don't need major changes (except table format changes which @CAD97 is working on). Of course if there are any issues with the process, we can start a new issue discussion and get to the details.

CAD97 commented 6 years ago

That merges in the changes since I started the branch.

We'll see how compile time goes on Travis, and if it's abysmal then I'll try the stopgap simplification of unic/ucd/name to just a direct mapping rather than the pieces currently used; that will be fewer things for the compiler to worry about.

I still plan to do the specialized table, but hopefully this PR is landable separate from that.

CAD97 commented 6 years ago

Timeout

   Compiling unic-data v0.0.0 (file:///home/travis/build/behnam/rust-unic/data)
     Running `rustc --crate-name unic_data data/src/main.rs --crate-type bin --emit=dep-info,link -C debuginfo=2 -C metadata=dc3747200008d173 -C extra-filename=-dc3747200008d173 --out-dir /home/travis/build/behnam/rust-unic/target/debug/deps -C incremental=/home/travis/build/behnam/rust-unic/target/debug/incremental -L dependency=/home/travis/build/behnam/rust-unic/target/debug/deps --extern serde_derive=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde_derive-ea47998e28307385.so --extern serde=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde-a8e442193f4c8269.rlib --extern toml=/home/travis/build/behnam/rust-unic/target/debug/deps/libtoml-fb4ac857952b9e34.rlib --extern tokio_core=/home/travis/build/behnam/rust-unic/target/debug/deps/libtokio_core-e3da18abb1df214d.rlib --extern clap=/home/travis/build/behnam/rust-unic/target/debug/deps/libclap-a12ff8fc86df6425.rlib --extern hyper=/home/travis/build/behnam/rust-unic/target/debug/deps/libhyper-4fe365a952c190d5.rlib --extern futures=/home/travis/build/behnam/rust-unic/target/debug/deps/libfutures-455c535dc7e0a786.rlib`

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

I don't think I touched unic-data; ideas @behnam ? Last master build is still green, so I have no idea why it's timing out here.

Of note: it's timing out on 1.22 as well, so it's not a Rust change.

behnam commented 6 years ago

It's very strange for unic_data to fail on all builds, @CAD97. I got a similar time out on Travis once last week, only on one or two configs. It went well after hitting the Restart button. Maybe just try that once?

The AppVeyor failure seems to be unrelated to the Travis one. I believe serde-derive has some ongoing issues on nightly at the moment. There might be reports on that on https://github.com/rust-lang/rust already.

CAD97 commented 6 years ago

Rerunning Travis.

https://travis-ci.org/behnam/rust-unic/builds/355630492

CAD97 commented 6 years ago

All builds timed out on unic_data again. I'm going to ask Travis to run a build off of master, and see if that fails as well. I was having some build time oscillation on my local machine, but still don't really see what I might have changed for that crate's build perf, and I just did a cold build for unic_data at 1m16s on my local machine.

I'm scared that this is some sort of build indeterminacy being exacerbated by Travis's environment.

https://travis-ci.org/behnam/rust-unic/builds/355665777

CAD97 commented 6 years ago

Master build passed. I'm going to restart Travis on this PR one more time, then if that fails accross the board again I'll try and see what I can diagnose locally. @behnam if you've got a not-Windows machine, it'd be useful if you could run a cold build and see what you get. FWIW my last cold build --all took a total of ~21m (though I should note I built unic-data separately first).

CAD97 commented 6 years ago

Did one more build locally, cold build --all. Hung at unic-data at around 20% cpu utilization for >10m.

real    20m32.337s
user    0m0.031s
sys     0m0.015s
CAD97 commented 6 years ago

Just curious on how this compares since it's working from a cold cache anyway.

CAD97 commented 6 years ago

No difference. Will remove the de-incremental tomorrow.

behnam commented 6 years ago

Okay, I've tried on my MacBook Pro and get similar results.

Here's what I think is happening: unic-ucd-name is the crate with build problem, but we see unic-data as the last item in the list because that's the one with most external dependencies. When unic-ucd-name gets stuck, build for all other components await that, but unic-data gets cleared from deps and gets build eventually.

To see this, you can limit the parallel builds to one.

Now I'm going to take a look at the code and why the build halts.

behnam commented 6 years ago

Looks like this is back to the unknown situation where just touching the names table puts the compiler in seemingly-infinite loop.

Here's an idea to move forward and figure out the names problem later: how about you keep the new trait and type definitions along with the existing data type, and migrate over anything that doesn't break, and we will dig deeper into the ucd-name component afterwards, knowing the lower types are working fine.

behnam commented 6 years ago

FYI, I got unic-ucd-name compiled on this branch on my MacBook Pro in about 19 minutes:

$ time cargo +stable build --release --verbose --manifest-path unic/ucd/name/Cargo.toml
   Compiling unic-char-range v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/range)
   Compiling unic-common v0.7.0 (file:///Users/behnam/code/unic/unic/unic/common)
     Running `rustc --crate-name unic_char_range unic/char/range/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=f3ac911853d0e14a -C extra-filename=-f3ac911853d0e14a --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
     Running `rustc --crate-name unic_common unic/common/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=bbd743bcd5fc1b63 -C extra-filename=-bbd743bcd5fc1b63 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
   Compiling unic-ucd-version v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/version)
     Running `rustc --crate-name unic_ucd_version unic/ucd/version/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=eb58ae67be7e945f -C extra-filename=-eb58ae67be7e945f --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_common=/Users/behnam/code/unic/unic/target/release/deps/libunic_common-bbd743bcd5fc1b63.rlib`
   Compiling unic-char-property v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/property)
     Running `rustc --crate-name unic_char_property unic/char/property/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=5c7d5a21d125f65d -C extra-filename=-5c7d5a21d125f65d --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_char_range=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_range-f3ac911853d0e14a.rlib`
   Compiling unic-ucd-hangul v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/hangul)
     Running `rustc --crate-name unic_ucd_hangul unic/ucd/hangul/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=290aab54d532b223 -C extra-filename=-290aab54d532b223 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
   Compiling unic-ucd-name v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/name)
     Running `rustc --crate-name unic_ucd_name unic/ucd/name/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=cdc37663d19b71d4 -C extra-filename=-cdc37663d19b71d4 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_hangul=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_hangul-290aab54d532b223.rlib --extern unic_char_property=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_property-5c7d5a21d125f65d.rlib --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
    Finished release [optimized] target(s) in 1192.87 secs

real    19m53.444s
user    19m43.632s
sys 0m4.710s
CAD97 commented 6 years ago

Oh, I know exactly why unic/ucd/name would be taking so long: too much slice. The eventual better solution to the char->name table will hopefully make that worry a thing of the past, but for now I'm going to add in the simplified CharMap<&str> rather than CharMap<&[&str]> to reduce the amount of work the compiler has to do. Hopefully. Performance should be about the same, but binary size might increase temporarily. Again, the new table I'm working on will fix this properly. But a cold compile time for --all of <200s is nice to have again.

CAD97 commented 6 years ago

I would make that a nice single commit that could be reverted, but that's a bit beyond my git-rebase skill at the moment.

CAD97 commented 6 years ago

gworsh, CI caught a failure in a different feature configuration! Let me fix that real quick, then this should be good for the full review @behnam.

behnam commented 6 years ago

Great!

One thing that would be easy to do on top of this diff is to put all the str slices in one big const and slice out of that. (Eventually, we can optimize the the string, but for now a simple concatenation should do the job.)

CAD97 commented 6 years ago

That would be possible, but relatively pointless, as rustc already is already handling it fine at this level of slice. I'm pushing for the better suffix compression + everything else specific that the name table can give us.

behnam commented 6 years ago

What should we do about this PR, @CAD97? Would you like to rebase and see if there are still any compile problems?

CAD97 commented 6 years ago

Rebased. Now waiting on CI to see if the rebase broke anything. Also testing locally to see if it works with 8d2bc79 reverted.

This map structure is still a win over the current structure, though of course the name mapping will be the biggest win once that gets done.

I should have a bit more time I can spend on this moving forward (though of course I can't guarantee anything), and have a few more ideas for table formats that can help in different situations -- the name-specific trie needs to be done first and then #231 gives me two separate ideas as well (one of the two is what the actual inspiration did).

CAD97 commented 6 years ago

With simplified name tables image Non-simplified name tables caused a timeout across the board. Removing the revert.