Open CAD97 opened 6 years ago
The PR looks great so far!
Separating the keys and values into separate tables isn't just good for packing, it's also great for cache pressure since the values of all the keys not being searched for don't pollute the cache. The savings in memory being touched for searching should far outweigh the single cache miss looking up the final value.
Furthermore LLVM should be able to figure out when the tables have the same length to eliminate some bound checks.
Have you considered generating these rsv (heh, rust separated values) files in a custom build step instead of having them checked into source control? (I can imagine generating them on build every time slows down compile times too much...)
Surprisingly, the rsv generation is rather quick in release mode. Computers are fast, even with the simple > efficient mindset of the generation code's ownership model.
However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.
I agree that it'd be better to get the downloaded/generated files out of the git repository, but that's future work and it's just easier to commit for now.
Additionally, the large number of subcrates makes it even more interesting. The end goal might be unic-gen as a library and used in individual build scripts, but for now a single manual script is easier. It's already much better than the collection of Python that most repos like this run off of, as it's actually decently documented (and if it's not, that's mostly my fault).
Haven't looked at the diff yet, but just a quick reply to the conversion here...
rsv (heh, rust separated values)
The idea was just "Rust Value", since they can be included where a value is expected.
And, I believe we have rsd
, short for "Rust Definition", which contains items, to be included inside a module, trait or type definition.
However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.
One main reason we have decided to commit generated tables into the repo is that we want to make sure no-std components stay no-std, even in their build step. Generating these steps in no-std would add extra complexity, which IMHO is not necessary. Eventually, if we conclude that it's not a useful feature to have, we can move the generation code into build scripts.
On the source data side, we definitely don't want any build process/system to pull data over the web. It's bad for our reliability, and it's bad for unicode.org, etc servers. I'm working on setting up official git mirrors for Unicode data files. When those are available, we can make data/*
directories to be just git submodules, mirroring the official repos.
Overall, IMHO we have more important features to focus on right now, and the data/
and gen/
parts of the code-base are stable enough and don't need major changes (except table format changes which @CAD97 is working on). Of course if there are any issues with the process, we can start a new issue discussion and get to the details.
That merges in the changes since I started the branch.
We'll see how compile time goes on Travis, and if it's abysmal then I'll try the stopgap simplification of unic/ucd/name to just a direct mapping rather than the pieces currently used; that will be fewer things for the compiler to worry about.
I still plan to do the specialized table, but hopefully this PR is landable separate from that.
Timeout
Compiling unic-data v0.0.0 (file:///home/travis/build/behnam/rust-unic/data) Running `rustc --crate-name unic_data data/src/main.rs --crate-type bin --emit=dep-info,link -C debuginfo=2 -C metadata=dc3747200008d173 -C extra-filename=-dc3747200008d173 --out-dir /home/travis/build/behnam/rust-unic/target/debug/deps -C incremental=/home/travis/build/behnam/rust-unic/target/debug/incremental -L dependency=/home/travis/build/behnam/rust-unic/target/debug/deps --extern serde_derive=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde_derive-ea47998e28307385.so --extern serde=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde-a8e442193f4c8269.rlib --extern toml=/home/travis/build/behnam/rust-unic/target/debug/deps/libtoml-fb4ac857952b9e34.rlib --extern tokio_core=/home/travis/build/behnam/rust-unic/target/debug/deps/libtokio_core-e3da18abb1df214d.rlib --extern clap=/home/travis/build/behnam/rust-unic/target/debug/deps/libclap-a12ff8fc86df6425.rlib --extern hyper=/home/travis/build/behnam/rust-unic/target/debug/deps/libhyper-4fe365a952c190d5.rlib --extern futures=/home/travis/build/behnam/rust-unic/target/debug/deps/libfutures-455c535dc7e0a786.rlib` No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
I don't think I touched unic-data
; ideas @behnam ? Last master build is still green, so I have no idea why it's timing out here.
Of note: it's timing out on 1.22 as well, so it's not a Rust change.
It's very strange for unic_data
to fail on all builds, @CAD97. I got a similar time out on Travis once last week, only on one or two configs. It went well after hitting the Restart button. Maybe just try that once?
The AppVeyor failure seems to be unrelated to the Travis one. I believe serde-derive
has some ongoing issues on nightly
at the moment. There might be reports on that on https://github.com/rust-lang/rust already.
Rerunning Travis.
All builds timed out on unic_data again. I'm going to ask Travis to run a build off of master, and see if that fails as well. I was having some build time oscillation on my local machine, but still don't really see what I might have changed for that crate's build perf, and I just did a cold build for unic_data at 1m16s on my local machine.
I'm scared that this is some sort of build indeterminacy being exacerbated by Travis's environment.
Master build passed. I'm going to restart Travis on this PR one more time, then if that fails accross the board again I'll try and see what I can diagnose locally. @behnam if you've got a not-Windows machine, it'd be useful if you could run a cold build and see what you get. FWIW my last cold build --all took a total of ~21m (though I should note I built unic-data separately first).
Did one more build locally, cold build --all
. Hung at unic-data
at around 20% cpu utilization for >10m.
real 20m32.337s
user 0m0.031s
sys 0m0.015s
Just curious on how this compares since it's working from a cold cache anyway.
No difference. Will remove the de-incremental tomorrow.
Okay, I've tried on my MacBook Pro and get similar results.
Here's what I think is happening: unic-ucd-name
is the crate with build problem, but we see unic-data
as the last item in the list because that's the one with most external dependencies. When unic-ucd-name
gets stuck, build for all other components await that, but unic-data
gets cleared from deps and gets build eventually.
To see this, you can limit the parallel builds to one.
Now I'm going to take a look at the code and why the build halts.
Looks like this is back to the unknown situation where just touching the names table puts the compiler in seemingly-infinite loop.
Here's an idea to move forward and figure out the names problem later: how about you keep the new trait and type definitions along with the existing data type, and migrate over anything that doesn't break, and we will dig deeper into the ucd-name component afterwards, knowing the lower types are working fine.
FYI, I got unic-ucd-name
compiled on this branch on my MacBook Pro in about 19 minutes:
$ time cargo +stable build --release --verbose --manifest-path unic/ucd/name/Cargo.toml
Compiling unic-char-range v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/range)
Compiling unic-common v0.7.0 (file:///Users/behnam/code/unic/unic/unic/common)
Running `rustc --crate-name unic_char_range unic/char/range/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=f3ac911853d0e14a -C extra-filename=-f3ac911853d0e14a --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
Running `rustc --crate-name unic_common unic/common/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=bbd743bcd5fc1b63 -C extra-filename=-bbd743bcd5fc1b63 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
Compiling unic-ucd-version v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/version)
Running `rustc --crate-name unic_ucd_version unic/ucd/version/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=eb58ae67be7e945f -C extra-filename=-eb58ae67be7e945f --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_common=/Users/behnam/code/unic/unic/target/release/deps/libunic_common-bbd743bcd5fc1b63.rlib`
Compiling unic-char-property v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/property)
Running `rustc --crate-name unic_char_property unic/char/property/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=5c7d5a21d125f65d -C extra-filename=-5c7d5a21d125f65d --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_char_range=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_range-f3ac911853d0e14a.rlib`
Compiling unic-ucd-hangul v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/hangul)
Running `rustc --crate-name unic_ucd_hangul unic/ucd/hangul/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=290aab54d532b223 -C extra-filename=-290aab54d532b223 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
Compiling unic-ucd-name v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/name)
Running `rustc --crate-name unic_ucd_name unic/ucd/name/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=cdc37663d19b71d4 -C extra-filename=-cdc37663d19b71d4 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_hangul=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_hangul-290aab54d532b223.rlib --extern unic_char_property=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_property-5c7d5a21d125f65d.rlib --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
Finished release [optimized] target(s) in 1192.87 secs
real 19m53.444s
user 19m43.632s
sys 0m4.710s
Oh, I know exactly why unic/ucd/name would be taking so long: too much slice. The eventual better solution to the char->name table will hopefully make that worry a thing of the past, but for now I'm going to add in the simplified CharMap<&str>
rather than CharMap<&[&str]>
to reduce the amount of work the compiler has to do. Hopefully. Performance should be about the same, but binary size might increase temporarily. Again, the new table I'm working on will fix this properly. But a cold compile time for --all
of <200s is nice to have again.
I would make that a nice single commit that could be reverted, but that's a bit beyond my git-rebase skill at the moment.
gworsh, CI caught a failure in a different feature configuration! Let me fix that real quick, then this should be good for the full review @behnam.
Great!
One thing that would be easy to do on top of this diff is to put all the str
slices in one big const and slice out of that. (Eventually, we can optimize the the string, but for now a simple concatenation should do the job.)
That would be possible, but relatively pointless, as rustc already is already handling it fine at this level of slice. I'm pushing for the better suffix compression + everything else specific that the name table can give us.
What should we do about this PR, @CAD97? Would you like to rebase and see if there are still any compile problems?
Rebased. Now waiting on CI to see if the rebase broke anything. Also testing locally to see if it works with 8d2bc79 reverted.
This map structure is still a win over the current structure, though of course the name mapping will be the biggest win once that gets done.
I should have a bit more time I can spend on this moving forward (though of course I can't guarantee anything), and have a few more ideas for table formats that can help in different situations -- the name-specific trie needs to be done first and then #231 gives me two separate ideas as well (one of the two is what the actual inspiration did).
With simplified name tables
Non-simplified name tables caused a timeout across the board. Removing the revert.
Supersedes #207
Step one of refactoring the table structure for better future growth. I'll fix the merge conflict soon, but wanted to get this out.
I've been playing around with the tables ever since the thread prompted by miri. This is an incremental move towards the table design that I think is what we want.
The enum has been slit into two distinct types. This skips the branch required before access.
Both of the tables are now indirect -- this means having a separate
&[char]
and&[T]
, so everything should pack better. Asslice::binary_search
returns an index anyway, this seems like an all-around better choice for our associative slices, since space compression is what we're mostly concerned with, and the cyclomatic cost remains the same. I suspect any speed lost by not having the value next to the key should be gained in the better packing of data for the binary search. See the first commit for the full argument.After this, the next step will be to add a table specifically for the
char->Name
translation. See the multiple places I've markedTODO(CAD97)
for the upcoming work.If we ever get fields with a type of
impl Trait
, it would probably be prudent to add a trait for theCharMap
shape, so that the table structure can be changed in the future without affecting the public API surface. (Along those lines, I think there might be a public dependency in canonical_composition_mapping that we'd want to get rid of before bringingunic-ucd
to 1.0, based on what I had to modify.)r? @behnam ping @CasualX; this sort of integrates some of the path you took.
This change is