rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

compilation time (of volatile array read?) is super-quadratic in array size #74476

Open ijackson opened 4 years ago

ijackson commented 4 years ago

Hi. I tried this code:

```rust
const COUNT: usize = 40000;

fn main() {
    let mut vol_buffer = [0u64; COUNT];

    let mut input = unsafe { std::ptr::read_volatile(&mut vol_buffer) };

    input[0] = 42;
    eprintln!("{:?}", input[0]);
}
```

Expected results:

Relatively quick compilation of a program that prints 42.

Actual results:

rustc took 45 seconds on my really rather fast laptop (and also used a great deal of memory). The compiled binary works as expected.

Changing the value of COUNT shows the following (measured with an earlier rustc, but the newly-updated one is about as slow):

| COUNT | compile time |
|-------|--------------|
| 10000 | 1.74s |
| 20000 | 5.9s |
| 30000 | 20s |
| 40000 | 46s |

I haven't waited for completion of any larger values. In my original version I had COUNT = a million; I killed rustc after some minutes and about 2G of RAM use.

Meta

`rustc --version --verbose`:

```
binary: rustc
commit-hash: 346aec9b02f3c74f3fce97fd6bda24709d220e49
commit-date: 2020-07-11
host: x86_64-unknown-linux-gnu
release: 1.46.0-nightly
LLVM version: 10.0
```
hanna-kruppe commented 4 years ago

A tremendous amount of machine code is generated (depending on COUNT) which of course takes time. The volatile load is loading a huge value, and because it's volatile, the generated code has to ensure every byte is loaded exactly once, so it can't just call memcpy. I don't see any way to avoid this problem, but I also wonder why you're doing a volatile load of the entire array in the first place.

hanna-kruppe commented 4 years ago

(Well, the amount of code generated should be linear in COUNT, not quadratic or worse, but since it's all in one basic block, one of the many passes over that code probably is non-linear in the amount of code being processed, and avoiding that is probably very hard and not worth the effort.)

the8472 commented 4 years ago

couldn't this be rewritten as a loop of volatile u64 reads?

hanna-kruppe commented 4 years ago

You're right that a loop would be a legal lowering, but I don't think there's a good way to actually generate one within the context of how CodeGen, and especially legalization, is done in LLVM today. But I just remembered that "volatile memcpy" is now a thing in LLVM, so perhaps rustc could be modified to emit that for sufficiently large values? However, this can call the ordinary memcpy (which may perform multiple accesses to the same byte), and more generally it's unspecified what memory accesses it performs, so it might not be what users of {read,write}_volatile expect or want. Once again, though, I don't know why OP has written the code like this in the first place, so it's hard to prioritize the issue or suggest workarounds.

ijackson commented 4 years ago

Thanks etc.

Hi again. I'm afraid I don't have much to add in the discussion of compiler internals.

My program does not need to do such a large volatile read and I have changed it not to; it now reads in smaller chunks. So this bug is not a practical problem for me and I'm not looking for help :-).

I thought I would report this as a bug because the behaviour seemed, to me as a non-compiler-engineer, quite surprising. It seemed to me that it was probably not intended and that you'd like to know about it.
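For illustration, here is a minimal sketch (not the OP's actual code, and the helper name is hypothetical) of what "reading in smaller chunks" can look like: one volatile load per element rather than a single huge volatile load of the whole array, which keeps the generated code roughly proportional to the element count:

```rust
use std::ptr;

// Hypothetical helper: read an array with one volatile load per element
// instead of a single volatile load of the whole aggregate. Each element
// is still read exactly once, but codegen stays a simple loop.
fn read_volatile_chunked(src: &[u64; 8]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..src.len() {
        // SAFETY: `src` is a valid, aligned reference, so every element
        // pointer derived from it is valid for reads.
        out[i] = unsafe { ptr::read_volatile(&src[i]) };
    }
    out
}

fn main() {
    let buf = [7u64; 8];
    let copy = read_volatile_chunked(&buf);
    assert_eq!(copy, buf);
    println!("{}", copy[0]); // prints 7
}
```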

Rust is usually very helpful and forgiving. (Admittedly, here I used unsafe, and unsafe is definitely treading hazardous ground.) I am trying to help make Rust more helpful and forgiving.

I don't know how common this kind of program is. If you have seen others have similar experiences, maybe at least some kind of compiler warning would be helpful? If you think there's little that could be done here, to make matters better, then fair enough and please close this issue.

What I was trying to do

Anyway, in the hope it's useful and/or interesting to you, I'll explain what I was trying to do. Try not to laugh too hard:

I used this technique because I was doing some benchmarking and needed to defeat the optimiser, which could otherwise reasonably have realised that my entire program was a no-op. I could have used IO instead, but the function I was benchmarking was a very complex way of describing the identity function.[1] The shortest (and to me most obvious) expression of this concept in Rust code was to have one single volatile read of a whole array of input values.

I'm not aware of a better way to make a barrier for the optimiser. IMO ideally there would be an operator, in std::mem perhaps, which would take a value and return it, but guarantee that the compiler would not optimise 'through' it. That is, an identity function whose innards the compiler does not know: the compiler must compute precisely the input value, and then ensure that all uses of the returned value use that one computation. You could then benchmark f by computing opt_barrier(f(opt_barrier(test_value))). A volatile memory barrier isn't quite right because it requires that the value be in memory. But maybe that is one for the language team.

Ian.

[1] Read only if you have a strong stomach: https://github.com/orlp/slotmap/pull/43#issuecomment-660546928 The benchmark program (now working, with a smaller volatile read) is attached.

Incidentally, I am very impressed with how Rust managed to deal with that ludicrous contraption. Out of curiosity I looked at the generated assembler for 32-bit ARM, and the relevant method contains simply one instruction: bx lr. Well done to all involved!

slotmap-slot-idx-test.rs.txt

the8472 commented 4 years ago

I'm not aware of a better way to make a barrier for the optimiser.

- https://docs.rs/bencher/0.1.5/bencher/fn.black_box.html
- https://doc.rust-lang.org/std/hint/fn.black_box.html
- https://doc.rust-lang.org/test/bench/fn.black_box.html

ijackson commented 4 years ago

Interesting, thanks. That is precisely right except it says it's not guaranteed to actually work :-).

the8472 commented 4 years ago

It's generally considered good enough for benchmarking; you can always look at the assembly output to make sure.
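As a sketch of the pattern being suggested: wrap both the input and the output of the function under test in `black_box`, so the optimiser can neither constant-fold the input nor discard the result. (At the time of this thread `black_box` lived in the unstable test/bencher crates; it has since been stabilized as `std::hint::black_box`. The function `f` here is a stand-in for the code under test, not anything from the linked PR.)

```rust
use std::hint::black_box;

// Stand-in for the (identity-like) function being benchmarked.
fn f(x: u32) -> u32 {
    x
}

fn main() {
    // Inner black_box: the compiler must treat 42 as an opaque value.
    // Outer black_box: the compiler must treat the result as used.
    let out = black_box(f(black_box(42)));
    assert_eq!(out, 42);
}
```

Note that, as discussed below, `black_box` is only a hint: it is documented as best-effort, so checking the generated assembly remains advisable for serious benchmarks.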

ijackson commented 4 years ago

I ended up (skim)reading a lot of the discussion in the various black_box issues etc.

I think what is needed is what is described in this comment: https://github.com/rust-lang/rfcs/issues/1484#issuecomment-182392175 Unfortunately what is provided is something considerably weaker. As documented, black_box is not really useful even for benchmarking, because (as you suggest) you still have to read the generated assembler to know whether it worked. It's also no good for crypto constant-time stuff, which was another use case. But I think that discussion should probably happen in some other issue.

In this issue: is it worth trying to make the compiler not quadratic in the size of the volatile array read? Or providing a warning about this or similar situations? If not then I think probably you'll want to close this issue, and sorry for wasting your time.

Thanks for your attention.

hanna-kruppe commented 4 years ago

I do think we should deprecate and warn about volatile reads and writes like this one. Generally, volatile accesses for aggregates (excluding those that map to architecturally-supported data types, such as repr(transparent) newtypes around primitives or SIMD types) have few legitimate use cases and several gotchas (no real guarantees about hardware memory access granularity, bad codegen as seen here). I am aware of a few use cases, but those have different needs from e.g. MMIO uses, so a different interface for them (cf. LLVM's volatile memcpy) seems like a better choice than overloading these operations. cc @rust-lang/wg-unsafe-code-guidelines

RalfJung commented 4 years ago

Cc https://github.com/rust-lang/unsafe-code-guidelines/pull/212