Error out on inputs with f64 values and tensors. #8826

Open silvasean opened 2 years ago

silvasean commented 2 years ago

Describe the bug

Miscompile of the following Torch code when invoked with tensor<4xf64> (same happens for +)

    def forward(self, a):
        return a * a

Note: In my original test case, the results come out wrong, rather than hanging, with the following tensor statistics:

    FAIL - "ElementwiseMulTensorFloatModule_basic"
        @ trace item #0 - call to "forward"
        @ output of call to "forward"
        ERROR: value (Tensor with shape=[4] min=+0.0, max=+136.0, mean=+34.0) is not close to golden value (Tensor with shape=[4] min=+0.0, max=+9.0, mean=+3.5)

(the full value of the result computed by IREE is [ 0., 136., 0., 0.] instead of [0, 1, 4, 9])

To Reproduce

$ iree-run-mlir /tmp/repro.mlir -function-input="4xf64=0 1 2 3" -iree-hal-target-backends=dylib
EXEC @forward
result[0]: hal.buffer_view
<hang>

Expected result: Not to hang, and to give the result "4xf64=0 1 4 9".

#map = affine_map<(d0) -> (d0)>
module attributes {torch.debug_module_name = "ElementwiseMulTensorFloatModule"} {
  func @forward(%arg0: tensor<4xf64>) -> tensor<4xf64> {
    %0 = linalg.init_tensor [4] : tensor<4xf64>
    %1 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%arg0 : tensor<4xf64>) outs(%0 : tensor<4xf64>) {
    ^bb0(%arg1: f64, %arg2: f64):
      %2 = arith.mulf %arg1, %arg1 : f64
      linalg.yield %2 : f64
    } -> tensor<4xf64>
    return %1 : tensor<4xf64>
  }
}
silvasean commented 2 years ago

I have another test case with a similar symptom:

module attributes {torch.debug_module_name = "NumToTensorFloatModule"} {
  func @forward() -> tensor<f64> {
    %cst = arith.constant 1.000000e+00 : f64
    %0 = linalg.init_tensor [] : tensor<f64>
    %1 = linalg.fill ins(%cst : f64) outs(%0 : tensor<f64>) -> tensor<f64>
    return %1 : tensor<f64>
  }
}

The returned value is +5.264e-315.

@benvanik any thoughts on what could be going wrong here? It feels like some sort of ABI marshaling / runtime thing for f64.
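
For what it's worth, that +5.264e-315 is consistent with an ABI marshaling problem: a minimal numpy sketch (assuming a little-endian host and zero bytes following the 4-byte f32 buffer) reproduces it by reading an f32 1.0 back as f64 bits.

import numpy as np

# Sketch of the suspected marshaling bug: the program stores an f32 1.0, but the
# caller reads the buffer as f64 (assumes little-endian, trailing bytes zero).
buf = np.zeros(8, dtype=np.uint8)
buf[:4] = np.frombuffer(np.float32(1.0).tobytes(), dtype=np.uint8)
print(buf.view(np.float64)[0])  # ~5.2635e-315, matching the value above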

benvanik commented 2 years ago

f64 is not currently supported - I have no idea what passing an f64 at runtime will do (this will be my answer to all f64 issues filed, so I'm not sure it's immediately useful to file more distinct f64 issues)

benvanik commented 2 years ago

(if you are demoting f64->f32 then you'll need to pass in f32 - if you pass in f64 it's just going to reinterpret cast that to f32 and give you garbage)

silvasean commented 2 years ago

Okay. It looks like 24 out of my 25 remaining failing test cases have an f64 involved, so I will need to dig further into what exactly is happening here. I suspect that it's just an ABI boundary thing and not a fundamental issue in IREE.

silvasean commented 2 years ago

@benvanik It would be really useful if you could add some verification passes at the right places that prevent f64's from sneaking through unexpectedly.

I was digging further into some of these test cases, and I saw IR like this being created by DemoteF64ToF32

func @forward() -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi = "{\22a\22:[],\22r\22:[[\22ndarray\22,\22f64\22,0]],\22v\22:1}"}} {
  %cst = arith.constant 1.000000e+00 : f32
  %0 = linalg.init_tensor [] : tensor<f32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<f32>) -> tensor<f32>
  %2 = hal.tensor.export %1 : tensor<f64> as tensor<f32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}

It looks like there are two things that would concretely be useful and would have turned these e2e "wrong output" failures into simple compiler diagnostics:

  1. Having hal.tensor.export report errors for bitwidth mismatches (there is a big scary TODO(benvanik) there -- hopefully we could at least error on the "obvious" cases :) )
  2. Emitting errors in the places in the compiler where "we expect there to be no f64's because they are not supported".
benvanik commented 2 years ago

Yeah, this is the problem with doing this stuff lower down in the stack vs in the frontends where we'd be able to ensure everything was consistent in the input :(

Note that the reflection information there is also incorrect and there's no way we're going to be parsing the JSON blob to update it. If you're planning on using that JSON blob (which is optional and not required by the runtime) you'll have to know not to trust it when it says f64.

benvanik commented 2 years ago

I also feel the best solution is to just reject inputs with f64s outright and not mess with the demotion - if someone can't change their model to not use them then I'm not sure we're the right target for them. At least, I don't think it's useful for us to play whack-a-mole with all the razor sharp edges it entails :(

silvasean commented 2 years ago

The reflection metadata is expected to be "incorrect" here -- that's the bread crumbs we use in the bindings to reconstruct the original dtype of the returned tensor, since we don't know if f32 is f32 or demoted f64 otherwise. Generally speaking, the point of the reflection metadata is to provide additional ABI information that is non-redundant with the IR -- e.g. a "dict" might be expanded into two lists -- it's up to the contract of the ABI to see "this was a dict" in the reflection metadata and know that it will be represented as two lists and how to get a dictionary back out. ABI's are inherently a "pact of mutual aggression" between the frontend and backend :)

I agree that having a pass which outright rejects f64 would be great -- but I think that present use cases require that it sit below the DemoteF64ToF32 pass, and then we can work to push it up the stack.
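
To make those bread crumbs concrete, here is a rough sketch of the binding-side reconstruction using the reflection blob from the IR above; restore_result_dtype is a hypothetical helper, not an existing IREE binding API:

import json
import numpy as np

# Reflection blob from the hal.tensor.export example above: "r" describes the
# results, here a rank-0 ndarray whose original (pre-demotion) dtype was f64.
REFLECTION = '{"a":[],"r":[["ndarray","f64",0]],"v":1}'

def restore_result_dtype(result: np.ndarray, reflection_json: str) -> np.ndarray:
    """Hypothetical binding-layer helper: cast a demoted f32 result back to the
    dtype recorded in the reflection metadata."""
    kind, dtype_name = json.loads(reflection_json)["r"][0][:2]
    if kind == "ndarray" and dtype_name == "f64" and result.dtype == np.float32:
        return result.astype(np.float64)
    return result

# The compiled program actually returns f32 data after DemoteF64ToF32.
print(restore_result_dtype(np.array(1.0, dtype=np.float32), REFLECTION).dtype)  # float64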

benvanik commented 2 years ago

I think the metadata needs to be extended to support that - what you're saying makes it impossible to distinguish cases where we do support f64 - you won't know if you see an f64 whether you need to convert that to f32 or pass it in directly. Each frontend binding needing to have this logic is really not great.

If the runtime side has f64 data then I'm not sure IREE with demotion is going to be a great choice with the current approach - it's going to require you to repack all the data each time you call into the program. I'm not sold on there being enough use cases for this kind of stuff such that it's worth baking fragile/transient assumptions into the system for. It's less sketchy in situations where the user is controlling the inputs ("I know I'm going to need to pass f32 buffers, and will pass f32 buffers, because I set the f64->f32 flag") but this kind of reflection-based metaprogramming goo layer stuff is just pain. You could achieve the same thing without any reflection metadata: if you get an f64 input then convert it to f32 and pass in the f32, and if you get an f32 out then it's really f32 data and can wrap it in an ndarray with the f32 data type - there's no need to pretend it's anything else. If there is a need that sounds like something that's use-case specific and not something that is universal and worth baking into the system (vs the frontend layer).

Do we have any legit use case for f64 where demotion to f32 is valid? I don't count "we have existing unchangeable benchmark suites" as legit - you're not going to be able to compare numbers with a framework that's actually running in f64, for example, and this stuff is so full of sharp edges that it's always going to be a source of issues :/

benvanik commented 2 years ago

(not saying we can't make what we have work as a hack, more trying to signal that this is not a good path and something that shouldn't be allowed to ossify as part of the system)

silvasean commented 2 years ago

+1 on not ossifying. I think we're in this twilight zone where it is worth having a story, even if not great, to allow frontends to bootstrap with IREE and run at all, so that they can then weigh the significant effort to bubble up this f64 issue to the frontends. Arguably current frontends don't even have the right abstractions for pushing this up properly (you would ideally want abstract "preferred_real" and "preferred_accumulation_real" etc. dtypes for users, kind of like "index" -- but that would make all users rewrite their programs :/ ).

There is a certain elegance to doing this uniformly at this low level in the stack. Not saying that IREE should be responsible for it, but it actually isn't that ugly compared to the significant complexity of doing this at the frontend level.

You could achieve the same thing without any reflection metadata: if you get an f64 input then convert it to f32 and pass in the f32, and if you get an f32 out then it's really f32 data and can wrap it in an ndarray with the f32 data type - there's no need to pretend it's anything else.

On the input side I agree it is easy, however on the output side we do need to reconstruct whether it is f64 or f32 and cast back to the appropriate dtype -- our tests check the dtype is correct. E.g. tests for Torch's type promotion logic or things which are inherently f64 by user request.

silvasean commented 2 years ago

I do think that putting a Gandalf-style "you shall not pass" pass somewhere in the compiler would be useful given the present state.

(image: Gandalf "you shall not pass" meme)

benvanik commented 2 years ago

Agreed with the need for a flexible precision floating point type - shading languages have had this for ages for the same reason (lowp mediump highp etc). It also allows for the precision in which something is calculated to differ from the storage precision - even if we can't perform f64 math we could store f64s and do the math in a lower precision, etc. I don't think it's hard to get to the point where we can blanket accept f64 with the caveat that it's going to be slow/bloated/etc - it may be as simple as @antiagainst doing the load/store emulation and f64->f32 in the dispatches, or maybe we just say you require a GPU with f64 support (read: desktop AMD/nvidia) if you want to run the model. If a user complains that their model won't run on their phone then they can change their types.

I think there's a "frontend part of the compiler should fix types" thing like you mention but I'm more driving towards "user should fix types" as a first-order solution. It sounds like you're running tests that use f64: my assertion is that you shouldn't be running those tests, and engineering these kinds of wonky solutions so that some arbitrary tests pass is not a great use of any of our time :)

Think of it like if you were trying to port some C that needlessly used double to a tiny processor: the pragmatic approach with the least risk of running afoul of toolchain/performance/size issues would be to s/double/float/g and move on, not try to get the entire SPEC 64-bit float benchmark suite going. If that's not possible for a particular input then there's likely some other major things hiding behind that which aren't possible either - in our case things like custom ops and other legacy ML badness.

So again, if there are legit models that people want to deploy with IREE where they absolutely cannot change a single line of python then they probably aren't good candidates for bootstrapping. I don't think mucking with types and doing other weird things while treating those inputs as black boxes is going to be worthwhile. Just IMO, but I'm really not enjoying this mess and am having a hard time justifying any effort on this vs all the other things that need to be done, and don't see this as something that should block any practically usable torch models :)

benvanik commented 2 years ago

(and agree with a gandalf defence - I'll go add a pass that says "don't use f64" if that lets us table these issues for now :)

silvasean commented 2 years ago

Okay, I think I will just XFAIL all the f64-related tests on IREE for now then.

benvanik commented 2 years ago

If that works that'd be great - will let us focus on the big blockers. I'd still like to get this all working without surprises and will add a pass to bail on f64 so the errors are clearer. I've also got the backlog of verifiers to add and since these hal ops are introduced really early I'll prioritize making those better first. In the meantime maybe we can motivate someone to do the f64 storage + f32 compute stuff so we could run the f64 stuff and not need the demotion at all.

benvanik commented 2 years ago

Thought of a different approach - always allow f64 input and perform the f64<->f32 conversion by reinterpret casting to i64 and shuffling the bits in linalg ops - then we compile as normal and the apps still see the f64 on the boundary. This wouldn't work if your tests are also checking for precision - are they?

(this would mean no ABI changes or anything user visible)

benvanik commented 2 years ago

(that also may not work for special cases, but really one shouldn't be sending in -inf and nan and stuff anyway)

benvanik commented 2 years ago

Well, I guess that's just the same idea I had before for fixing things up, only in a different place and (hopefully) preventing some weirdness. It'd still need the SPIR-V side to support i64 load/store.

Naively:

func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %ret0 = arith.addf %arg0, %arg0 : tensor<4xf64>
  return %ret0 : tensor<4xf64>
}

->

func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %arg0_f32 = arith.truncf %arg0 : tensor<4xf64> to tensor<4xf32>
  %ret0 = arith.addf %arg0_f32, %arg0_f32 : tensor<4xf32>
  %ret0_f64 = arith.extf %ret0 : tensor<4xf32> to tensor<4xf64>
  return %ret0_f64 : tensor<4xf64>
}

The issue with this would be that we could not guarantee that the trunc/ext wouldn't move around/get elided/etc. Maybe a named op:

func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %arg0_f32 = linalg_ext.castf %arg0 : tensor<4xf64> -> tensor<4xf32>
  %ret0 = arith.addf %arg0_f32, %arg0_f32 : tensor<4xf32>
  %ret0_f64 = linalg_ext.castf %ret0 : tensor<4xf32> -> tensor<4xf64>
  return %ret0_f64 : tensor<4xf64>
}

We could at least then control how that op behaves, but there's still no guarantee that other ops don't get inserted along the f64 edges - though we'd be the ones inserting them, so we could at least fix things ourselves.

Those cast ops would be implemented softfloat-style on targets like SPIR-V or 32-bit CPUs, e.g.: https://sourcegraph.com/github.com/freebsd/freebsd-src/-/blob/lib/libc/softfloat/bits64/softfloat.c?L2504-2526 https://sourcegraph.com/github.com/freebsd/freebsd-src/-/blob/lib/libc/softfloat/bits64/softfloat.c?L1543-1563
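
For illustration, a rough Python equivalent of what such a bit-shuffling cast boils down to (a truncating sketch only: rounding, denormal results, and NaN payloads are elided, per the earlier caveat about -inf/nan):

import struct

def f64_bits_to_f32_bits(bits64: int) -> int:
    """Truncating binary64 -> binary32 bit conversion; rounding, denormal results,
    and NaN payloads are deliberately elided."""
    sign = (bits64 >> 63) & 0x1
    exp64 = (bits64 >> 52) & 0x7FF
    mant64 = bits64 & ((1 << 52) - 1)
    exp32 = exp64 - 1023 + 127            # re-bias the exponent
    if exp32 <= 0:
        return sign << 31                 # underflow -> signed zero
    if exp32 >= 0xFF:
        return (sign << 31) | 0x7F800000  # overflow / inf / nan -> inf
    return (sign << 31) | (exp32 << 23) | (mant64 >> (52 - 23))

bits64 = struct.unpack("<Q", struct.pack("<d", 3.14159))[0]
bits32 = f64_bits_to_f32_bits(bits64)
print(struct.unpack("<f", struct.pack("<I", bits32))[0])  # ~3.14159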

I still hate it :/ We should really just fully support f64 or bail.

silvasean commented 2 years ago

Thought of a different approach - always allow f64 input and perform the f64<->f32 conversion by reinterpret casting to i64 and shuffling the bits in linalg ops - then we compile as normal and the apps still see the f64 on the boundary. This wouldn't work if your tests are also checking for precision - are they?

(this would mean no ABI changes or anything user visible)

Our e2e tests all have a general tolerance (1e-5 or some such). They don't do super fine-grained checking that we really have 52 vs 24 bits of precision.
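
That tolerance leaves a lot of headroom relative to f32 precision for well-scaled values; a quick numpy check (a sketch over an arbitrary value range):

import numpy as np

# Relative error from demoting well-scaled f64 values to f32 is ~2^-24 (~6e-8),
# comfortably inside a 1e-5 tolerance - which is why demotion passes these checks.
x = np.linspace(0.1, 100.0, 1001, dtype=np.float64)
roundtrip = x.astype(np.float32).astype(np.float64)
print(np.max(np.abs(roundtrip - x) / np.abs(x)))   # ~6e-8
print(np.allclose(roundtrip, x, rtol=1e-5))        # True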

benvanik commented 2 years ago

I'll continue noodling on this to see if there's a clean path through. I took a look at making the hal ops update, but where the pass happens (util) cannot depend on higher-level dialects (hal). We could add an interface, but that's more nasty workarounds that shouldn't otherwise exist, so I'm hesitant to go that route.

silvasean commented 2 years ago

I was looking at the pass -- would it break everything if we also generically converted TypeAttrs? I think the issue here is a top-level TypeAttr that doesn't get converted. The existing code seems defensive about converting attributes, but the justification seems mostly about attrs with actual values (e.g. 1 : i64), rather than TypeAttr itself.

benvanik commented 2 years ago

We could try - there are sketchy corner cases and more implicit assumptions that come along with that (like if coming from tflite, types may be wrapped in their custom type attrs, or if sparsity is used the encoding may be related to the element type, etc), but this whole approach is full of those anyway. It's a pool of alligators, and if you're fine with taking a swim then what's one more alligator thrown in? I think switching the behavior so we fail outright on f64 unless you pass the demotion flag makes any badness that comes with the approach opt-in - a user should have to choose to jump into the pool -iree-abi-🤿-🐊=true :P

But I'm still against baking in implicit type conversion across the ABI boundary - we really don't want to be advertising one type but accepting another. Relying on inconsistent reflection metadata to do type casting is going to be really hard to unwind when trying to implement f64 for real (we're going to break you and your users, likely repeatedly, and in surprising ways). If you had a flag on your python bindings like torch.iree.whatever.AUTO_F64_TO_F32_CONVERSION = true that could switch the behavior in your interop layer that'd at least give a clear indication that this magic was happening and let us change that independently of what the compiler is doing.
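
A minimal sketch of what such an opt-in interop shim could look like - the flag name and invoke_fn are placeholders, not real IREE or torch-mlir APIs:

import numpy as np

# Placeholder for the suggested opt-in switch; not an existing setting anywhere.
AUTO_F64_TO_F32_CONVERSION = True

def invoke_with_demotion(invoke_fn, *args):
    """Demote f64 ndarray inputs to f32 before calling a program compiled with the
    f64->f32 demotion flag; invoke_fn stands in for whatever callable the bindings
    expose for the compiled function."""
    if AUTO_F64_TO_F32_CONVERSION:
        args = tuple(
            a.astype(np.float32)
            if isinstance(a, np.ndarray) and a.dtype == np.float64
            else a
            for a in args
        )
    return invoke_fn(*args)

print(invoke_with_demotion(lambda a: a.dtype, np.zeros(4, dtype=np.float64)))  # float32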

Basically, if the compiler demotes f64 to f32 then the reflection information needs to be consistent and report f32. I still don't see a practical use case where a user would want the badness of f64<->f32 conversion happening synchronously on the host as an implicit operation on the ABI boundary. I'd much rather have users sad because they can't yet use f64 than think we're orders of magnitude slower because it's happening under the covers and someplace we can't easily observe with profiling tools - we've already had some of that with the python layer, where the IREE tools are all fine but things are slower in python, and it's been really annoying to deal with. I'm also not sure someone who really wanted to use f64 would be happy with the precision being so far off, and I'm not looking forward to triaging those reported accuracy issues.

There's also the 🤷 alternative, which mostly works today, which is to allow f64 only when using CUDA or something. CPU may work when targeting 64-bit but will likely require us to add all the softfloat emulation when targeting 32-bit systems. It'll also run afoul of vectorization/SIMD issues, I suspect. More uneven support for a corner case without motivation does not spark joy, though, and keeps the whole thing on the YMMV path.

I think the right approach is for us to do the conversion if required in the compiled code - then we could still have a flag for relaxed precision but don't need ABI changes. We just have to get our compiler handling the conversions right and everything else becomes an optimization problem we can control and have visibility into. I don't think it's a tremendous amount of work and mostly shared with what we need to do with i64, just something difficult to motivate given that f64 is not on the roadmap. If you can convince Lei to put this on his OKRs (SPIR-V is where we have the issues) then we'll have many more options :)

Concrete next steps in my mind would be starting with linalg codegen using f64 and ensuring our targets can handle it. Once they do we can easily reconfigure the input layers to perform relaxed precision demotion without user-visible changes and avoid all this reflection stuff.

benvanik commented 2 years ago

(and sorry it feels like a reversal of the original plans - I hadn't considered that we'd be doing the reflection-based implicit conversion - that's my bad!)

silvasean commented 2 years ago

I hadn't considered that we'd be doing the reflection-based implicit conversion - that's my bad!

Sorry, I'm having trouble parsing this -- do you mean that you want to use reflection metadata as in https://github.com/google/iree/pull/8902 to capture the "original" types? I think you managed to convince me that that's not the best approach.

benvanik commented 2 years ago

Sorry (again), was more realizing that we had chatted about all this demotion stuff and thought we had it all worked out until the reflection stuff came up and I felt bad for dragging you through my realization process :) I don't believe reflection metadata + implicit conversion in the binding layer is the right way to go.

silvasean commented 2 years ago

@benvanik -- did we ever implement the gandalf pass for this? I recently saw some errors related to this deep in VMConversion.

I think we can probably close this after that -- the bug is basically "horrible user experience when attempting to use f64 which is not supported" -- getting it to a nice, deterministic compile time diagnostic seems sufficient to close this for now.

allieculp commented 2 years ago

@benvanik Is there any work remaining here? Attempting to triage with labels/priority.

benvanik commented 2 years ago

I don't think anything has been done and I'm unlikely to work on it. We have a VerifyInputLegality pass that could be extended here: https://github.com/google/iree/blob/main/compiler/src/iree/compiler/Dialect/Flow/Transforms/VerifyInputLegality.cpp#L24

allieculp commented 2 years ago

Propose to add to backlog?

silvasean commented 2 years ago

sg

benvanik commented 2 years ago

I think it's worth having but I'm not likely to do it - it's a perfect starter task :)