silvasean opened this issue 2 years ago
I have another test case with a similar symptom:
```mlir
module attributes {torch.debug_module_name = "NumToTensorFloatModule"} {
  func @forward() -> tensor<f64> {
    %cst = arith.constant 1.000000e+00 : f64
    %0 = linalg.init_tensor [] : tensor<f64>
    %1 = linalg.fill ins(%cst : f64) outs(%0 : tensor<f64>) -> tensor<f64>
    return %1 : tensor<f64>
  }
}
```
The returned value is +5.264e-315.
@benvanik any thoughts on what could be going wrong here? It feels like some sort of ABI marshaling / runtime thing for f64.
f64 is not currently supported - I have no idea what passing an f64 at runtime will do (this will be my answer to all f64 issues filed, so I'm not sure it's immediately useful to file more distinct f64 issues)
(if you are demoting f64->f32 then you'll need to pass in f32 - if you pass in f64 it's just going to reinterpret cast that to f32 and give you garbage)
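To illustrate the reinterpretation described here, the following sketch shows where a junk value like the +5.264e-315 above can come from: the compiler demoted the constant 1.0 to f32, but the caller still reads the buffer as f64. This is a hedged host-side model in Python, assuming little-endian layout and zeroed trailing bytes.

```python
import struct

# The demoted program stores float32 1.0 (bit pattern 0x3F800000) in the
# result buffer. A caller that trusts the advertised f64 signature reads
# 8 bytes and reinterprets them as a double.
f32_bytes = struct.pack('<f', 1.0)
as_seen_by_caller = f32_bytes + b'\x00' * 4  # assume the next 4 bytes are zero
(garbage,) = struct.unpack('<d', as_seen_by_caller)
print(f"{garbage:+.3e}")  # -> +5.264e-315, the denormal reported above
```

The 0x3F800000 bit pattern of f32 1.0, read as the low bits of a double, is exactly the tiny denormal seen in the bug report.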
Okay. It looks like 24 out of my 25 remaining failing test cases have an f64 involved, so I will need to dig further into what exactly is happening here. I suspect that it's just an ABI boundary thing and not a fundamental issue in IREE.
@benvanik It would be really useful if you could add some verification passes at the right places that prevent f64's from sneaking through unexpectedly.
I was digging further into some of these test cases, and I saw IR like this being created by DemoteF64ToF32
```mlir
func @forward() -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi = "{\22a\22:[],\22r\22:[[\22ndarray\22,\22f64\22,0]],\22v\22:1}"}} {
  %cst = arith.constant 1.000000e+00 : f32
  %0 = linalg.init_tensor [] : tensor<f32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<f32>) -> tensor<f32>
  %2 = hal.tensor.export %1 : tensor<f64> as tensor<f32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}
```
It looks like there are two things that would concretely be useful and would have turned these e2e "wrong output" failures into simple compiler diagnostics:
Yeah, this is the problem with doing this stuff lower down in the stack vs in the frontends where we'd be able to ensure everything was consistent in the input :(
Note that the reflection information there is also incorrect and there's no way we're going to be parsing the JSON blob to update it. If you're planning on using that JSON blob (which is optional and not required by the runtime) you'll have to know not to trust it when it says f64.
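For reference, the escaped `iree.abi` blob in the IR above is plain JSON once unescaped. A quick sketch of what a binding would see if it did trust it (the blob string is copied verbatim from the attribute above):

```python
import json

# The iree.reflection attribute from the demoted module, unescaped.
abi = json.loads('{"a":[],"r":[["ndarray","f64",0]],"v":1}')
result_dtype = abi["r"][0][1]
print(result_dtype)  # "f64" -- yet the function actually returns a tensor<f32>
```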
I also feel the best solution is to just reject inputs with f64s outright and not mess with the demotion - if someone can't change their model to not use them then I'm not sure we're the right target for them. At least, I don't think it's useful for us to play whack-a-mole with all the razor sharp edges it entails :(
The reflection metadata is expected to be "incorrect" here -- that's the breadcrumbs we use in the bindings to reconstruct the original dtype of the returned tensor, since we don't know if f32 is f32 or demoted f64 otherwise. Generally speaking, the point of the reflection metadata is to provide additional ABI information that is non-redundant with the IR -- e.g. a "dict" might be expanded into two lists -- it's up to the contract of the ABI to see "this was a dict" in the reflection metadata and know that it will be represented as two lists and how to get a dictionary back out. ABIs are inherently a "pact of mutual aggression" between the frontend and backend :)
I agree that having a pass which outright rejects f64 would be great -- but I think that present use cases require that it sit below the DemoteF64ToF32 pass, and then we can work to push it up the stack.
I think the metadata needs to be extended to support that - what you're saying makes it impossible to distinguish cases where we do support f64 - you won't know if you see an f64 whether you need to convert that to f32 or pass it in directly. Each frontend binding needing to have this logic is really not great.
If the runtime side has f64 data then I'm not sure IREE with demotion is going to be a great choice with the current approach - it's going to require you to repack all the data each time you call into the program. I'm not sold on there being enough use cases for this kind of stuff such that it's worth baking fragile/transient assumptions into the system for. It's less sketchy in situations where the user is controlling the inputs ("I know I'm going to need to pass f32 buffers, and will pass f32 buffers, because I set the f64->f32 flag") but this kind of reflection-based metaprogramming goo layer stuff is just pain. You could achieve the same thing without any reflection metadata: if you get an f64 input then convert it to f32 and pass in the f32, and if you get an f32 out then it's really f32 data and can wrap it in an ndarray with the f32 data type - there's no need to pretend it's anything else. If there is a need that sounds like something that's use-case specific and not something that is universal and worth baking into the system (vs the frontend layer).
Do we have any legit use case for f64 where demotion to f32 is valid? I don't count "we have existing unchangable benchmark suites" as legit - you're not going to be able to compare numbers with a framework that's actually running in f64, for example, and this stuff is so full of sharp edges that it's always going to be a source of issues :/
(not saying we can't make what we have work as a hack, more trying to signal that this is not a good path and something that shouldn't be allowed to ossify as part of the system)
+1 on not ossifying. I think we're in this twilight zone where it is worth having a story, even if not great, to allow frontends to bootstrap with IREE and run at all, so that they can then weigh the significant effort to bubble up this f64 issue to the frontends. Arguably current frontends don't even have the right abstractions for pushing this up properly (you would ideally want abstract "preferred_real" and "preferred_accumulation_real" etc. dtypes for users kind of like "index" -- but that would make all users rewrite their programs :/ ).
There is a certain elegance to doing this uniformly at this low level in the stack. Not saying that IREE should be responsible for it, but it actually isn't that ugly compared to the significant complexity of doing this at the frontend level.
> You could achieve the same thing without any reflection metadata: if you get an f64 input then convert it to f32 and pass in the f32, and if you get an f32 out then it's really f32 data and can wrap it in an ndarray with the f32 data type - there's no need to pretend it's anything else.
On the input side I agree it is easy, however on the output side we do need to reconstruct whether it is f64 or f32 and cast back to the appropriate dtype -- our tests check the dtype is correct. E.g. tests for Torch's type promotion logic or things which are inherently f64 by user request.
I do think that putting a Gandalf-style "you shall not pass" pass somewhere in the compiler would be useful given the present state.
Agreed with the need for a flexible precision floating point type - shading languages have had this for ages for the same reason (lowp, mediump, highp, etc). It also allows for the precision in which something is calculated to differ from the storage precision - even if we can't perform f64 math we could store f64s and do the math in a lower precision, etc. I don't think it's hard to get to the point where we can blanket accept f64 with the caveat that it's going to be slow/bloated/etc - it may be as simple as @antiagainst doing the load/store emulation and f64->f32 in the dispatches, or maybe we just say you require a GPU with f64 support (read: desktop AMD/nvidia) if you want to run the model. If a user complains that their model won't run on their phone then they can change their types.
I think there's a "frontend part of the compiler should fix types" thing like you mention but I'm more driving towards "user should fix types" as a first-order solution. It sounds like you're running tests that use f64: my assertion is that you shouldn't be running those tests, and engineering these kind of wonky solutions so that some arbitrary tests pass is not a great use of any of our time :)
Think of it like if you were trying to port some C that needlessly used double to a tiny processor: the pragmatic approach with the least risk of running afoul of toolchain/performance/size issues would be to s/double/float/g and move on, not try to get the entire SPEC 64-bit float benchmark suite going. If that's not possible for a particular input then there's likely some other major things hiding behind that which aren't possible either - in our case things like custom ops and other legacy ML badness.
So again, if there are legit models that people want to deploy with IREE where they absolutely cannot change a single line of python then they probably aren't good candidates for bootstrapping. I don't think mucking with types and doing other weird things while treating those inputs as black boxes is going to be worthwhile. Just IMO, but I'm really not enjoying this mess and am having a hard time justifying any effort on this vs all the other things that need to be done, and don't see this as something that should block any practically usable torch models :)
(and agree with a gandalf defence - I'll go add a pass that says "don't use f64" if that lets us table these issues for now :)
Okay, I think I will just XFAIL all the f64-related tests on IREE for now then.
If that works that'd be great - will let us focus on the big blockers. I'd still like to get this all working without surprises and will add a pass to bail on f64 so the errors are clearer. I've also got the backlog of verifiers to add and since these hal ops are introduced really early I'll prioritize making those better first. In the meantime maybe we can motivate someone to do the f64 storage + f32 compute stuff so we could run the f64 stuff and not need the demotion at all.
Thought of a different approach - always allow f64 input and perform the f64<->f32 conversion by reinterpret casting to i64 and shuffling the bits in linalg ops - then we compile as normal and the apps still see the f64 on the boundary. This wouldn't work if your tests are also checking for precision - are they?
(this would mean no ABI changes or anything user visible)
(that also may not work for special cases, but really one shouldn't be sending in -inf and nan and stuff anyway)
Well, I guess that's just the same idea I had before for fixing things up only in a different place and (hopefully) preventing some weirdness. It'd still need the SPIR-V side to support i64 load/store.
Naively:

```mlir
func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %ret0 = arith.addf %arg0, %arg0 : tensor<4xf64>
  return %ret0 : tensor<4xf64>
}
```

->

```mlir
func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %arg0_f32 = arith.truncf %arg0 : tensor<4xf64> to tensor<4xf32>
  %ret0 = arith.addf %arg0_f32, %arg0_f32 : tensor<4xf32>
  %ret0_f64 = arith.extf %ret0 : tensor<4xf32> to tensor<4xf64>
  return %ret0_f64 : tensor<4xf64>
}
```
The issue with this would be that we could not guarantee that the trunc/ext wouldn't move around/get elided/etc. Maybe a named op:
```mlir
func @exported_fn(%arg0: tensor<4xf64>) -> tensor<4xf64> {
  %arg0_f32 = linalg_ext.castf %arg0 : tensor<4xf64> -> tensor<4xf32>
  %ret0 = arith.addf %arg0_f32, %arg0_f32 : tensor<4xf32>
  %ret0_f64 = linalg_ext.castf %ret0 : tensor<4xf32> -> tensor<4xf64>
  return %ret0_f64 : tensor<4xf64>
}
```
We could at least then control how that op behaves, but there's still no guarantee that other ops don't get inserted along the f64 edges though we'd be the ones that'd be inserting them so we could at least fix things ourselves.
Those cast ops would be implemented in something like SPIR-V/32-bit CPUs like: https://sourcegraph.com/github.com/freebsd/freebsd-src/-/blob/lib/libc/softfloat/bits64/softfloat.c?L2504-2526 https://sourcegraph.com/github.com/freebsd/freebsd-src/-/blob/lib/libc/softfloat/bits64/softfloat.c?L1543-1563
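As a host-side illustration of the kind of bit shuffling such a cast op would do, here is a naive truncating f64->f32 conversion on raw bits. This is a hedged sketch that, as noted above, punts on rounding, overflow, denormals, and inf/nan handling (which the linked softfloat code handles in full).

```python
import struct

def f64_bits_to_f32_bits(bits64: int) -> int:
    """Naive f64->f32 conversion on raw bits: rebias the exponent and
    truncate the mantissa. Ignores rounding/overflow/denormals/NaN."""
    sign = (bits64 >> 63) & 0x1
    exp = (bits64 >> 52) & 0x7FF
    mant = bits64 & ((1 << 52) - 1)
    if exp == 0 and mant == 0:
        return sign << 31                      # signed zero
    new_exp = exp - 1023 + 127                 # rebias 11-bit -> 8-bit exponent
    new_mant = mant >> (52 - 23)               # drop the low 29 mantissa bits
    return (sign << 31) | (new_exp << 23) | new_mant

# Round-trip a double through the bit-level cast and decode as float32.
(bits64,) = struct.unpack('<Q', struct.pack('<d', -2.5))
(f32_val,) = struct.unpack('<f', struct.pack('<I', f64_bits_to_f32_bits(bits64)))
print(f32_val)  # -2.5
```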
I still hate it :/ We should really just fully support f64 or bail.
> Thought of a different approach - always allow f64 input and perform the f64<->f32 conversion by reinterpret casting to i64 and shuffling the bits in linalg ops - then we compile as normal and the apps still see the f64 on the boundary. This wouldn't work if your tests are also checking for precision - are they?
> (this would mean no ABI changes or anything user visible)
Our e2e tests all have a general tolerance (1e-5 or some such). They don't do super fine-grained checking that we really have 52 vs 24 bits of precision.
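A sketch of why such a tolerance masks the demotion: `demote_to_f32` below is a hypothetical helper that just round-trips through 4 bytes to model the compiler's f64->f32 demotion, and a 1e-5 relative check cannot distinguish 24 from 52 bits of mantissa.

```python
import math
import struct

def demote_to_f32(x: float) -> float:
    # Model the compiler's f64->f32 demotion by round-tripping through 4 bytes.
    return struct.unpack('<f', struct.pack('<f', x))[0]

reference = 1.0 / 3.0                # what a true-f64 backend would compute
demoted = demote_to_f32(reference)   # what the demoted program returns

assert demoted != reference                            # precision really was lost
assert math.isclose(demoted, reference, rel_tol=1e-5)  # but a 1e-5 check passes
```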
I'll continue noodling on this to see if there's a clean path through. I took a look at making the hal ops update but where the pass happens (util) cannot depend on higher level dialects (hal). Could add an interface but that's more nasty workarounds that shouldn't otherwise exist so hesitant to go that route.
I was looking at the pass -- would it break everything if we also generically convert TypeAttrs too? I think the issue here is a top-level TypeAttr that doesn't get converted. The existing code seems defensive about converting attributes, but the justification seems mostly about attrs with actual values (e.g. 1 : i64), rather than TypeAttr itself.
We could try - there's sketchy corner cases and more implicit assumptions that come along with that (like if coming from tflite types may be wrapped in their custom type attrs, or if sparsity is used the encoding may be related to the element type, etc), but this whole approach is full of those anyway. It's a pool of alligators and if you're fine with taking a swim then what's one more alligator thrown in? I think switching the behavior so we fail outright on f64 unless you pass the demotion flag makes any badness that comes with the approach opt-in - a user should have to choose to jump into the pool -iree-abi-🤿-🐊=true :P
But I'm still against baking in implicit type conversion across the ABI boundary - we really don't want to be advertising one type but accepting another. Relying on inconsistent reflection metadata to do type casting is going to be really hard to unwind when trying to implement f64 for real (we're going to break you and your users, likely repeatedly, and in surprising ways). If you had a flag on your python bindings like torch.iree.whatever.AUTO_F64_TO_F32_CONVERSION = true that could switch the behavior in your interop layer that'd at least give a clear indication that this magic was happening and let us change that independently of what the compiler is doing.
Basically, if the compiler demotes f64 to f32 then the reflection information needs to be consistent and report f32. I still don't see practical uses case where a user would want the badness that would be f64<->f32 conversion to happen synchronously on the host as an implicit operation on the ABI boundary. I'd much rather have users sad because they can't yet use f64 than think we're orders of magnitude slower because it's happening under the covers and someplace we can't easily observe with profiling tools - we've already had some of that with the python layer where the IREE tools are all fine but things are slower in python and they've been really annoying to deal with. I'm also not sure someone who really wanted to use f64 would be happy with the precision being so far off and I'm not looking forward to triaging those reported accuracy issues.
There's also the 🤷 alternative, which mostly works today, which is to allow f64 only when using CUDA or something. CPU may work when targeting 64-bit but will likely require us to add all the softfloat emulation when targeting 32-bit systems. It'll also run afoul of vectorization/SIMD issues, I suspect. More uneven support for a corner case without motivation does not spark joy, though, and keeps the whole thing on the YMMV path.
I think the right approach is for us to do the conversion if required in the compiled code - then we could still have a flag for relaxed precision but don't need ABI changes. We just have to get our compiler handling the conversions right and everything else becomes an optimization problem we can control and have visibility into. I don't think it's a tremendous amount of work and mostly shared with what we need to do with i64, just something difficult to motivate given that f64 is not on the roadmap. If you can convince Lei to put this on his OKRs (SPIR-V is where we have the issues) then we'll have many more options :)
Concrete next steps in my mind would be starting with linalg codegen using f64 and ensuring our targets can handle it. Once they do we can easily reconfigure the input layers to perform relaxed precision demotion without user-visible changes and avoid all this reflection stuff.
(and sorry it feels like a reversal of the original plans - I hadn't considered that we'd be doing the reflection-based implicit conversion - that's my bad!)
> I hadn't considered that we'd be doing the reflection-based implicit conversion - that's my bad!
Sorry, I'm having trouble parsing this -- do you mean that you want to use reflection metadata as in https://github.com/google/iree/pull/8902 to capture the "original" types? I think you managed to convince me that that's not the best approach.
Sorry (again), was more realizing that we had chatted about all this demotion stuff and thought we had it all worked out until the reflection stuff came up and I felt bad for dragging you through my realization process :) I don't believe reflection metadata + implicit conversion in the binding layer is the right way to go.
@benvanik -- did we ever implement the gandalf pass for this? I recently saw some errors related to this deep in VMConversion.
I think we can probably close this after that -- the bug is basically "horrible user experience when attempting to use f64 which is not supported" -- getting it to a nice, deterministic compile time diagnostic seems sufficient to close this for now.
@benvanik Is there any work remaining here? Attempting to triage with labels/priority.
I don't think anything has been done and I'm unlikely to work on it. We have a VerifyInputLegality pass that could be extended here: https://github.com/google/iree/blob/main/compiler/src/iree/compiler/Dialect/Flow/Transforms/VerifyInputLegality.cpp#L24
Propose to add to backlog?
sg
I think it's worth having but I'm not likely to do it - it's a perfect starter task :)
Describe the bug
Miscompile of the following Torch code when invoked with tensor<4xf64> (the same happens for +).
Note: In my original test case, the results come out wrong, rather than hanging, with the following tensor statistics (the full value of the result computed by IREE is [ 0., 136., 0., 0.] instead of [0, 1, 4, 9]).
To Reproduce
Expected result: Not to hang, and to give the result "4xf64=0 1 4 9".