ghost opened this issue 5 years ago
@wyessen Could you provide C2 models for your test-case, so that we can reproduce it locally?
Also, could you check under a debugger which function and argument are failing in the assert?
@opti-mix You should have that info in the backtrace. The highest-level function triggering this is glow::Loader::compile. The call propagates to glow::LLVMIRGen::createCall, and that's where the signal handler gets invoked. The assertion is failing in LLVMIRCodeGen/LLVMIRGen.cpp, line 558.
I'll provide the C2 protos a bit later. Do you need the source code for x-model-runner too?
OK. I think I figured out what the issue is. It is related to size_t.
libjit has a lot of kernels whose parameters have the type size_t. But sizeof(size_t) == 8 on 64-bit systems and sizeof(size_t) == 4 on i386.
So, libjit is currently compiled for x86_64, but its client, i.e. the bundle, is compiled according to the -target command-line parameter, i.e. for i386. As a result, the client thinks that size_t is 32 bits and libjit thinks it is 64 bits. Thus the mismatch.
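A minimal sketch of that mismatch (the kernel name below is made up for illustration, not an actual libjit symbol): the same declaration implies different parameter widths depending on the compilation target, so an x86_64-built libjit and an i386-built bundle disagree.

#include <cstddef>
#include <cstdio>

// A libjit-style kernel declaration using size_t (illustrative name only).
// Built for x86_64, numElements is a 64-bit parameter; built for i386, the
// caller passes a 32-bit value, so the two objects do not match. A fixed-width
// type such as uint32_t or uint64_t would mean the same thing on both targets.
extern "C" void libjit_example_copy(float *dst, const float *src, size_t numElements);

int main() {
  // Prints 8 on x86_64 and 4 on i386.
  std::printf("sizeof(size_t) = %zu\n", sizeof(size_t));
  return 0;
}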
Also, could you check under a debugger which function and argument are failing in the assert?
I meant: what is the callee argument of createCall, i.e. the function being called from the bundle? But it doesn't matter. I could reproduce locally with a different network, as the issue has nothing to do with the specific network, as I explained above.
Right. I expected that also. By the way, this is giving off a strong design smell.
Let's consider possible solutions. All of them are relatively easy to implement:
1) Use fixed-width integer types in libjit instead of size_t. But this could be not so optimal in terms of performance.
2) Have multiple versions of libjit in the same backend, one per possible platform size, e.g. one for x86_64 and one for i386.

By the way, this is giving off a strong design smell.
This is not a super constructive comment.
@opti-mix In order to comment on the proposed solutions, I need to better understand where and how libjit comes in. I haven't dug that deep into Glow. So that would be a question for the contributors to answer, but thanks for looking into this.
I suppose one workaround for now would be to build Glow for i386. Could you confirm? In that case, what is the easiest way to force this (i.e. CMake flags?).
In order to comment on the proposed solutions, I need to better understand where and how libjit comes in.
libjit is a library of kernels. See here: https://github.com/pytorch/glow/tree/master/lib/Backends/CPU/libjit. When Glow generates a bundle, it creates an LLVM function that calls those kernels as required by your C2 model.
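Roughly speaking (this is an illustrative sketch, not Glow's actual generated code; the kernel and function names below are placeholders), the function emitted into the bundle boils down to a sequence of calls into such kernels:

#include <cstdint>

extern "C" {
// Placeholder kernel standing in for a real libjit symbol.
void libjit_example_relu(float *out, const float *in, unsigned numElements) {
  for (unsigned i = 0; i < numElements; i++) {
    out[i] = in[i] > 0.0f ? in[i] : 0.0f;
  }
}

// Placeholder for the entry function Glow emits into the bundle: each lowered
// instruction becomes a call into a kernel, operating on pointers into the
// bundle's weight/activation memory regions.
void my_model_example(uint8_t *mutableWeights, uint8_t *activations) {
  const float *input = reinterpret_cast<const float *>(mutableWeights);
  float *act = reinterpret_cast<float *>(activations);
  libjit_example_relu(act, input, 10);
}
}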
I suppose one workaround for now would be to build Glow for i386. Could you confirm? In that case, what is the easiest way to force this (i.e. CMake flags?).
Yes. One could basically build libjit or the whole Glow for i386 (but then it won't work with x86_64 using this build). One would need to adjust CMake flags. Most likely one could modify this place: https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CMakeLists.txt#L16 to just compile libjit for i386.
@opti-mix Understood, thanks. The name was throwing me off, and I was reluctant to ask whether this was referring to GNU's LibJIT (which would have been strange).
Slightly off topic: the Glow paper talks about reduction from high-level IR to low-level IR, one of the advantages being that backend designers should only have to implement the low-level linear algebra routines. Is libjit the place where those low-level linear algebra routines are implemented? In particular I'm looking at libjit_matmul.cpp.
If that is indeed the case, then it would make sense to go with your second suggested solution: keep different versions of libjit on a per-target basis. We'd need some kind of a mechanism for selecting the correct version based on -target.
Understood, thanks. The name was throwing me off, and I was reluctant to ask whether this was referring to GNU's LibJIT (which would have been strange).
Yeah, the name is a misnomer ;-) It is called like this for historical reasons, because it is used from the CPU/JIT backend. It should probably be renamed to something more reasonable, like libkernels or something like that.
Slightly off topic: the Glow paper talks about reduction from high-level IR to low-level IR, one of the advantages being that backend designers should only have to implement the low-level linear algebra routines. Is libjit the place where those low-level linear algebra routines are implemented? In particular I'm looking at libjit_matmul.cpp.
Yes, libjit is the place where kernels are implemented. And the low-level IR is defined by IR.h. You have IRFunctions, which consist of Instructions. And when the CPU backend lowers this IR for a specific target, it converts instructions into calls of the kernels defined in libjit.
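For a feel of what such a kernel looks like, here is a minimal sketch only (illustrative name and signature, not an actual Glow symbol; the real implementations live in files like libjit_matmul.cpp and are far more optimized). Fixed-width types avoid the size_t ambiguity discussed earlier in this thread.

#include <cstdint>

// Element-wise add in the style of a libjit kernel.
extern "C" void libjit_example_element_add_f(float *dst, const float *lhs,
                                             const float *rhs,
                                             uint32_t numElements) {
  for (uint32_t i = 0; i < numElements; i++) {
    dst[i] = lhs[i] + rhs[i];
  }
}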
If that is indeed the case, then it would make sense to go with your second suggested solution: keep different versions of libjit on a per-target basis. We'd need some kind of a mechanism for selecting the correct version based on -target.
The problem with this is that in theory Glow allows you to compile for any target supported by LLVM. Which exact targets are supported by your installation of LLVM is not known in advance. But libjit is built when you build Glow, not when you compile your model. So, it is unclear how we can build all different target-specific versions of libjit at Glow's build-time if we don't know which targets we need to support. Of course, we could just ask to provide a list of targets as a CMake argument and build only for targets from this list.
The problem with this is that in theory Glow allows you to compile for any target supported by LLVM. Which exact targets are supported by your installation of LLVM is not known in advance. But libjit is built when you build Glow, not when you compile your model. So, it is unclear how we can build all different target-specific versions of libjit at Glow's build-time if we don't know which targets we need to support. Of course, we could just ask to provide a list of targets as a CMake argument and build only for targets from this list.
Point taken. The reason I suggested the second solution is that Glow already supports custom backends (for which clients are responsible), and in this case we have a backend but a potentially custom architecture. Should Glow make clients responsible for the backend architecture also? Glow could provide a few implementations out of the box for different architectures and ask the client to link with the correct library.
In this case, your suggestion to ask for a list of targets sounds like a good one to me. A use case that I have in mind is something like this: an organization deploys neural networks on particular hardware. They build Glow once, set up their build system, and build for their targets going forward (without requiring a Glow rebuild).
Basically, backends are components that are not viewed as part of Glow proper (and they shouldn't be... at least that is my understanding of Glow's design). Similarly, backend kernels that target different hardware architectures should be treated as interchangeable modules.
Aha! So on another note: the libjit_ symbols are not exposed, say through a .h, anywhere. I wonder if it may be a good idea to expose them? For instance, what if I want to write my own implementation for some of those functions, for whatever reason? Or, like in this case, I want to provide my own implementation of the entire libjit (not quite, but... close enough to an actual use case).
On a somewhat related note, I wonder if we'd be able to target devices without an FPU? How does Glow quantization work in this case? If something like that is not supported out of the box, then that is a candidate use case where one may need to heavily modify libjit (and possibly other parts).
Also, looking at CMakeLists.txt in the CPU Backend, it isn't clear to me whether we'd be able to get away with just building the 32 bit version of libjit (or the entire CPU backend) while having the rest of Glow built in 64bit.
So maybe the solution of having a custom libjit per architecture might be too much work for what it's worth. On the other hand, it sounds like a necessity (given the possible use cases outlined above).
Sorry, closed accidentally. Reopened.
On a somewhat related note, I wonder if we'd be able to target devices without an FPU?
If target devices have soft-FP (i.e. emulated in software), Glow works just fine. If target devices do not support FP in any form, Glow can work with models that are completely quantized.
How does Glow quantization work in this case?
Quantization happens when Glow compiles the network using a quantization profile (e.g. when using the -load-profile option) and produces a bundle, not when the bundle runs. All uses of FP kernels are replaced by quantized kernels, all weights are quantized, etc. So, no FP is required by the resulting bundle (maybe it still needs FP to quantize the input at the very beginning and dequantize the output at the very end, but it is relatively easy to get around that by doing it e.g. on the host side).
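As a sketch of that host-side workaround (the scale/offset values are per-tensor parameters that Glow determines during profiling; the helper names here are made up):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a float input value to int8 before feeding a fully quantized bundle.
static int8_t quantizeToInt8(float v, float scale, int32_t offset) {
  int32_t q = static_cast<int32_t>(std::round(v / scale)) + offset;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

// Dequantize an int8 output value back to float after inference.
static float dequantizeFromInt8(int8_t q, float scale, int32_t offset) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - offset);
}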
Also, looking at CMakeLists.txt in the CPU Backend, it isn't clear to me whether we'd be able to get away with just building the 32 bit version of libjit (or the entire CPU backend) while having the rest of Glow built in 64bit.
I think it should work. libjit is used for something that will run on the device. Glow itself usually runs on the host (though it could run on the device if there is a wish to do so) and it shouldn't matter if Glow itself is 64bit or not.
Regarding providing your own kernels: it is totally possible. The usual way to do this would be to create a new backend which is derived from the class LLVMBackend, just like CPUBackend. This new backend can add/remove/modify any libjit kernels and change some/all parts related to the LLVM IR code generation. Basically, each backend derived from LLVMBackend can provide its own version of libjit.
Also, looking at CMakeLists.txt in the CPU Backend, it isn't clear to me whether we'd be able to get away with just building the 32 bit version of libjit (or the entire CPU backend) while having the rest of Glow built in 64bit.
I just checked. Adding -target i386 to the mentioned CMake file solves the issue for i386. It required some minor tweaks for include files used by libjit, as I don't have i386 cross-compiler include files on my system. With this change I was able to produce an i386 bundle without any issues.
@opti-mix Great, thank you! Lots of useful info. Could you please provide the details of your CMake modification so that I can try to reproduce your success exactly?
Great, thank you! Lots of useful info. Could you please provide the details of your CMake modification so that I can try to reproduce your success exactly?
For CMake I just added the line -target i386 to the options:
set(CPURunttimeCompilationOptions
    -std=c++14
    -ffast-math
    -fno-finite-math-only
    -g0
    -target i386   # <-- the added line
    -emit-llvm
    -O0)
Try to compile, and there will be a couple of complaints about missing include files or missing definitions of some functions, but this is trivial to fix. You can comment out some of the places where they are used, as they are not really important. You can also replace some C++ includes with POSIX C includes. After that everything should compile.
Yeah, I can confirm that it worked correctly. I think we can close the issue with the following comments:
The client must make sure that the backend kernels that they are using are compatible with the architecture they are compiling for. For instance, for the CPU, this entails building libjit for the specific architecture.
The client may then use the -target option (and the -mcpu option with the CPU backend) with the client's model runner to emit bundles for the specific target/architecture.
Could you confirm that the -mcpu is precisely the same as it is in LLVM proper (that is, it isn't an alias for something Glow-specific)?
Could you confirm that the -mcpu is precisely the same as it is in LLVM proper (that is, it isn't an alias for something Glow-specific)?
Yes. You can see in the sources of LLVMBackend.cpp in Glow that -mcpu is just passed down to LLVM's initTargetMachine method. Glow does not do anything fancy with this option.
Yeah, I can confirm that it worked correctly.
Awesome!
I think we can close the issue with the following comments: The client must make sure that the backend kernels that they are using are compatible with the architecture they are compiling for. For instance, for the CPU, this entails building libjit for the specific architecture.
The client may then use the -target option (and the -mcpu option with the CPU backend) with the client's model runner to emit bundles for the specific target/architecture.
Any improvements of the Glow docs are very welcome! Would you be willing to submit a small PR to update AOT.md with this cross-compilation-related information?
Thank you! Well, as a small service fee, I can definitely help improve the docs. Give me a few days, though. I should have some time this weekend.
Let's keep this issue open until the PR is in, just in case.
For the armv7l-unknown-linux-gnueabihf target, I also (on top of the suggested change for CPURunttimeCompilationOptions) had to modify the hardcoded sizes of 8 (to 4) in BundleSaver::emitSymbolTable() and BundleSaver::emitBundleConfig(). Compilation would go fine without that change, but execution would fail.
Regarding the modification of missing include files etc., the following change looks less intrusive (same Backends/CPU/CMakeLists.txt): ... add_custom_command( OUTPUT ${libjit_obj_file}
@djordje-dj-s I cannot confirm your CMake fix. What system are you on? Clang should select the correct stdlib automatically; most of the time there is no need for an explicit -stdlib= flag. At any rate, this didn't work for me with -target i386.
Regarding fixed sizes: yeah, that makes sense. In fact, I think a good fix would be to replace the fixed width types with std::size_t. In that case, in the emitted bundle, the structs (BundleConfig and SymbolTableEntry) will contain the correctly-sized members. @opti-mix can you confirm?
@opti-mix By the way, what is the reason for such a large alignment requirement (64 bytes, hard-coded)? Was this meant to be 8 bytes (64 bits)? That is, is this meant to be the strictest scalar type alignment (sizeof(long double), for instance, where supported)? If not, what affects the calculation of the alignment member in BundleConfig?
I don't see any logic for its calculation in the code. This is what lives in Memory.h:
/// The tensor payload is allocated to be aligned to this value.
constexpr unsigned TensorAlignment = 64;
And then it is used in building the BundleConfig structure in BundleSaver::emitBundleConfig:
llvm::ConstantInt::get(uint64TType, TensorAlignment)
@wyessen, I am on Ubuntu 18.04 LTS, building for ARMv7. Also building with the LLVM dependency (in a custom folder).
@opti-mix Looks like these should depend on target, cpu, arch, and targetFeatures:
auto *bundleConfigTy =
    llvm::StructType::get(irgen_->getLLVMContext(),
                          {uint64TType, uint64TType, uint64TType, uint64TType,
                           uint64TType, symbolTableEntryTy->getPointerTo()});
and
auto symbolTableEntryTy = llvm::StructType::get(
    irgen_->getLLVMContext(),
    {charTy->getPointerTo(), uint64TTy, uint64TTy, charTy});
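For reference, the client-side mirror of that second LLVM struct would look roughly like this (field names are inferred from the types above; the lenet_mnist/resnet50 bundle examples have the authoritative definition):

#include <cstdint>

// Rough mirror of {char*, uint64, uint64, char} as seen from the inference code.
struct SymbolTableEntry {
  const char *name; // name of the weight variable
  uint64_t offset;  // offset into the corresponding memory area
  uint64_t size;    // size of the variable (see the bundle docs for the units)
  char kind;        // constant vs. mutable weight
};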
@djordje-dj-s
@wyessen, I am on Ubuntu 18.04 LTS, building for ARMv7. Also building with the LLVM dependency (in a custom folder).
OK, I need to jump on my Ubuntu VM to confirm. But this didn't work for me on OS X, so even if it works for you on Ubuntu, it looks like it isn't a universal solution.
Another hard-coded value (8 -> 4) is definitely not a good solution - this was done for investigation purposes only. Using the target's sizeof() is one step closer, but as you indicated, more details need to be taken into account.
As one more data point, in order to use a custom llvm/clang on my Ubuntu (e.g. I have llvm-6 installed system-wide, but am using llvm-8 for Glow), I had to make one more temporary change (in the same CMakeLists.txt):

...
if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
  set(CLANG_BIN ${CMAKE_CXX_COMPILER})
else()
  find_program(CLANG_BIN clang++)
+ set(CLANG_BIN "/home/djordje/GLOW/llvm_install/bin/clang++")
endif()
@wyessen
By the way, what is the reason for such a large alignment requirement (64 bytes, hard-coded)? Was this meant to be 8 bytes (64 bits)?
SIMD processing of tensors using e.g. AVX, SSE, etc. You want to process e.g. multiple FP or int8 values at once. Some of those instructions have certain requirements on the alignment of data to get the best performance.
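As a small illustration (assuming a POSIX host; nothing here is Glow-specific), honoring that alignment when allocating the memory regions described by BundleConfig could look like this:

#include <stdlib.h>

// Allocate a memory region with the alignment the bundle asks for
// (BundleConfig::alignment, i.e. TensorAlignment == 64 above), so SIMD kernels
// operate on properly aligned tensors. Returns nullptr on failure; free() when done.
static void *allocateAligned(size_t size, size_t alignment) {
  void *ptr = nullptr;
  if (posix_memalign(&ptr, alignment, size) != 0) {
    return nullptr;
  }
  return ptr;
}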
Regarding fixed sizes: yeah, that makes sense. In fact, I think a good fix would be to replace the fixed width types with std::size_t. In that case, in the emitted bundle, the structs (BundleConfig and SymbolTableEntry) will contain the correctly-sized members.
BundleConfig and SymbolTableEntry used to have size_t members. We decided to use fixed width ints later, so that you can read these tables in the same way on all platforms.
@wyessen Have you seen #2847? This makes your use-case work without any CMake hacks.
@opti-mix Yep, that should do it. But I think the issues in BundleSaver.cpp still need to be addressed in a similar way.
@opti-mix Oh, and by the way, this doesn't completely resolve things. libjit still needs to be compiled for the target, in the same way that bundles are compiled. So the -target option is still required for libjit (along with any other options like -mcpu, -mfloat-abi, and so on). The only way to avoid this is to not compile libjit into a library, but instead link with libjit's sources when compiling bundles.
Basically, we cannot cross-link objects compiled for different architectures, even if the sizes of scalar types match.
@wyessen
Yep, that should do it. But I think the issues in BundleSaver.cpp still need to be addressed in a similar way.
I'm not sure I understand the issue with BundleSaver.cpp. What is the exact problem?
@opti-mix
@wyessen
Yep, that should do it. But I think the issues in BundleSaver.cpp still need to be addressed in a similar way.
I'm not sure I understand the issue with BundleSaver.cpp. What is the exact problem?
I mean things like uint64TType in BundleSaver::. Are they guaranteed to collapse to supported types when compiling for a particular architecture? That's the problem that @djordje-dj-s encountered (I haven't had the chance to try it out yet myself).
I mean things like uint64TType in BundleSaver::. Are they guaranteed to collapse to supported types when compiling for a particular architecture? That's the problem that @djordje-dj-s encountered (I haven't had the chance to try it out yet myself).
Yes, uint64TType is defined as a fixed width type:
auto *uint64TType = irgen_->getBuilder().getIntNTy(sizeof(uint64_t) * 8);
It will have the same size on any platform.
@opti-mix Exactly. So that creates an integer 64 bits wide. Now in my bundle .o file, I have BundleConfig containing 64-bit integer fields. Is that correct? If so, how would this be supported on a 32-bit platform?
In particular, when I do inference, I need something like this:
struct BundleConfig
{
uint64_t constantWeightVarsMemSize{};
uint64_t mutableWeightVarsMemSize{};
uint64_t activationsMemSize{};
uint64_t alignment{};
uint64_t numSymbols{};
const SymbolTableEntry *symbolTable{};
};
extern "C" BundleConfig my_bundle_config
So on a 64-bit platform, that is not a problem - I can deserialize my_bundle_config just fine by interpreting it as BundleConfig (as defined above).
Now, what if I'm on a 32bit platform?
Now, what if I'm on a 32bit platform?
My understanding is that on a 32-bit platform uint64_t is defined anyway (probably as an alias for long long unsigned int or something like that). So, there should be no problem for a compiler to work with uint64_t. Or are you telling me uint64_t is not supported on 32-bit platforms?
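A quick way to convince yourself of that (a trivial check, nothing Glow-specific):

#include <cstdint>
#include <cstdio>

int main() {
  // On i386 this prints 8 and 4: uint64_t keeps its width even though
  // pointers (and size_t) shrink to 32 bits.
  std::printf("sizeof(uint64_t) = %zu, sizeof(void *) = %zu\n",
              sizeof(uint64_t), sizeof(void *));
  return 0;
}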
No, forget what I said. I'm overthinking things now. We should be OK.
It seems that the root cause is that I had not kept the structures in my inference code up to date with glow/examples/bundles/lenet_mnist.cpp (and resnet50.cpp). Earlier, I worked around the problem (size_t on the host not being equal to size_t on the target) with changes in BundleSaver.cpp. But the new definition of the SymbolTableEntry and BundleConfig structures includes members that are not size_t anymore but int64_t. This is what I missed updating on my end (on the inference side).
It seems that the root cause is that I had not kept the structures in my inference code up to date with glow/examples/bundles/lenet_mnist.cpp (and resnet50.cpp). Earlier, I worked around the problem (size_t on the host not being equal to size_t on the target) with changes in BundleSaver.cpp. But the new definition of the SymbolTableEntry and BundleConfig structures includes members that are not size_t anymore but int64_t. This is what I missed updating on my end (on the inference side).
Yep. @opti-mix is right, there is no issue now. I wasn't able to reproduce your issue either.
Just checked the tip of master (including the #2847 merge), including execution on the ARMv7 platform, and both AOT examples work OK.
@djordje-dj-s Are you compiling on the target platform, or are you cross-compiling for the target platform?
@wyessen , I was compiling inference example, and linking with network object code, on the target platform.
@djordje-dj-s OK, then this is different from the original problem discussed here. I've been trying to cross-compile the inference program for the target.
Just to be completely clear: the bundle (e.g. resnet50.o, resnet50.weights) was created on the host platform (Ubuntu 18.04, x86_64):

../bin/image-classifier ../tests/images/imagenet/zebra_340.png -model ../resnet50/model.onnx -use-imagenet-normalization -image-mode=0to1 -relocation-model=pic -model-input-name="gpu_0/data_0" -cpu -target armv7l-unknown-linux-gnueabihf -emit-bundle ./bundle_dir/ -network-name=resnet50

The object file (e.g. resnet50.o) was then linked on the target, with the inference application also compiled on the target (ARMv7):

g++ main.cpp resnet50.o -lm -lpng -fpack-struct=8 -fpermissive -o infer_resnet50
@djordje-dj-s Ok, thanks. Yeah, I am trying to compile the inference app on host for the target, and am doing linking on host as well.
I think this ticket can now be closed. @opti-mix ?
@wyessen I'm OK with closing it. But you wanted to add something to the docs to clarify some aspects of building/using Glow for cross-compilation scenarios. Would you still be willing to improve the docs on this?
I have tried emitting bundles for targets other than x86_64 to no avail. Even i386 is failing. It has something to do with tensor types. I suppose different targets expect/support different tensor types; however, I made sure that my network is using Float32. So this shouldn’t be a problem for i386. The network looks like this:
The traceback that I get when trying to build is:
In case there are any questions about x-model-runner: it is a custom runner that I built based on the image-classifier. The runner itself is working as expected for x86_64 -- it emits bundles that I'm able to link to and run inference.