JunningWu closed this issue 6 years ago.
Here is the Error.
/# ./nvdla_runtime --loadable output.protobuf
creating new runtime context...
[ 1181.040928] nvdla_runtime[1287]: unhandled level 1 translation fault (11) at 0x41ffb65a, esr 0x92000005, in libnvdla_runtime.so[ffffb54a2000+21000]
[ 1181.047469] CPU: 0 PID: 1287 Comm: nvdla_runtime Not tainted 4.13.3 #1
[ 1181.054489] Hardware name: linux,dummy-virt (DT)
[ 1181.055224] task: ffff80003db40e00 task.stack: ffff80003d16c000
[ 1181.055792] PC is at 0xffffb54b7d98
[ 1181.056133] LR is at 0xffffb54b3984
[ 1181.056400] pc : [<0000ffffb54b7d98>] lr : [<0000ffffb54b3984>] pstate: 80000000
[ 1181.056903] sp : 0000ffffd6afd030
[ 1181.057213] x29: 0000ffffd6afd030 x28: 0000000000000000
[ 1181.067488] x27: 0000000000000000 x26: 0000000000000000
[ 1181.067886] x25: 0000000000000000 x24: 0000000000000000
[ 1181.068077] x23: 0000000000000000 x22: 0000000000000000
[ 1181.071035] x21: 0000000039db2190 x20: 0000000000000000
[ 1181.071292] x19: 0000000041ffb65a x18: 0000000000000000
[ 1181.071481] x17: 0000ffffb54d3fb0 x16: 0000ffffb54b7d98
[ 1181.071732] x15: 0000000000000111 x14: 00000000000003f3
[ 1181.084578] x13: 0000000000000000 x12: 0000ffffb549f968
[ 1181.088542] x11: 0000000000000022 x10: 0000000000000007
[ 1181.093569] x9 : 0000000000001500 x8 : 0000000000000003
[ 1181.097488] x7 : 0000000000000001 x6 : 0000000039db2330
[ 1181.100436] x5 : 0000000000000041 x4 : 0000000000000001
[ 1181.111960] x3 : 0000ffffb54d4928 x2 : 0000ffffb54b3950
[ 1181.112223] x1 : 0000000000000006 x0 : 0000000041ffb65a
Segmentation fault
Anyone may help???
/# gdb
-sh: gdb: not found
@wujunning2011 I think your input loadable file is wrong. You can try any file and you will get the same error.
I used the loadable file from the docker image, but got an out-of-bounds error. Any suggestions?
./nvdla_runtime --loadable BDMA_L0_0_fbuf
creating new runtime context...
libnvdla<1> failed to open dla device
libnvdla<1> Out of bounds DLA instance 0 requested.
(DLA_TEST) Error 0x00000004: runtime->load failed (in RuntimeTest.cpp, function loadLoadable(), line 253)
(DLA_TEST) Error 0x00000004: (propagating from RuntimeTest.cpp, function run(), line 377)
(DLA_TEST) Error 0x00000004: (propagating from main.cpp, function launchTest(), line 92)
@xmchen1987 Have you installed the drivers (drm.ko, opendla.ko) first?
@xmchen1987 can you share your loadable file?
@jarodw0723 When I install drm.ko/opendla.ko, I get a level 2 translation fault:
nvdla_runtime[1306]: unhandled level 2 translation fault (11) at 0x2bf2565a, esr 0x92000006
@wujunning2011 You can find the loadable file in https://github.com/nvdla/sw/tree/master/regression/flatbufs/kmd
@jarodw0723 After installing the drivers, it succeeds. Thanks a lot. @wujunning2011 As @jarodw0723 mentioned, I used the prebuilt files in https://github.com/nvdla/sw/tree/master/regression/flatbufs/kmd
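For reference, the fix above condensed into one sequence: the kernel modules must be loaded before invoking the runtime. File names are as they appear in the VP image directory; treat the exact paths as assumptions for your setup:

```shell
# inside the virtual platform, from the directory containing the modules
insmod drm.ko                       # DRM support for the NVDLA device node
insmod opendla.ko                   # NVDLA KMD; logs "reset engine done" on success
export LD_LIBRARY_PATH=$PWD         # so nvdla_runtime can find libnvdla_runtime.so
./nvdla_runtime --loadable CONV_D_L0_0_fbuf   # any prebuilt flatbuf from sw/regression
```

Without the modules loaded, the runtime falls over as shown in the logs above ("failed to open dla device" or a translation fault).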
@jarodw0723 Is the VP able to dump performance data currently? Or just for software development?
@xmchen1987 It is just for software development.
@jarodw0723 @xmchen1987 When I run in Docker mode, it is OK. I wonder how to generate my own loadable file using nvdla_compiler, e.g. from AlexNet.caffemodel.
@jarodw0723 I see the cmod currently has an interface like:
NV_NVDLA_cmac::NV_NVDLA_cmac( sc_module_name module_name ):
    NV_NVDLA_cmac_base(module_name),
    // Delay setup
    dmadelay(SC_ZERO_TIME),
    csbdelay(SC_ZERO_TIME),
    b_transportdelay(SC_ZERO_TIME)
Do you have plans to develop the cmod into a performance model?
I have the same problem as wujunning2011. If the loadable file is wrong, how can I get a loadable file using nvdla_compiler, e.g. from AlexNet.caffemodel? I have read the UMD code and know that nvdla_runtime needs a flatbuffer loadable file, but the output format of nvdla_compiler is protobuf. Should I convert the format, or is there an option for nvdla_compiler that makes it create a flatbuffer file?
+1 same request to compile and run a custom model.
Besides, @jarodw0723, is there any schedule for when the performance profiling function will be ready?
@wujunning2011 @geyijun @blueardour output.protobuf is not the target loadable file. I used default.nvdla, and succeeded in running the test.
@xmchen1987 Thank you very much. Maybe default.nvdla is the loadable file.
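To sketch the full compile-and-run flow being asked about: a possible nvdla_compiler invocation, written from memory — the flag names and output file name are assumptions, so check `./nvdla_compiler -h` first:

```shell
export LD_LIBRARY_PATH=$PWD     # for libnvdla_compiler.so
./nvdla_compiler --prototxt lenet.prototxt --caffemodel lenet.caffemodel
# the flatbuffer loadable is written out as default.nvdla;
# output.protobuf is not the runtime input
./nvdla_runtime --loadable default.nvdla
```

The key point from the thread: feed the runtime the .nvdla file, not output.protobuf.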
@xmchen1987 May I know more about your "default.nvdla" test? What network and what command do you use to test "default.nvdla"?
@xmchen1987 Thanks.
Hi,
It gets stuck when I run nvdla_runtime --loadable default.nvdla. Any clue?
................................
Welcome to Buildroot
nvdla login: root
Password:
# mount -t 9p -o trans=virtio r /mnt
# cd /mnt/
# ls
CMakeCache.txt README.md install_manifest.txt
CMakeFiles aarch64_toplevel libs
CMakeLists.txt cmake models
CPackConfig.cmake cmake_install.cmake scripts
CPackSourceConfig.cmake conf src
LICENSE docker tests
Makefile images
# cd images/
# cd linux-4.13.3/
# ls
??%@@???@8 drm.ko nvdla_runtime
CONV_D_L0_0_fbuf efi-virtio.rom opendla.ko
Image libnvdla_compiler.so rootfs.ext4
aarch64_nvdla.lua libnvdla_runtime.so
alexnet nvdla_compiler
# insmod drm.ko
# insmod opendla.ko
[ 35.852221] opendla: loading out-of-tree module taints kernel.
[ 35.863261] reset engine done
[ 35.872695] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# export LD_LIBRARY_PATH=$PWD
# ./nvdla_runtime --loadable alexnet/default.nvdla
creating new runtime context...
[ 55.045834] random: crng init done
^C^C^X^C^C^C # stuck here
................................
Also tried again:
................................
Welcome to Buildroot
nvdla login: root
Password:
# mount -t 9p -o trans=virtio r /mnt
# cd /mnt/images/linux-4.13.3/
# export LD_LIBRARY_PATH=$PWD
# cd alexnet/
# ./../nvdla_runtime --loadable default.nvdla
creating new runtime context...
[ 48.162132] random: crng init done
^C
# cd ..
# insmod drm.ko
# insmod opendla.ko
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# dmesg| tail
[ 1.893330] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
[ 1.912059] devtmpfs: mounted
[ 2.054710] Freeing unused kernel memory: 1088K
[ 2.232956] EXT4-fs (vda): re-mounted. Opts: data=ordered
[ 3.830183] NET: Registered protocol family 10
[ 3.848857] Segment Routing with IPv6
[ 48.162132] random: crng init done
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# cd alexnet/
# ./../nvdla_runtime --loadable default.nvdla
creating new runtime context...
^C # stuck here
# dmesg| tail
[ 1.893330] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
[ 1.912059] devtmpfs: mounted
[ 2.054710] Freeing unused kernel memory: 1088K
[ 2.232956] EXT4-fs (vda): re-mounted. Opts: data=ordered
[ 3.830183] NET: Registered protocol family 10
[ 3.848857] Segment Routing with IPv6
[ 48.162132] random: crng init done
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
................................
@blueardour I suppose it's because AlexNet is too huge; it may take about 20 minutes to create the context. You can try LeNet first.
@wujunning2011 Hi, thanks for your tips.
May I ask whether you have ever successfully run AlexNet?
Based on your comment, I left the program running. After 14 hours, it seems the simulator still has not finished execution.
..................................
# insmod opendla.ko
[ 43.227254] opendla: loading out-of-tree module taints kernel.
[ 43.239206] reset engine done
[ 43.248391] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# export LD_LIBRARY_PATH=$PWD
# ./nvdla_runtime --loadable alexnet/default.nvdla
creating new runtime context...
[ 72.144474] random: crng init done
Unknown image type: submitting tasks...
[ 7082.524154] Enter:dla_read_network_config
[ 7082.528186] Exit:dla_read_network_config status=0
[ 7082.528669] Enter: dla_initiate_processors
[ 7082.531573] Enter: dla_submit_operation
[ 7082.532029] Prepare Convolution operation index 0 ROI 0 dep_count 1
[ 7082.532483] Enter: dla_prepare_operation
[ 7082.535457] processor:Convolution group:0, rdma_group:0 available
[ 7082.536056] Enter: dla_read_config
[ 7082.543696] Exit: dla_read_config
[ 7082.544123] Exit: dla_prepare_operation status=0
[ 7082.544593] Enter: dla_program_operation
[ 7082.546769] Program Convolution operation index 0 ROI 0 Group[0]
[ 7082.555487] no desc get due to index==-1
[ 7082.556460] no desc get due to index==-1
[ 7082.558436] no desc get due to index==-1
[ 7082.558787] no desc get due to index==-1
........................
[ 7083.737498] Exit: dla_op_programmed
[ 7083.737643] Exit: dla_program_operation status=0
[ 7083.737814] Exit: dla_submit_operation
[ 7083.737961] Enter: dla_dequeue_operation
[ 7083.738115] Dequeue op from CDP processor, index=18 ROI=0
[ 7083.738301] Enter: dla_submit_operation
[ 7083.738456] Prepare CDP operation index 18 ROI 0 dep_count 1
[ 7083.738651] Enter: dla_prepare_operation
[ 7083.738871] processor:CDP group:1, rdma_group:1 available
[ 7083.739062] Enter: dla_read_config
[ 7083.741600] Exit: dla_read_config
[ 7083.741748] Exit: dla_prepare_operation status=0
[ 7083.741936] Enter: dla_program_operation
[ 7083.742096] Program CDP operation index 18 ROI 0 Group[1]
[ 7083.742494] Enter: dla_cdp_program
[ 7083.742563] Enter: processor_cdp_program
[ 7083.753187] Exit: processor_cdp_program
[ 7083.753201] Exit: dla_cdp_program
[ 7083.753356] no desc get due to index==-1
[ 7083.753615] no desc get due to index==-1
[ 7083.753760] no desc get due to index==-1
[ 7083.753910] no desc get due to index==-1
[ 7083.754058] no desc get due to index==-1
[ 7083.754210] no desc get due to index==-1
[ 7083.754362] Enter: dla_op_programmed
[ 7083.754505] Exit: dla_op_programmed
[ 7083.754649] Exit: dla_program_operation status=0
[ 7083.754817] Exit: dla_submit_operation
[ 7083.754966] Exit: dla_dequeue_operation
[ 7083.755133] Exit: dla_initiate_processors status=0
[ 7083.755376] Enter:dla_handle_events, processor:BDMA
[ 7083.755620] Exit:dla_handle_events, ret:0
[ 7083.755800] Enter:dla_handle_events, processor:Convolution
[ 7083.756012] Handle cdma weight done event, processor Convolution group 0
[ 7083.756260] Exit:dla_handle_events, ret:0
[ 7083.756416] Enter:dla_handle_events, processor:SDP
[ 7083.756592] Exit:dla_handle_events, ret:0
[ 7083.756758] Enter:dla_handle_events, processor:PDP
[ 7083.756937] Exit:dla_handle_events, ret:0
[ 7083.757092] Enter:dla_handle_events, processor:CDP
[ 7083.757269] Exit:dla_handle_events, ret:0
[ 7083.757422] Enter:dla_handle_events, processor:RUBIK
[ 7083.757602] Exit:dla_handle_events, ret:0
As for your suggestion to try LeNet: the computing complexity of AlexNet is about 1G MACs according to the Netscope CNN analyzer tool. However, most of the networks I focus on are bigger than AlexNet. Thus, if the simulator is this slow, it may be unacceptable for running my own networks.
@blueardour My AlexNet run was also not successful; it got stuck somewhere as well. According to the NVDLA VP configuration file, the system memory is 1MB, so this may affect the AlexNet run.
With such a huge NN, I suggest you use Cadence's Protium or Synopsys's ZeBu.
BTW, when I run the tiny LeNet, there are still some errors; I hope you can try LeNet and give me some help.
hi, @wujunning2011 sorry for the late reply. After trying LeNet, I also failed to run it successfully.
Hi, has anyone found a solution to this? I am having the same issue, and it seems that it is not a system virtual-memory issue. Any help is appreciated. Thanks!
@JunningWu The NVDLA VP configuration should be using 1GB of system memory; which config file are you checking?
are you able to run LeNet?
Hi, still having issues with AlexNet -- running it with 1GB system mem. I tried running with the latest NVDLA updates (with these, we can load a .jpg image format). Please see the attached log file for more info. There are some error messages that are ignored, and it hangs at the last point shown in the log file:
20180205_pascalvoc_BoatRes227x227.jpg.log
Regarding LeNet: I was able to run it all the way through without any issues (here, the input file format used is .pgm).
I am able to reproduce it; created #21 for debugging the AlexNet failure.
@ned-varnica
Thanks, I ran it again (and with 2GB system memory) and that particular line is gone, but the run-time problem remains: It hangs in the same place. Please see the attached log file. 20180207_pascalvoc_BoatRes227x227.jpg.log
@ned-varnica I am suspecting some problem with cmod, debugging it with our HW team
@ned-varnica According to your log file, L1298, "Assertion Failed": maybe some engine error happened after processing 24 HWLs. This error is the same as in the 1GB RAM case.
So, may I conclude that increasing the RAM size from 1GB to 2GB resolves the rcu_preempt error?
BTW, what does the first line "random: crng init done" mean?
Hi @ned-varnica, I am trying to reproduce the AlexNet issue. The log shows that CACC is not in the idle state as expected. Are you using a cmod built from the latest version of the nvdla1 branch? Could you share your generated flatbuf file? Thanks.
Hi everyone, I'm working with @ned-varnica on the same project.
@prasshantg Thanks for the update.
@JunningWu Not sure if we can draw that conclusion. We may need to repeat the run a few times with the 2GB RAM config to see if the CPU stall occurs again. "random: crng init done" is a message from the kernel random number generator driver.
@fanqifei I built the cmod from a clone of the 'nvdlav1' branch in December. The last commit I see is from 12/12/17 "new lsd design" (d9eefc7). I'm not sure what you mean by flatbuf file. I tried to attach the NVDLA loadable binary we generated from the nvdla_compiler but the file is too big even after compression.
@qdchau , can you send it to efan@nvidia.com? flatbuf file is the loadable file generated by nvdla_compiler.
@qdchau @ned-varnica I can't reproduce the test hang. The AlexNet test passes with a change in cmod/include/log.h (see below; this change seems unrelated to the hang issue). The hw nvdla1 and vp versions are the latest (docker is not used). I will try using docker later.
+static char msg_buf[MSG_BUF_SIZE];
+
#define cslDebugInternal(lvl, ...) do {\
- char msg_buf[MSG_BUF_SIZE]; \
int pos = snprintf(msg_buf, MSG_BUF_SIZE, "%d:", __LINE__); \
snprintf(msg_buf + pos, MSG_BUF_SIZE - pos, __VA_ARGS__); \
SC_REPORT_INFO_VERB(__FILENAME__, msg_buf, SC_DEBUG ); \
@@ -34,7 +35,6 @@
#define cslDebug(args) cslDebugInternal args
#define cslInfoInternal(...) do {\
- char msg_buf[MSG_BUF_SIZE]; \
int pos = snprintf(msg_buf, MSG_BUF_SIZE, "%d:", __LINE__); \
snprintf(msg_buf + pos, MSG_BUF_SIZE - pos, __VA_ARGS__); \
SC_REPORT_INFO_VERB(__FILENAME__, msg_buf, SC_FULL ); \
@@ -42,7 +42,6 @@
#define cslInfo(args) cslInfoInternal args
#define FAILInternal(...) do {\
- char msg_buf[MSG_BUF_SIZE]; \
int pos = snprintf(msg_buf, MSG_BUF_SIZE, "%d:", __LINE__); \
snprintf(msg_buf + pos, MSG_BUF_SIZE - pos, __VA_ARGS__); \
SC_REPORT_INFO(__FILENAME__, msg_buf ); \
Hi @fanqifei, for clarification are you working with prasshantg to debug or reproducing the issue independently? Thanks for sharing the source change. Do you recommend we add that code, rebuild the model, and try again? I tried to e-mail you the AlexNet loadable binary but it's too big for an e-mail attachment. We compiled it using the Caffe model and prototxt file from Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo#pascal-voc-2012-multilabel-classification-model
Hi @qdchau , I am working with Prashant. I can reproduce the issue now. Looking into it.
Hi @fanqifei, thank you for looking into this issue. We look forward to your feedback. Please let us know if there is any other information you need from us at this stage.
FYI -- some of the team is out for CNY this week. I'll follow up to see who's around, but expect a little more latency on this one. Thanks!
We have resolved this issue and will push the fix to KMD. Waiting for some verification results.
Great, thanks. Much appreciated!
Hi @prasshantg, @jwise. Would it be possible to ask for ballpark estimate of when the AlexNet fix will be available so we can update our team’s schedule?
@qdchau 5th Mar 2018
Awesome. Thank you!
@qdchau @ned-varnica @JunningWu fix for alexnet pushed. please test it.
Hi @prasshantg. The fix works for us. Thanks for your help!
Thanks so much @prasshantg. Should we be expecting the correct output at this point? We tried this AlexNet with some images and got outputs that look like noise (negative values close to 0). On the other hand, when we run the same network on our local CPU we get very good prediction with same input images (1 out of 20 output values is a large positive number, and this matches the correct label). Do you have any recommendation how to proceed with debugging? Thanks!
@ned-varnica I think the rawdump file will contain 1000 predictions, like this http://ddl.escience.cn/f/Qdtr. By the way, I am using the BVLC trained model. and the input image is http://ddl.escience.cn/f/Qdts.
@JunningWu do you get expected results?
@prasshantg I am trying to figure out whether the result is indicating "CAT". The simulation process is ok, no more errors.
@JunningWu In the example we are running, it has 20 outputs. The network was taken from Caffe Model Zoo http://heatmapping.org/files/bvlc_model_zoo/pascal_voc_2012_multilabel/deploy_x30.prototxt
It was trained on the following 20 categories:
In your example, looking at the rawdump file, it seems you are seeing the same issue as we do. All the entries (in your case 1000 of them, in our case 20 of them) show very small values and nothing stands out. At least, this is our experience so far.
@prasshantg Attached is one JPG image we used and the corresponding rawdump file.
This could be due to the missing mean-subtraction feature in the compiler. Let me confirm it.
Thanks @prasshantg. I agree this is part of it, but there is probably more to it. FYI, I tried removing mean subtraction in our local simulator (just to test this hypothesis) and the result still looks OK: it can still produce outputs showing that 'Boat' is much more likely than the other 19 outputs. The confidence is worse (compared to when the appropriate means are used), but looks fine. On the other hand, the outputs we get in the file 20180306_pascalvoc_BoatRes227x227.jpg.dimg.txt (please see previous message) do not show this behavior.
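For anyone reproducing the experiment above: Caffe-style mean subtraction is just a per-channel offset applied to every input pixel before inference. A tiny sketch — the mean values here are the common ImageNet BGR means, an assumption; use the values your model was trained with:

```python
# Hypothetical Caffe-style per-channel mean subtraction.
BGR_MEAN = (104.0, 117.0, 123.0)  # assumed means, not this model's actual values

def subtract_mean(pixels, mean=BGR_MEAN):
    """pixels: list of (b, g, r) tuples; returns mean-subtracted values."""
    return [tuple(c - m for c, m in zip(px, mean)) for px in pixels]

print(subtract_mean([(110.0, 120.0, 130.0)]))  # -> [(6.0, 3.0, 7.0)]
```

Skipping this step only shifts the input distribution, which typically degrades confidence rather than producing pure-noise outputs, consistent with the observation above.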
./nvdla_runtime -loadable output.protobuf
Usage: ./nvdla_runtime [-options] --loadable <loadable>
where options include:
    -h                       print this help message
    -s                       launch test in server mode
    --loadable <loadable>
    --image <image file>
    --imgshift <shift value>
    --imgscale <scale value>
    --imgpower <power value>
    --softmax
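The usage dump above was triggered by the single-dash `-loadable`; the runtime expects double dashes, and the loadable must be the .nvdla flatbuffer, not output.protobuf. A working invocation might look like this (file names are illustrative):

```shell
./nvdla_runtime --loadable default.nvdla --image input.pgm
```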