ratnayak opened this issue 6 years ago
If there's any way we can help in this effort, please let us know, as well. We'd also be happy to take and upstream patches. (Note that there is a very large changeset coming for nv_small; unfortunately, changes in this branch are going to be somewhat difficult to port to nv_small, but we anticipate having a dramatically more open development process in nv_small once we reach our internal quality goals for a first release.)
Hi @ratnayak, I've recently been trying to put nv_full on an FPGA (Xilinx VU13P). I'm doing it according to the following steps:
Now Vivado synthesis has passed, but I don't know whether it's correct or what other defines should be declared.
Hi @ratnayak,
I would be grateful if you could share approximate LUT/DFF usage for specific NVDLA modules, especially CACC and CMAC. As for CMAC, did you use DW/DW02 modules and run everything in PL or have you already done some optimization work to support DSP48 hardware macros? I am curious how much the design could be optimized in terms of LUT usage.
1) We have some preliminary utilization data for the individual partitions of nv_full (from synthesis):
Resource | a | c | m | o | p |
---|---|---|---|---|---|
Slice LUTs | 149452 | 288511 | 403105 | 362374 | 249373 |
Slice Registers | 83028 | 144410 | 55945 | 373342 | 166188 |
F7/F8 Muxes | 320 | 11807 | 1644 | 19257 | 9265 |
Block RAM | 136 | 305 | 0 | 60 | 32 |
DSP Slices | 0 | 79 | 0 | 144 | 236 |
2) We have also done Vivado PnR; the top-level results (which are more accurate) are:
Device: xcvu440flga2892-1, Tool: Vivado PnR
Total CLB LUTs used: 2080842 out of 2532960 available, giving 82.15% utilization.
Total CLB registers used: 826797 out of 5065920 available, giving 16.32% utilization.
Given that the VU440, which is the largest Xilinx FPGA available, is already at 82% utilization, no other FPGA will fit nv_full.
3) Regarding CMACs: we did NOT optimize for DSP slices; the convolution MACs are fully implemented in PL. If you go through the RTL code you will find that NVDLA implements Wallace-tree multipliers. These do not map to DSP slices but are simply implemented in PL. This can be seen in the DSP utilization for partition 'a', the convolution accumulator, in (1) above: partition 'a' contains 1024 16-bit MACs in nv_full but uses no DSPs.
4) We recently programmed our VU440 FPGA with the nv_full bitfile. We are currently carrying out preliminary testing. So far we have verified that the bitfile is correctly programmed into the VU440 and that the clocks are functioning as expected. We'll keep posting updates on our progress.
Please share details of your own implementations, issues you have faced, and solutions you may have proposed, so we can all benefit from our shared work.
We just released nv_small, which might help you fit on a Xilinx part. We'd be interested in any feedback that you have.
Where can I find nv_small? Thanks.
Sorry, please ignore; it's under the master branch.
Hi,
At first I would like to thank @ratnayak for VU440 statistics.
I started working with nv_small. Synthesis has completed successfully and I am now attempting to implement the design for the XCZU9EG (Zynq UltraScale+).
Statistics from Vivado project summary (synthesis only):
Resource | Estimation | Available | Utilization % |
---|---|---|---|
LUT | 76169 | 274080 | 27.79 |
LUTRAM | 204 | 144000 | 0.14 |
FF | 80752 | 548160 | 14.73 |
BRAM | 100.50 | 912 | 11.02 |
DSP | 32 | 2520 | 1.27 |
BUFG | 2 | 404 | 0.50 |
Some observations: in a few places signals declared as `reg` are driven with `assign` statements. I needed to change them to `wire` so that design elaboration could pass (patch for review). UPDATE (2018-06-28): merged #151

Hello mmaciag,
nv_small can be configured to various sizes in order to fit a given FPGA. Did you use the default specs given in nv_small.spec, or did you change any settings? The default number of MACs for nv_small is 64 (8x8), but this can be changed to another configuration. Thanks.
I used default specs. Certainly I will test other configurations.
Hi mmaciag, did you try to create the IP for nv_small with Vivado? If so, how did you manage the connections with the Zynq UltraScale+? Thanks.
(I wrote this message from wrong account, I am sorry if someone got confused)
Yes, I tried to create the IP. At least this is where I started, and I do not have it working on real hardware yet.
With IP Packager it is relatively straightforward. You need a wrapper module for `NV_nvdla` with DBB exposed as an AXI4 master and CSB exposed as an APB slave. For the latter you can instantiate `apb2csb`, which is available in the NVDLA master branch. Obviously you need to expose the clocks, reset, and interrupt as well. Other signals can be hardwired inside the wrapper.
Later, in the block design, such a wrapped DBB connects to any AXI4 slave, e.g. a memory controller. Zynq US+ can expose a slave port to its own PS memory controller; it is all configurable in the block design. APB requires an AXI-to-APB bridge, which can be selected from the IP catalog.
For AXI4 I needed to declare all the signals, otherwise I couldn't make the connections in the block design. The AMBA specification will tell you what to do with unused signals.
Thank you very much, the explanation of the connections was very helpful. Unfortunately, I now have another problem. According to the utilization report, LUT usage is about 160%. This happens because the convolutional buffer is mapped to a lot of LUTs, and the report suggests that no BRAMs are used at all. In the table you showed, BRAMs were used, and I think this is why my design has such a high LUT usage. Is there a way to map that logic onto BRAMs?
You probably did not use the RAMs from ram/fpga/small_rams, did you? These RAMs are not parsed by the Perl scripts, so you won't find them in the outdir.
With properly inferred BRAMs, nv_small very nearly fits on the UltraZed SoM (ZCU3EG)!! The convolutional buffer operates like a quad-or-even-more-port RAM, so it consumes a lot of logic anyway :(
Thanks, it helped a lot
Did you try to create a Linux application with Xilinx SDK? How do you deal with the Linux drivers? The problem is that, if I understood correctly, NVDLA can only be run in a Linux environment, so I cannot create a bare-metal application.
No, I am not far enough along yet to think about Linux drivers. Did you try to program anything at the lowest level? So far I have found there is a problem with accessing CFGROM.
Well, as a matter of fact the maintainers mentioned FreeRTOS somewhere in the documentation. Perhaps DLA is supposed to be portable to various OSes, including bare metal, in the future.
I was thinking of porting some trace_tests to bare C as a starting point. These work with Verilator (well... mostly), so they should pass in real HW as well.
I think we have a greater problem here: there is some sort of high-level CNN compiler and runtime library, which presumably is going to work only (?) with the NVDLAV1 branch. The source code is not published yet, only precompiled files. This raises the question of when we can expect a compiler suitable for nv_small.
I had not considered that point about the compiler; for the moment I was just trying to communicate with the board, and I thought there was some sort of Linux driver that could be used. I did not try to program at the lowest level; how did you access the registers?
For now, a simple pointer cast and dereference; all register definitions are generated in outdir/nv_small/spec/manual/opendla.h.
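In other words, something along these lines (a minimal sketch: the base address is whatever you mapped CSB to in your block design, and the offset shown is a placeholder; take the real ones from the generated opendla.h):

```c
#include <stdint.h>

/* Base address the CSB (behind apb2csb) is mapped to in my block design;
 * this is an assumption -- use whatever address you assigned in Vivado. */
#define NVDLA_BASE        0x80000000UL

/* Hypothetical offset for illustration only; take the real values from
 * outdir/nv_small/spec/manual/opendla.h. */
#define GLB_HW_VERSION_0  0x0000

static inline uint32_t nvdla_reg_read(uint32_t offset)
{
    return *(volatile uint32_t *)(NVDLA_BASE + offset);
}

static inline void nvdla_reg_write(uint32_t offset, uint32_t value)
{
    *(volatile uint32_t *)(NVDLA_BASE + offset) = value;
}

int main(void)
{
    /* Reading the hardware version register is a quick sanity check
     * that the APB/CSB path works at all. */
    uint32_t hw_version = nvdla_reg_read(GLB_HW_VERSION_0);
    (void)hw_version;
    return 0;
}
```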
Actually, I wasn't aware that something is happening in nvdla/sw; the last commit (4 days ago) seems to provide a kernel driver for nv_small.
An update: as I mentioned before, we have implemented nv_full on a Xilinx VU440 (the largest FPGA from Xilinx). We have the host running on an ARM A53 processor on a Xilinx Zynq (ASIC ARM). The two FPGAs are connected via single-ended signalling.
We have brought up the FPGA system and are now in the process of verification. Currently we are running the provided sanity checks for the system in HW: sw-master\regression\flatbufs\kmd
Just wanted to check whether anyone else has completed the full sanity check list with their HW system.
I have not yet tried to perform the verification, because I think I have problems with the hardware implementation. I used the AXI Interconnect block provided by Vivado to connect the DBB interface of the NVDLA to the Zynq, but I am not sure this is correct. Did you use another IP, for example the DMA, or did you make another type of connection?
Hi,
My status is that so far I have managed to run a few verif/tests/trace_tests as a bare-metal app (with some regex work I translated the configuration files into C code). These tests pass with the correct output CRC32.
I needed to disable the CDP and PDP engines, since I am working on the relatively small ZCU3EG and needed extra space for diagnostic IP cores (System ILA, AXI Performance Monitor, etc.).
Most of my time was spent on caching issues on the Cortex-R5; in the end, the cache maintenance functions from the Xilinx SDK helped me a lot in synchronizing the CPU with the updated DRAM content.
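Roughly, the pattern looks like the sketch below (the buffer address and size are placeholders; the two cache calls come from the standalone BSP's xil_cache.h):

```c
#include "xil_cache.h"   /* Xilinx standalone BSP cache maintenance helpers */

/* Placeholder DRAM region shared between the CPU and NVDLA over DBB. */
#define SURFACE_ADDR  0x10000000U
#define SURFACE_SIZE  0x00100000U

void sync_with_nvdla(void)
{
    /* Before starting a layer: push input data out of the R5 caches so
     * NVDLA actually sees it in DRAM. */
    Xil_DCacheFlushRange((INTPTR)SURFACE_ADDR, SURFACE_SIZE);

    /* ... program the NVDLA registers and wait for the done interrupt ... */

    /* After completion: drop stale cache lines so the CPU reads the
     * freshly written output from DRAM instead of an old cached copy. */
    Xil_DCacheInvalidateRange((INTPTR)SURFACE_ADDR, SURFACE_SIZE);
}
```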
For performance reasons, the DBB should probably be interfaced somehow with ACE and cache snooping, but I am not experienced in that area.
@giusecesa4, please see UG1085 (v1.7), December 22, 2017, Figure 1-1.
Zynq US+ can expose several AXI slave ports, some of them with a fast path to the DRAM controller. You don't need an AXI Interconnect in this direction (you will need one for CSB interfacing).
Update: we have successfully run six out of eight sanity checks on our FPGA HW system (host + NVDLA accelerator).
The sanity checks are in sw-master\regression\flatbufs\kmd. These are virtual platform (SW) sanity checks, but once the complete FPGA HW system (host + NVDLA accelerator) is up and running, they are useful for verifying the basic functionality of the HW.
Failed sanity checks: CDP & NN_L0_1_fbuf. We are in the process of debugging these two.
Our system: nv_full running on a Xilinx VU440; the host is an ASIC ARM in a Xilinx Zynq. The two FPGAs are connected via direct links (through some AXI modules).
While waiting for the NVDLA compiler for the small configuration, I was trying to run tests using the UMD with the loadable files available in the regression folder. I get this error:
(DLA_TEST) Error 0x00000004: runtime->load failed (in RuntimeTest.cpp, function loadLoadable(), line 265)
(DLA_TEST) Error 0x00000004: (propagating from RuntimeTest.cpp, function run(), line 318)
(DLA_TEST) Error 0x00000004: (propagating from main.cpp, function launchTest(), line 87)
Does anyone have any idea? Could it be related to the hardware connections?
Hi @ratnayak, we have a problem loading the opendla.ko module, because no device called "nvidia,nvdla_os_initial" (as expected by the module source, file nvdla_core_callbacks.c, line 341) is present in my device tree. I think we missed something in the HW design. Could you please share your device tree, or at least a screenshot of the HW design? Why is the opendla module expecting to find a device with this name? I would appreciate your help very much.
@giusecesa4 are you using the prebuilt opendla.ko or did you build it yourself? Do you see a "nvidia,nvdla_2" device in your device tree?
No, I don't see it; that is the problem. I think I did something wrong in the HW project in Vivado, or maybe I should manually add the node to the device tree. I built opendla.ko myself because the kernel version I use is different from the one the prebuilt file was built against.
Hi @giusecesa4. I work with ratnayak on the same project. I added a device node for NVDLA KMD in the DTS. For example:
nvdla@a0000000 {
compatible = "nvidia,nvdla_os_initial";
interrupt-parent = <0x4>;
interrupts = <0x0 0x59 0x4 0x0 0x59 0x4>;
reg = <0x0 0xa0000000 0x0 0x40000>;
};
Hope this helps.
@qdchau thanks for your answer. I am generating the device tree (together with the image files) with PetaLinux, from a project created with the corresponding .hdf file generated by Vivado. In the device tree, the only automatically generated node I see related to the NVDLA is the following:
NV_NVDLA_apb2csb_0: NV_NVDLA_apb2csb@a0000000 {
compatible = "xilinx, NV-NVDLA-apb2csb-1.0";
reg = <0x0 0xa0000000 0x0 0x10000>;
};
I am working with NVDLA small. Can this node be the same as the one you added, just with a different name? I also have another problem: I do not have any reference to the interrupts, and obviously this gives me problems when inserting the opendla.ko module. Are there some configurations in Vivado that need to be changed in order to get the interrupt properties into the device node? I would appreciate your help very much.
@giusecesa4 We are stuck on the DRM driver, so I don't know whether my device tree is correct or not; take the following with a grain of salt.
If the interrupt pin is correctly connected to the Zynq IP core in the block design, then that information should have been extracted from the HDF file by the PetaLinux build system.
This is what PetaLinux generated for me in the pl.dtsi file:
u_nvdla: FPI_NVDLA_wrapper@80000000 {
compatible = "xlnx,FPI-NVDLA-wrapper-1.1";
interrupt-parent = <&gic>;
interrupts = <0 89 4>;
reg = <0x0 0x80000000 0x0 0x10000>;
};
`u_nvdla` is my arbitrary name for the IP core instance, and `FPI_NVDLA_wrapper` is my arbitrary name for the top Verilog module of the IP core. The compatible field name is also somehow derived from the name of my IP core. As you can see, I also mapped address 0x80000000 to CSB, which is likewise arbitrary.
This file is not supposed to be modified by the user, but you can add your modifications in project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi. My intuition tells me that I only need to change the compatible field:
&u_nvdla {
compatible = "nvidia,nvdla_os_initial";
};
Hi @giusecesa4. Since our project uses nv_full with Xilinx VU440 (separate FPGA from ZynqMP), the setup is different than yours in that NVDLA will not be imported from the HDF file to the device tree. I think the node you shared could be what you are looking for if your design mapped CSB to 0xa0000000. You probably need to double-check your interrupt connection to Zynq like mmaciag suggested if you don't see it in the device node. I find PetaLinux difficult to work with so I typically modify the DTS directly and generate new DTBs to test via dtc.
Thank you for your answers. The problem I have is that the device tree does not represent the NVDLA block, only the APB2CSB block that is directly connected to the Zynq. Since the intr port is on the NVDLA IP, it is not present in the device tree. Did you create a single IP containing both APB2CSB and NVDLA and then connect it to the Zynq? I worked with them separately (putting both of them in the final block design), and maybe this is my problem.
My module wraps both APB2CSB and NVDLA, so yes, maybe that is the problem.
Thank you @mmaciag, with the wrapper I don't have the interrupt problem anymore! I saw from your previous messages that you had some problems with the DRM. I also get an error when I try to insert the opendla.ko module: it says something like "failed to register drm device". I think this problem is due to this call (in KMD, file nvdla_gem.c):
/**
* TODO Register separate driver for memory and use DT node to
* read memory range
*/
dma = dma_declare_coherent_memory(drm->dev, 0xC0000000, 0xC0000000,
                                  0x40000000,
                                  DMA_MEMORY_MAP | DMA_MEMORY_EXCLUSIVE);
if (!(dma & DMA_MEMORY_MAP)) {
        err = -ENOMEM;
        goto unref;
}
Did you find the same problem?
Yes! But changing the address to a range physically mapped to the PS DDR controller did not fix the problem easily. For now I have a quick workaround: I have simply forbidden the kernel from allocating the physical range 0x70000000-0x7FFFFFFF (the highest 256 MB of the 2 GB of available space). Otherwise `dma_declare_coherent_memory` would fail anyway. I guess there is a smarter way, e.g. following the comment about the DT node, but for now I can live with that :)
For address range exclusion I followed this page: http://www.wiki.xilinx.com/Linux+Reserved+Memory
By the way, I don't recall the "failed to register drm device" message; I was rather experiencing an utter crash of the entire kernel. If I am correct, 0xC0000000 is mapped to one of the PL AXI4 slave ports; if you don't have anything there, bad things are going to happen.
@mmaciag Can you submit a pull request for the reg patch?
Thanks, joshua
@mmaciag, I think I have not solved the reserved memory problem. I tried to follow all three examples explained in the tutorial you pointed me to, but without finding the correct solution.
I don't understand what exactly to do with a reserved memory region. Should I reserve the memory for my specific driver, i.e. the NVDLA wrapper, or should I create a DMA shared memory pool?
Also, at this point should the function `dma_declare_coherent_memory` still be used or not? If I try to run the code from the tutorial I get a segmentation fault.
Hi, I intend to implement an NVDLA system using a KCU1500 board (https://www.xilinx.com/products/boards-and-kits/dk-u1-kcu1500-g.html#hardware) with an XCKU115-2FLVB2104E FPGA. The host processor would be a Xeon, and I am still not sure about the DDR4; I would probably go with the one on the board. I have been following this thread for quite a while now, and I noticed that quite a few of you have been implementing the system on a Zynq board. I understand the appeal of that, but I was thinking of using PCIe to communicate with the host processor. So, since you have been implementing this system (and encountering some of the issues along the way), I would like to know how difficult it would be to use PCIe for this purpose, especially regarding the modifications to the Linux drivers. I am not very experienced with Linux device drivers, hence the question. Thanks in advance for any replies.
Hi, NVDLA uses AMBA-compliant protocols: AXI4 and, with a little effort, APB. I am not a PCIe expert, but I think it reduces to implementing some sort of PCIe-to-APB bridge with interrupt support; on the other side it would just be visible as a memory-mapped device. I think @ratnayak has a use case more in common with yours, because he runs a Zynq, a Virtex, and some protocol in between.
Hi @mmaciag, I still have some problems with the allocation of the reserved memory. I get a warning when calling the function `dma_declare_coherent_memory`:
memremap attempted on ram 0x70000000 size: 0x10000000
and the function returns 22, which is an error value. I have a lot of doubts about which kind of reserved memory I have to use; could you please give me some hints?
Can you share your device tree modifications? I cannot think of any other changes I made to get something working beyond the reserved-memory node in the device tree, the NVDLA node compatible field, and the `dma_declare_coherent_memory` address change from 0xC0000000 to 0x70000000.
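For reference, my local change in nvdla_gem.c looks roughly like the snippet below. The base address and the 256 MB size match my reservation (yours will differ), and the flag-style return check matches the kernel version I am using:

```c
/* nvdla_gem.c -- point the coherent pool at the 256 MB of PS DDR that was
 * excluded from the kernel with the reserved-memory node (local change;
 * addresses and size depend on your board and on your reservation). */
dma = dma_declare_coherent_memory(drm->dev, 0x70000000, 0x70000000,
                                  0x10000000,
                                  DMA_MEMORY_MAP | DMA_MEMORY_EXCLUSIVE);
if (!(dma & DMA_MEMORY_MAP)) {
        err = -ENOMEM;
        goto unref;
}
```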
Did you compile the module and the image for the NVDLA using PetaLinux? The precompiled opendla.ko was built against Linux kernel 4.13.3, where the function `dma_declare_coherent_memory` returned a different value than in version 4.14, which I am using. I am afraid this could be the problem.
How did you compile the opendla.ko module?
@giusecesa4 Ouch!... I am using PetaLinux 2017.4, based on kernel 4.9; that was quite basic information worth mentioning :). You are presumably using 2018.1 or 2018.2 with the kernel updated to 4.14, so of course there could be differences. The same happened with the DRM API: `drm_gem_object_put_unlocked` was only introduced in 4.12, so I was forced to change it to `drm_gem_object_unreference_unlocked`. I am using 2017.4 because I have a stable BSP for that version.
Status update:
Recently I have successfully run all the available flatbuf tests for nv_small (in regression\flatbufs\kmd) except for NN: CONV, SDP, PDP and CDP. To go further I would need the Caffe compiler, or at least another set of precompiled loadables, preferably an entire deep network, most preferably AlexNet. (There is an NN flatbuf, actually.)
I am also getting consistent results from running verif/trace_tests/nv_small both in Verilator and on the Zynq RPU (Cortex-R5), which is my 'sandbox' processor where I am testing low-level access to NVDLA. Four tests fail CRC and two run into deadlock; this is also consistent between Verilator and the Zynq! I am going to open a new issue on that soon, because it fails even on the vanilla master branch.
EDIT: note about NN flatbuf.
Thank you very much! Did you also modify something in the code of the runtime test? I see there are some files taken from the Linux kernel there as well (like drm.h and drm_mode.h), and of course a lot of calls to kernel functions.
Not really. In UMD I only changed one hard-coded path in ErrorMacros.h, just to make sure it won't fail on a non-existent location. Do not forget about compiling KMD with the DLA_SMALL_CONFIG flag! I lost a few hours debugging KMD trying to access non-existent nv_full hardware.
-#define NVDLA_UTILS_ERROR_PATH "nvidia/tegra/cv/dla/"
+#define NVDLA_UTILS_ERROR_PATH "/var/log/dla/"
OK. I think I finally succeeded in inserting the module, but now the UMD gets stuck at this point:
Launching test
creating new runtime context...
Emulator starting
submitting tasks...
[ 60.560536] Enter:dla_read_network_config
[ 60.570092] Exit:dla_read_network_config status=0
[ 60.580323] Enter: dla_initiate_processors
[ 60.589960] Enter: dla_submit_operation
[ 60.596205] Prepare Convolution operation index 0 ROI 0 dep_count 1
[ 60.602453] Enter: dla_prepare_operation
Then, after that, this message appears from time to time; I think the process stalls at some point:
INFO: rcu_sched detected stalls on CPUs/tasks:
[ 81.592957] 1-...: (6 ticks this GP) idle=74a/140000000000000/0 softirq=1169/1169 fqs=208
[ 81.601287] (detected by 2, t=5252 jiffies, g=15, c=14, q=2)
[ 81.607015] Task dump for CPU 1:
[ 81.610225] nvdla_runtime R running task 0 2183 2171 0x00000002
[ 81.617258] Call trace:
[ 81.619695] [ ] __switch_to+0x98/0xb0
[ 81.624811] [<000000000000001d>] 0x1d
Did you find a similar problem?
Not at this stage. The user mode driver communicates over the /dev/dri/renderD128 device file. Maybe you should check whether the DRM device is correctly registered; this is not reported in dmesg... at least not in PetaLinux 2017.4.
Before inserting opendla.ko:
root@zynq:~# ls /sys/kernel/debug/dri/
0 64
root@zynq:~# ls /dev/dri/*
/dev/dri/card0 /dev/dri/controlD64
root@zynq:~#
After inserting opendla.ko:
root@zynq:~# insmod /opt/lib/modules/opendla.ko
[ 206.008280] opendla: loading out-of-tree module taints kernel.
root@zynq:~# ls /sys/kernel/debug/dri/
0 1 128 64
root@zynq:~# ls /dev/dri/*
/dev/dri/card0 /dev/dri/card1 /dev/dri/controlD64 /dev/dri/renderD128
root@zynq:~# ls /sys/kernel/debug/dri/
0/ 1/ 128/ 64/
root@zynq:~# cat /sys/kernel/debug/dri/128/name
nvdla dev=80000000.FPI_NVDLA_wrapper unique=80000000.FPI_NVDLA_wrapper
root@zynq:~#
Here is my little script to automate the flatbufs tests a bit:
#!/bin/sh
export LD_LIBRARY_PATH=/opt/lib/
RUNTIME=/opt/bin/nvdla_runtime
FLATBUFS=/opt/regression/flatbufs/kmd
TESTS="PDP/PDP_L0_0_small_fbuf CONV/CONV_D_L0_0_small_fbuf SDP/SDP_X1_L0_0_small_fbuf CDP/CDP_L0_0_small_fbuf"
echo "== FLATBUF tests for nv_small =="
echo "================================"
for test in ${TESTS} ; do
echo "= Run ${test}"
${RUNTIME} --loadable ${FLATBUFS}/${test}
done
And results:
root@zynq:~# ./run.sh
== FLATBUF tests for nv_small ==
================================
= Run PDP/PDP_L0_0_small_fbuf
creating new runtime context...
Emulator starting
submitting tasks...
Test pass
= Run CONV/CONV_D_L0_0_small_fbuf
creating new runtime context...
Emulator starting
submitting tasks...
Test pass
= Run SDP/SDP_X1_L0_0_small_fbuf
creating new runtime context...
Emulator starting
submitting tasks...
Test pass
= Run CDP/CDP_L0_0_small_fbuf
creating new runtime context...
Emulator starting
submitting tasks...
Test pass
root@zynq:~#
Unfortunately, at a certain moment I disabled all debug prints in the kernel module, so I don't have a dmesg from it.
https://gist.github.com/mmaciag/976cb7121acf066d0a3a84d210474a6f
Yes, I already considered this point, and I verified that renderD128 is correctly present in the /dev/dri folder. I think my problem is related to another issue: the runtime stalls when it tries to start executing the operation (at the `submit` function).
I thought the problem could be related to something wrong in the hardware, since `submit` is a blocking function. I found that timing in Vivado was not closed (I have problems with negative hold slack, but unfortunately Vivado generates the bitstream anyway, so I did not notice this before), and maybe this is why my application stalls. Did you find any timing issues in Vivado? At what frequency are you working? How did you manage all the clocks that have to be connected?
What are your clock constraints? Do you use the VLIB_BYPASS_POWER_CG macro?
I started from a very permissive 20 MHz clock constraint (later 50 MHz), although it should run at 90 MHz according to the timing report. This is for a slow -1 speed grade device. For larger devices (I am testing an XCZU3EG now) and speed grade -2, Vivado achieves 150 MHz with little effort, and that is without any attempt to optimize the design internally. With a bit more effort I think it is possible to push above 200 MHz, but that is not my goal at the moment.
Thanks for your support!! I succeeded in running the same flatbufs tests you ran! Let us hope the compiler is coming soon. In the folder regression/flatbufs/kmd/NN there is one test that is supposed to be the complete AlexNet for the small architecture. I tried to run that test but it stalls. Did you try it?
This is an inquiry to the wider audience working on getting NVDLA running on an FPGA platform. We'd like to share what we are doing and check on the progress of other groups out there.
We are putting nv_full (as nv_small is not yet available) on a Xilinx VU440 (the largest FPGA from Xilinx). nv_full fits comfortably on the VU440, utilizing about 82% of the LUTs as per Vivado implementation. We'd have the host running on an ARM A53 processor on a Xilinx Zynq (ASIC ARM). The signalling between the two FPGAs, Zynq (host) and VU440 (nv_full), will be via direct parallel connections (simple & straightforward), and we have enough IOs for this.
a) We'd like to know if anyone out there has already implemented nv_full on an FPGA platform and succeeded in getting it running as expected. b) We would very much like to sync with you, whether you have it running or still have some distance to go; perhaps we can discuss details, share inputs on issues encountered, how you resolved them, etc.
Looking forward to hearing from you.