open-power / snap

CAPI SNAP Framework Hardware and Software
Apache License 2.0

CAPI2.0 snap_core sends 'U' to ah_cea #789

Closed luyong6 closed 6 years ago

luyong6 commented 6 years ago

In branch "Add_hdl_helloworld" there may be a bug in the DMA logic.

Reproduce the issue

Please select Action Type (HDL Action - manually set ACTION_ROOT in snap_env.sh!) in make snap_config and set export ACTION_ROOT=${SNAP_ROOT}/actions/hdl_helloworld in snap_env.sh, and select a CAPI2 card.

Please go to actions/hdl_helloworld/ip and create a Vivado project, import the xci file under fifo_sync_32_512i512o directory and upgrade ip to your selected FPGA part and generate output files.

Then run make model and make sim. In the window that pops up, run snap_maint -v and then hdl_helloworld. You will see that the simulation is terminated partway through:

ncsim: *E,TRRANGEC: range constraint violation.
          File: /afs/vlsilab.boeblingen.ibm.com/u/luyong/capi/vol3/snap/hardware/hdl/core/dma.vhd, line = 1615, pos = 28
         Scope: top.a0:snap_core_i:dma:dma_fsm
          Time: 14242 NS + 1

/afs/vlsilab.boeblingen.ibm.com/u/luyong/capi/vol3/snap/hardware/hdl/core/dma.vhd:1615             dma_wreq_cnt_q <= dma_wreq_cnt_q - 1;

PSLSE also reports:

14126000: ah_cea has either X or Z value =0xffffffffffffffc0
14126000: Command Valid: ccom=0x1f01
14130000: ah_cea has either X or Z value =0xffffffffffffffc0
14130000: Command Valid: ccom=0x1f01

The same hdl_helloworld code has been simulated on CAPI1.0 card S121B and runs well on hardware. So I am afraid there may be a bug around the 2.0 dma logic.

Or simply, you can directly go to my simulation output dir:

/afs/vlsilab.boeblingen.ibm.com/u/luyong/capi/vol3/snap/hardware/20180725_095452

The function is simple:

top.a0:action_w.action_hdl_helloworld.msnap_action_shim.maxi_lite_slave.pattern_source_address[63:0]
top.a0:action_w.action_hdl_helloworld.msnap_action_shim.maxi_lite_slave.pattern_target_address[63:0]
top.a0:action_w.action_hdl_helloworld.msnap_action_shim.maxi_lite_slave.pattern_total_number[63:0] 

tells the source/target address and how many bytes to copy.

I checked the behavior on the AXI bus and it seems OK. I don't know why ah_cea shows two cycles of yellow line in the waveform. msnap_action_shim sends write operations with a burst length of 64, each beat 64B. snap_core will convert them to 512B writes. ah_com = 1F01.
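The burst sizes mentioned above can be checked with a little arithmetic. The following is a minimal Python sketch; the helper name `split_burst` is hypothetical, and it assumes that a 64-beat burst is split evenly into 512B writes, which the post does not spell out.

```python
def split_burst(beats, beat_bytes, chunk_bytes):
    """Return (total bytes, number of chunks, beats per chunk) for one burst.

    Assumes the burst is split evenly into fixed-size chunks; this is an
    illustration, not a description of the snap_core implementation.
    """
    total = beats * beat_bytes
    if total % chunk_bytes or chunk_bytes % beat_bytes:
        raise ValueError("burst does not divide evenly into chunks")
    return total, total // chunk_bytes, chunk_bytes // beat_bytes


# 64 beats of 64B each, split into 512B writes
total, chunks, beats_per_chunk = split_burst(64, 64, 512)
print(total, chunks, beats_per_chunk)
```

Under these assumptions one AXI burst carries 4096 bytes and corresponds to eight 512B writes of eight beats each.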

ThomasFuchs commented 6 years ago

The AXI write protocol allows two different orderings. The first is sending a request that includes the address and the amount of data, and then the data. The second is sending the data first, and then the request with the address and data length. The current DMA implementation can only handle the first protocol.
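To illustrate why the ordering matters, here is a minimal Python sketch. It is not the dma.vhd logic: the class names, the event names "AW" (request accepted) and "WLAST" (last data beat accepted), and the bound of 4 outstanding requests are all assumptions made for the example. It models a counter with a lower bound of zero, similar in spirit to a ranged integer in VHDL, and shows that such a counter fails when data arrives before the request.

```python
class RangeError(Exception):
    """Models a range constraint violation on a bounded integer."""


class RequestFirstCounter:
    """Hypothetical counter that only tolerates request-before-data ordering."""

    def __init__(self, max_outstanding=4):
        self.max_outstanding = max_outstanding
        self.count = 0  # outstanding requests, allowed range 0..max_outstanding

    def on_event(self, event):
        if event == "AW":        # write request (address, length) accepted
            self._set(self.count + 1)
        elif event == "WLAST":   # last data beat of a burst accepted
            self._set(self.count - 1)

    def _set(self, value):
        if not 0 <= value <= self.max_outstanding:
            raise RangeError(f"value {value} outside 0..{self.max_outstanding}")
        self.count = value


class EitherOrderCounter:
    """Hypothetical variant that counts requests and data bursts separately."""

    def __init__(self):
        self.requests = 0
        self.bursts_done = 0

    def on_event(self, event):
        if event == "AW":
            self.requests += 1
        elif event == "WLAST":
            self.bursts_done += 1

    def matched(self):
        # A write can be issued once both its request and its data were seen.
        return min(self.requests, self.bursts_done)


def run(counter, events):
    """Feed a sequence of events to a counter and return the counter."""
    for event in events:
        counter.on_event(event)
    return counter
```

With this model, `run(RequestFirstCounter(), ["AW", "WLAST"])` completes normally, while `run(RequestFirstCounter(), ["WLAST", "AW"])` raises `RangeError` on the first event. `EitherOrderCounter` accepts both sequences because neither count is ever decremented.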

ThomasFuchs commented 6 years ago

I have implemented support for the second protocol. The new logic is available in the branch Issue_789. Please test.

luyong6 commented 6 years ago

/afs/vlsilab.boeblingen.ibm.com/u/luyong/capi/vol3/snap/hardware/20180729_052030

Hi Thomas, I copied the two files capi_dma10.vhd_source and capi_dma20.vhd_source into branch hdl_helloworld, and the simulation on CAPI2.0 shows a hang in the waveform. Can you have a look? Also copying @xpower2d.

xpower2d commented 6 years ago

@luyong6 @ThomasFuchs I think there's a problem in the AXI4 read channel, where RLAST asserts every 8 transfers instead of only once at the end of the current burst. The AXI4 protocol stipulates that RLAST must assert when it's "driving the final read transfer in the burst."

The DMA module depends on RLAST to tell when a burst is done. In the given case the read state machine lost track and went into a deadlock because of the extra RLAST assertions.
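The RLAST rule can be expressed as a small check over a captured burst. This is a minimal Python sketch for illustration only; the function name `rlast_violations` is hypothetical, and the 64-beat burst in the usage example is an assumed length, not one taken from the waveform.

```python
def rlast_violations(arlen, rlast_flags):
    """Return the beat indices where RLAST differs from the AXI4 rule.

    arlen: AXI4 ARLEN value, i.e. the number of beats in the burst minus one.
    rlast_flags: the RLAST value observed on each accepted read beat.
    """
    beats = arlen + 1
    if len(rlast_flags) != beats:
        raise ValueError(f"expected {beats} beats, got {len(rlast_flags)}")
    # RLAST must be high on the final beat and low on every other beat.
    return [i for i, flag in enumerate(rlast_flags) if flag != (i == beats - 1)]


# Assumed 64-beat burst in which RLAST is asserted on every 8th beat
faulty = [(i % 8) == 7 for i in range(64)]
print(rlast_violations(63, faulty))

# Same burst with RLAST asserted only on the final beat
correct = [i == 63 for i in range(64)]
print(rlast_violations(63, correct))
```

For the faulty pattern the check flags beats 7, 15, 23, 31, 39, 47 and 55. Beat 63 is not flagged because RLAST is legitimately high there.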


ThomasFuchs commented 6 years ago

@xpower2d, thanks for the good hint. I was searching for the failure on the write channel. You are right, the root cause is on the read channel. A small cut-and-paste error in the RLAST logic caused the trouble! Please rerun with my latest fix.

luyong6 commented 6 years ago

@ThomasFuchs I copied the dma_buff file and ran it again. The deadlock is still there. /afs/vlsilab.boeblingen.ibm.com/u/luyong/capi/vol3/snap/hardware/20180731_053707

ThomasFuchs commented 6 years ago

@luyong6 next try ;-)

ThomasFuchs commented 6 years ago

@luyong6, have you tested the latest version of the fix? It is available in the branch Issue_789_rebaseFixbsp.

luyong6 commented 6 years ago

Hi @ThomasFuchs, this branch doesn't have hdl_helloworld, which is needed to verify your fixes. How about making a merge? Or would you like to check the other commits here? Issue_789_rebaseFixbsp ==> Add_again_hdl_helloworld

You can see them in gitk tool.

ThomasFuchs commented 6 years ago

Hi @luyong6, Sven rebased the branch Add_again_hdl_helloworld into master, so the fix for this problem should also be in master. Could you test it again and close the issue if the tests pass?

luyong6 commented 6 years ago

Yes, it has already been fixed. Thank you. Closing it.