nvdla / hw

RTL, Cmodel, and testbench for NVDLA
Other
1.74k stars 568 forks source link

Data mismatch error when running Googlenet sample #85

Closed jeremychritz closed 6 years ago

jeremychritz commented 6 years ago

I've been able to run the other samples, but the Googlenet is failing:

614610.00ns MSEQ: ERROR: Read to address 0xffff0003 did not return the expected value 0xffffffff after 50000 attempts (last read 0x00010000), failing.
$finish called from file "/home/jchritz/nvdla/hw/verif/synth_tb/csb_master_seq.v", line 274.
$finish at simulation time     615110.00ns
           V C S   S i m u l a t i o n   R e p o r t 

Here is the full log: test.log

Here is the command I'm using to run the simulation:

make run TESTDIR=../traces/traceplayer/googlenet_conv2_3x3_int16/ TESTNAME=googlenet_conv2_3x3_int16

Here is what I have set in my tree.make for project:

##=======================                                                                                 
## Project Name Setup, multiple projects supported                                
##=======================                                                                                 
PROJECTS := nv_full

Any ideas on what is going on?

jwise commented 6 years ago

Oop, sorry for the delay on this. That's kind of surprising. Hmm. Do any of the other tests pass (for example, the other sanity tests)? The tests pass last I ran them here, and we haven't changed anything recently...

jeremychritz commented 6 years ago

@jwise Thanks for getting back!

The other test that fails is sanity3_cvsram, I get a similar error:

542530.00ns MSEQ: Read mismatched, retrying (addr: 0xffff0003, received data:  0x00140001, expected data: 0xffffffff.
542610.00ns MSEQ: Retrying read at 0xffff0003 (returned 0x00140001, expected 0xffffffff)...
544890.00ns MSEQ: ERROR: Read to address 0xffff0003 did not return the expected value 0xffffffff after 50000 attempts (last read 0x00140001), failing.

This is the command I ran:

make run TESTDIR=../traces/traceplayer/sanity3_cvsram/ TESTNAME=sanity3_cvsram

All other tests passed just the googlenet and the sanity3_cvsram fail.

Just as a sanity this is the git branch info:

* nvdlav1 2a2c725 Fix a URL typo.

This is the VCS version:

Compiler version M-2017.03-SP2-3_Full64; Runtime version M-2017.03-SP2-3_Full64;  Jan 22 21:50 2018
VCD+ Writer M-2017.03-SP2-3_Full64 Copyright 

I also get the same results with Xilinx's 2017.1 Vivado Simulator, I don't think it is a simulator version?

I'm also diving into the RTL to see if I can better diagnose the problem will report back.

jwise commented 6 years ago

That's very weird. This goes along with:

read_reg 0xffff0003 0x0 0xffffffff #NVDLA_GLB.S_INTR_STATUS_0clear interrupts

Which is trying to read any value from 0xffff0003, which is the interrupt status register -- the 0x0 there means that the mask is 0 (anything should be accepted), and the 0xffffffff is a dummy. So the first read should always succeed. It's sounding to me like the Perl script that compiles an input.txn into a synthtb memory dump is confused somehow, and is not properly parsing the input.txn. Maybe the best place to look is to instrument verif/synth_tb/sim_scripts/inp_txn_to_hexdump.pl, and see if it's correctly parsing that (and, if not, why not)?

jeremychritz commented 6 years ago

Just for clarification:

  1. Is the synthtb memory dump the input.txn.raw file?
  2. Any chance you can send me the input.txn.raw from your run so I can do a diff?
  3. For the parsing I would look at the decoding commands in the comments of inp_txn_to_hexdump.pl ?
jwise commented 6 years ago

Yes, input.txn.raw is what I refer to. Here's an input.txn.raw.txt from sanity3_cvsram, generated with something like:

jwise@jwise-dt:~/nvdla/hw/verif/traces/traceplayer$ ../../synth_tb/sim_scripts/inp_txn_to_hexdump.pl sanity3_cvsram

Yes, inp_txn_to_hexdump.pl would be the way to go.

jeremychritz commented 6 years ago

Thanks! However, I'm not seeing any difference, here is my input.txn.raw.jc.txt. Files are identical.

For sanity3_cvsram, I'm still seeing this error:

542650.00ns MSEQ: Retrying read at 0xffff0003 (returned 0x00140001, expected 0xffffffff)...
544930.00ns MSEQ: ERROR: Read to address 0xffff0003 did not return the expected value 0xffffffff after 50000 attempts (last read 0x00140001), failing.

I did trace the 0x00140001 back to NV_NVDLA_csb_master.v, line 2234:

assign core_resp_pd = ({34 {afbif_resp_pvld}} & afbif_resp_pd)

I added some debug code to capture that value:

//JBC Debug
always @(posedge nvdla_core_clk) begin
    if ( core_resp_pvld ) begin
             $display("%0t {34 {afbif_resp_pvld}} & afbif_resp_pd)       : 0x%08x", $time, ({34 {afbif_resp_pvld}} & afbif_resp_pd)       );
             $display("%0t {34 {cmac_a_resp_valid}} & cmac_a_resp_pd)    : 0x%08x", $time, ({34 {cmac_a_resp_valid}} & cmac_a_resp_pd)    );
             $display("%0t {34 {sdp_rdma_resp_valid}} & sdp_rdma_resp_pd): 0x%08x", $time, ({34 {sdp_rdma_resp_valid}} & sdp_rdma_resp_pd));
             $display("%0t {34 {csc_resp_valid}} & csc_resp_pd)          : 0x%08x", $time, ({34 {csc_resp_valid}} & csc_resp_pd)          );
             $display("%0t {34 {gec_resp_valid}} & gec_resp_pd)          : 0x%08x", $time, ({34 {gec_resp_valid}} & gec_resp_pd)          );
             $display("%0t {34 {cdp_resp_valid}} & cdp_resp_pd)          : 0x%08x", $time, ({34 {cdp_resp_valid}} & cdp_resp_pd)          );
             $display("%0t {34 {cacc_resp_valid}} & cacc_resp_pd)        : 0x%08x", $time, ({34 {cacc_resp_valid}} & cacc_resp_pd)        );
             $display("%0t {34 {glb_resp_valid}} & glb_resp_pd)          : 0x%08x", $time, ({34 {glb_resp_valid}} & glb_resp_pd)          );
             $display("%0t {34 {cvif_resp_valid}} & cvif_resp_pd)        : 0x%08x", $time, ({34 {cvif_resp_valid}} & cvif_resp_pd)        );
             $display("%0t {34 {cmac_b_resp_valid}} & cmac_b_resp_pd)    : 0x%08x", $time, ({34 {cmac_b_resp_valid}} & cmac_b_resp_pd)    );
             $display("%0t {34 {pdp_resp_valid}} & pdp_resp_pd)          : 0x%08x", $time, ({34 {pdp_resp_valid}} & pdp_resp_pd)          );
             $display("%0t {34 {cdma_resp_valid}} & cdma_resp_pd)        : 0x%08x", $time, ({34 {cdma_resp_valid}} & cdma_resp_pd)        );
             $display("%0t {34 {sdp_resp_valid}} & sdp_resp_pd)          : 0x%08x", $time, ({34 {sdp_resp_valid}} & sdp_resp_pd)          );
             $display("%0t {34 {bdma_resp_valid}} & bdma_resp_pd)        : 0x%08x", $time, ({34 {bdma_resp_valid}} & bdma_resp_pd)        );
             $display("%0t {34 {pdp_rdma_resp_valid}} & pdp_rdma_resp_pd): 0x%08x", $time, ({34 {pdp_rdma_resp_valid}} & pdp_rdma_resp_pd));
             $display("%0t {34 {mcif_resp_valid}} & mcif_resp_pd)        : 0x%08x", $time, ({34 {mcif_resp_valid}} & mcif_resp_pd)        );
             $display("%0t {34 {cdp_rdma_resp_valid}} & cdp_rdma_resp_pd): 0x%08x", $time, ({34 {cdp_rdma_resp_valid}} & cdp_rdma_resp_pd));
             $display("%0t {34 {rbk_resp_valid}} & rbk_resp_pd)          : 0x%08x", $time, ({34 {rbk_resp_valid}} & rbk_resp_pd)          );
             $display("%0t {34 {dummy_resp_valid}} & dummy_resp_pd);     : 0x%08x", $time, ({34 {dummy_resp_valid}} & dummy_resp_pd)      );
    end
end

Here is the test.log from that run.

At the end of that log file:

542290.00ns {34 {afbif_resp_pvld}} & afbif_resp_pd)       : 0x00000000
542290.00ns {34 {cmac_a_resp_valid}} & cmac_a_resp_pd)    : 0x00000000
542290.00ns {34 {sdp_rdma_resp_valid}} & sdp_rdma_resp_pd): 0x00000000
542290.00ns {34 {csc_resp_valid}} & csc_resp_pd)          : 0x00000000
542290.00ns {34 {gec_resp_valid}} & gec_resp_pd)          : 0x00000000
542290.00ns {34 {cdp_resp_valid}} & cdp_resp_pd)          : 0x00000000
542290.00ns {34 {cacc_resp_valid}} & cacc_resp_pd)        : 0x00000000
542290.00ns {34 {glb_resp_valid}} & glb_resp_pd)          : 0x00140001
542290.00ns {34 {cvif_resp_valid}} & cvif_resp_pd)        : 0x00000000
542290.00ns {34 {cmac_b_resp_valid}} & cmac_b_resp_pd)    : 0x00000000
542290.00ns {34 {pdp_resp_valid}} & pdp_resp_pd)          : 0x00000000
542290.00ns {34 {cdma_resp_valid}} & cdma_resp_pd)        : 0x00000000
542290.00ns {34 {sdp_resp_valid}} & sdp_resp_pd)          : 0x00000000
542290.00ns {34 {bdma_resp_valid}} & bdma_resp_pd)        : 0x00000000
542290.00ns {34 {pdp_rdma_resp_valid}} & pdp_rdma_resp_pd): 0x00000000
542290.00ns {34 {mcif_resp_valid}} & mcif_resp_pd)        : 0x00000000
542290.00ns {34 {cdp_rdma_resp_valid}} & cdp_rdma_resp_pd): 0x00000000
542290.00ns {34 {rbk_resp_valid}} & rbk_resp_pd)          : 0x00000000
542290.00ns {34 {dummy_resp_valid}} & dummy_resp_pd);     : 0x00000000
542570.00ns MSEQ: Read mismatched, retrying (addr: 0xffff0003, received data:  0x00140001, expected data: 0xffffffff.
542650.00ns MSEQ: Retrying read at 0xffff0003 (returned 0x00140001, expected 0xffffffff)...
544930.00ns MSEQ: ERROR: Read to address 0xffff0003 did not return the expected value 0xffffffff after 50000 attempts (last read 0x00140001), failing.
545410.00ns {34 {afbif_resp_pvld}} & afbif_resp_pd)       : 0x00000000
545410.00ns {34 {cmac_a_resp_valid}} & cmac_a_resp_pd)    : 0x00000000
545410.00ns {34 {sdp_rdma_resp_valid}} & sdp_rdma_resp_pd): 0x00000000
545410.00ns {34 {csc_resp_valid}} & csc_resp_pd)          : 0x00000000
545410.00ns {34 {gec_resp_valid}} & gec_resp_pd)          : 0x00000000
545410.00ns {34 {cdp_resp_valid}} & cdp_resp_pd)          : 0x00000000
545410.00ns {34 {cacc_resp_valid}} & cacc_resp_pd)        : 0x00000000
545410.00ns {34 {glb_resp_valid}} & glb_resp_pd)          : 0x00000000
545410.00ns {34 {cvif_resp_valid}} & cvif_resp_pd)        : 0x00000000
545410.00ns {34 {cmac_b_resp_valid}} & cmac_b_resp_pd)    : 0x00000000
545410.00ns {34 {pdp_resp_valid}} & pdp_resp_pd)          : 0x00000000
545410.00ns {34 {cdma_resp_valid}} & cdma_resp_pd)        : 0x00000000
545410.00ns {34 {sdp_resp_valid}} & sdp_resp_pd)          : 0x00000000
545410.00ns {34 {bdma_resp_valid}} & bdma_resp_pd)        : 0x00000000
545410.00ns {34 {pdp_rdma_resp_valid}} & pdp_rdma_resp_pd): 0x00000000
545410.00ns {34 {mcif_resp_valid}} & mcif_resp_pd)        : 0x00000000
545410.00ns {34 {cdp_rdma_resp_valid}} & cdp_rdma_resp_pd): 0x00000000
545410.00ns {34 {rbk_resp_valid}} & rbk_resp_pd)          : 0x00000000
545410.00ns {34 {dummy_resp_valid}} & dummy_resp_pd);     : 0x00000000

When the data mismatch is display:

542290.00ns {34 {glb_resp_valid}} & glb_resp_pd)          : 0x00140001

After 50000 attempts then:

545410.00ns {34 {glb_resp_valid}} & glb_resp_pd)          : 0x00000000

I think the 0x00000000 is the correct value, but regardless it does change after it failed.

Does this shed any light?

jwise commented 6 years ago

Hmm, ok, hmmm. Um, @zdraw, got any ideas?

zdraw commented 6 years ago

Seeing the same issue. 0x00000000 is the correct value and 0xFFFFFFFF is supposed to be the mask so that error message is wrong. Or at least points to the issue. If you look in verif/synth_tb/sim_scripts/inp_txn_to_hexdump.pl and find where we parse for "read_reg", the order of the arguments is $address = $values[1]; $bitmask = $values[2]; $cmp_mode = "=="; $exp_data = $values[3];

So bitmask (0xffffffff) should be second and exp_data (0x0) third. In sanity3_cvsram and googlenet, these got flipped.

jwise commented 6 years ago

Ok, man is this embarrassing. I apparently screwed this up when I made the Verilator changes.

Commit a8fcfc3 fixes this. @jeremychritz , can you verify that that fixes it for you, too?

jeremychritz commented 6 years ago

Yes, it works here are the log files from googlnet and sanity3_cvsram:

Anything else or can we close this issue?

jwise commented 6 years ago

Looks good to me. I'll close it. Sorry for the mess!