RARP sometimes doesn't work after power-up or reboot of FPGA

mkrivda commented 6 months ago

Hello.

After power-up or reboot of the FPGA we have an issue with RARP. Sometimes it does not work, and it is not clear why. When we check the FPGA via JTAG, we always see that it was programmed correctly. We are using ipbus fw ver. 1.13.

Marian

dpcsankey commented 6 months ago

Tell me more. Is it one particular module, or a random module across a whole set (how large)? Do you see the RARP requests going out on the network? Does the module think that it's got an IP address?

mkrivda commented 6 months ago

A random module from a set of ~20 modules. We don't see a RARP request from this module.

How can I check whether the module thinks it's got an IP address?

dpcsankey commented 6 months ago

There's a port on ipbus_ctrl, Got_IP_addr: OUT std_logic. In our designs I use this to control the 1 Hz blink LED, so I double-blink it whilst waiting for an IP address.
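For anyone following along, here is a minimal sketch of how such a double-blink could be built. It is not taken from ipbus-firmware: the entity name ip_status_led, the generic CLK_FREQ_HZ, its default value and the 8-slot timing are all assumptions made for illustration. Only the Got_IP_addr port comes from the comment above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper, NOT part of ipbus-firmware.
entity ip_status_led is
  generic (
    CLK_FREQ_HZ : positive := 31_250_000  -- assumed clock frequency, must be >= 8
  );
  port (
    clk         : in  std_logic;
    got_ip_addr : in  std_logic;  -- connect to Got_IP_addr of ipbus_ctrl
    led         : out std_logic
  );
end entity ip_status_led;

architecture rtl of ip_status_led is
  -- One second is split into 8 equal slots.
  constant SLOT_LEN : positive := CLK_FREQ_HZ / 8;
  signal tick_cnt : natural range 0 to SLOT_LEN - 1 := 0;
  signal slot     : unsigned(2 downto 0) := (others => '0');
  signal got_ip_s : std_logic_vector(1 downto 0) := "00";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Two-flop synchroniser, in case got_ip_addr comes from another clock domain.
      got_ip_s <= got_ip_s(0) & got_ip_addr;

      -- Slot counter: advances every SLOT_LEN clock cycles and wraps from 7 to 0.
      if tick_cnt = SLOT_LEN - 1 then
        tick_cnt <= 0;
        slot     <= slot + 1;
      else
        tick_cnt <= tick_cnt + 1;
      end if;

      if got_ip_s(1) = '1' then
        -- IP address assigned: plain 1 Hz blink (on during slots 0 to 3).
        led <= not slot(2);
      elsif slot = 0 or slot = 2 then
        -- Still waiting for an address: two short flashes per second.
        led <= '1';
      else
        led <= '0';
      end if;
    end if;
  end process;
end architecture rtl;
```

With this scheme a dark LED, a double-blink and a steady blink distinguish "no clock or lock", "waiting for an address" and "address assigned" without needing a network sniffer.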

mkrivda commented 6 months ago

The 1 Hz blink LED is off. When I access the FPGA via JTAG (VIVADO hw manager), I see that the FPGA is programmed. After "refresh fpga" there is no change. After "boot fpga from memory device" it starts to work.

mkrivda commented 6 months ago

I have tested the board that fails most frequently over 10 power on/off cycles. In 8 cycles the RARP request did not come; in 2 cycles it did.

We didn't see this behavior before.

dpcsankey commented 6 months ago

The default way of driving the '1 Hz' LED is the '1 Hz' signal coming out of the 'clocks' entity, ANDed with the locked signals from the MMCM(s) for the IPbus clock and the Ethernet clock (details here depend on which PHY you are using). So no blink would suggest no lock?
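As a rough illustration of that gating (a sketch, not the actual ipbus-firmware source; the entity and port names here are hypothetical):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper showing only the gating described above.
entity led_lock_gate is
  port (
    onehz      : in  std_logic;  -- 1 Hz signal from the clocks entity
    clk_locked : in  std_logic;  -- locked from the MMCM of the IPbus clock
    eth_locked : in  std_logic;  -- locked from the Ethernet clock / PHY
    led        : out std_logic
  );
end entity led_lock_gate;

architecture rtl of led_lock_gate is
begin
  -- The LED can only blink while both lock signals are high.
  led <= onehz and clk_locked and eth_locked;
end architecture rtl;
```

A permanently dark LED therefore means at least one of the lock inputs is low (or the 1 Hz source itself is not running).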

mkrivda commented 6 months ago

The clocks from the Si5345 are present. I don't understand why the MMCM is not able to lock.

Another hint: was there any change to the IP cores (MAC and PHY) for ver. 1.13? I have only upgraded the old versions of them using VIVADO 2020.2.

dpcsankey commented 6 months ago

Which PHY are you using? What version had you been running previously? git diff suggests that the release notes for the various releases are true and I see that Alessandro played with the clock constraints in v1.10, https://github.com/ipbus/ipbus-firmware/issues/107

mkrivda commented 6 months ago

I am using PHY ver. 16.2. Before it was 16.1.

I don't understand why ipbus_clk should be constrained separately if it is a generated clock. In my constraints I have only sysclk.

set_clock_groups -asynchronous \
    -group [get_clocks -include_generated_clocks sysclk] \
    -group [get_clocks -include_generated_clocks eth_refclk] \
    -group [get_clocks -include_generated_clocks {ddr4_0_inst0_c0_sys_clk_p ddr4_1_inst0_c0_sys_clk_p sys_clk_p_i}] \
    -group [get_clocks -include_generated_clocks onu_clk_rxref240] \
    -group [get_clocks -include_generated_clocks {CLKBC40 gth_ref_clk}] \
    -group [get_clocks -include_generated_clocks rxoutclk_out[0]_1] \
    -group [get_clocks -include_generated_clocks rxoutclk_out[0]_2] \
    -group [get_clocks -include_generated_clocks rxoutclk_out[0]_3]

dpcsankey commented 6 months ago

These constraints wouldn't affect the MMCM tho'.

My take so far, I don't think it's me! If you've got the standard 1 Hz blink gated with lock and you see no blink then this says no lock. So this points to the instantiation of the MMCMs? Also there haven't been changes in the default IP since release v1.5 (gig_eth_pcs_pma_gmii_to_sgmii_bridge)

mkrivda commented 6 months ago

It doesn't get mmcm_locked from gig_ethernet_pcs_pma_basex_156_25. The 156.25 MHz clock is present. I will try to re-generate the IP core.

mkrivda commented 4 months ago

I have re-generated the IP core gig_ethernet_pcs_pma_basex_156_25. I have also enabled DHCP instead of RARP. I still see the same problem. Do you know what else I can check?

dpcsankey commented 4 months ago

If I look at the ports on ipbus_ctrl it sounds like rst_macclk is never asserted. Looking at the ports on your clocks entity (clocks_usp_serdes?) this corresponds to rsto_125 never being asserted. On the old designs (say clocks_7s_extphy) this was forced by the rctr logic, but with clocks_usp_serdes that logic is only in the clk_ipb_b clock domain. Could this be a race condition with dcm_locked being too quick?

mkrivda commented 4 months ago

I have followed the signal back from the ipbus LED.

1. step
   locked <= clk_locked and eth_locked;
   eth_locked -> "0"
   clk_locked -> "1"

2. step
   eth_locked <= resetdone and mmcm_locked;
   resetdone -> "0"
   MMCM_locked -> "1"

resetdone is an output of gig_ethernet_pcs_pma_basex_156_25. The only reset input of gig_ethernet_pcs_pma_basex_156_25 is "rsti".
   rsti => rst_eth
   rsto_eth <= rst; -- ethernet startup reset (required!)
   rst <= nuke_d2 or not dcm_locked;

It seems the rst_eth reset is not performed. dcm_locked -> "1" (it was checked in step 1).

mkrivda commented 4 months ago

I have checked 2 signals (please see attached pictures):

rst_eth (yellow line)
rst125 (red line)

When IPbus is not working after reboot, rst125 stays at "1" permanently.

[Attached images: IPbus_dead_after_reboot, IPbus_ok_after_reboot]

mkrivda commented 4 months ago

rst_eth is sent to gig_ethernet_pcs_pma_basex_156_25, but resetdone is "0" and eth_locked is "0". eth_done <= (eth_done or eth_locked) then keeps the signal rst125 at "1" forever. The question is: why does gig_ethernet_pcs_pma_basex_156_25 sometimes not accept rst_eth?
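To make the stuck state easier to see, here is a minimal sketch of the sticky latch described above. Only the eth_done assignment is quoted in this thread; the entity name, the clock name clk125 and the line deriving rst125 from eth_done are assumptions for illustration and may differ from the real clocks entity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reduction of the logic discussed above, NOT the ipbus-firmware source.
entity eth_done_latch is
  port (
    clk125     : in  std_logic;
    eth_locked : in  std_logic;  -- resetdone and mmcm_locked from the PHY
    rst125     : out std_logic
  );
end entity eth_done_latch;

architecture rtl of eth_done_latch is
  signal eth_done : std_logic := '0';
begin
  process (clk125)
  begin
    if rising_edge(clk125) then
      -- Sticky flag: set once eth_locked has been seen high, never cleared here.
      eth_done <= eth_done or eth_locked;
      -- ASSUMPTION: the 125 MHz domain is held in reset until eth_done is set.
      rst125 <= not eth_done;
    end if;
  end process;
end architecture rtl;
```

If eth_locked never goes high because resetdone stays at "0", eth_done is never set and rst125 never releases, which matches the scope traces.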

dpcsankey commented 4 months ago

Can you remind me which chip you are targeting? Poking around it looks like the PHY is either failing to lock its MMCM or it's failing to complete its reset, so we fail to see locked come out of it, but we need to poke at that now.

mkrivda commented 4 months ago

I use Kintex Ultrascale, XCKU040...2E and XCKU060...2E.

mkrivda commented 4 months ago

mmcm_locked_out from gig_ethernet_pcs_pma_basex_156_25 was checked in step 2 (MMCM_locked -> "1"), so I guess it is the reset which is failing.

mkrivda commented 3 months ago

Is there anything else related to the reset of the PHY that should be checked?

mkrivda commented 3 months ago

Is there any progress on this issue?

mkrivda commented 1 month ago

I have implemented ipbus_icap_us_usp and ipbus_iprog_us_usp. Both use ICAPE3. Rebooting the FPGA via IPROG gives the same result as described above.

dpcsankey commented 2 weeks ago

I was wondering if I was seeing something similar with eFEX in Point 1, where we reboot OK doing DHCP negotiation but have some garbage on the network which results in alarms for the NetAdmins. We did packet sniffing with CERN IT this week and my problem looks different to yours. On mine it looks like the initial reset is missing on reload (MMCM stays locked???), but toggling the enable signal once I've determined the MAC address starts the DHCP negotiation, so as I said in #238 this is exactly what the enable port is for. On yours it looks like the gig_ethernet_pcs_pma_basex_156_25 doesn't come out of reset. All I can really suggest is to compare the implementation to https://docs.amd.com/r/en-US/pg047-gig-eth-pcs-pma, possibly generating the equivalent example design, and/or to add a state machine to kick rst_eth again if it sticks?
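A minimal sketch of what such a "kick rst_eth again" state machine could look like. This is an untested illustration, not something from ipbus-firmware or PG047: the entity name rst_eth_retry, the generics and their default values are hypothetical placeholders, and the required reset pulse width and a sensible timeout would have to be checked against the PHY documentation. It also assumes clk is free-running and independent of the PHY, that rst_in is synchronous to clk, and that PULSE_CYCLES does not exceed TIMEOUT_CYCLES.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical watchdog, NOT part of ipbus-firmware.
entity rst_eth_retry is
  generic (
    TIMEOUT_CYCLES : positive := 12_500_000;  -- placeholder: wait time before retrying
    PULSE_CYCLES   : positive := 16           -- placeholder: length of the extra reset pulse
  );
  port (
    clk         : in  std_logic;  -- free-running clock, not derived from the PHY
    rst_in      : in  std_logic;  -- original rst_eth
    eth_locked  : in  std_logic;  -- resetdone and mmcm_locked
    rst_eth_out : out std_logic   -- drive the rsti input of the PHY with this
  );
end entity rst_eth_retry;

architecture rtl of rst_eth_retry is
  type state_t is (ST_WAIT, ST_KICK, ST_DONE);
  signal state    : state_t := ST_WAIT;
  signal cnt      : natural range 0 to TIMEOUT_CYCLES - 1 := 0;
  signal locked_s : std_logic_vector(1 downto 0) := "00";
  signal kick     : std_logic := '0';
begin
  -- The original reset always passes through; the watchdog only adds pulses.
  rst_eth_out <= rst_in or kick;

  process (clk)
  begin
    if rising_edge(clk) then
      locked_s <= locked_s(0) & eth_locked;  -- two-flop synchroniser

      if rst_in = '1' then
        state <= ST_WAIT;
        cnt   <= 0;
        kick  <= '0';
      else
        case state is
          when ST_WAIT =>
            kick <= '0';
            if locked_s(1) = '1' then
              state <= ST_DONE;              -- PHY came out of reset
            elsif cnt = TIMEOUT_CYCLES - 1 then
              cnt   <= 0;
              state <= ST_KICK;              -- timed out: reset it again
            else
              cnt <= cnt + 1;
            end if;

          when ST_KICK =>
            kick <= '1';
            if cnt = PULSE_CYCLES - 1 then
              cnt   <= 0;
              state <= ST_WAIT;
            else
              cnt <= cnt + 1;
            end if;

          when ST_DONE =>
            kick <= '0';
            if locked_s(1) = '0' then        -- lock lost later: start watching again
              cnt   <= 0;
              state <= ST_WAIT;
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture rtl;
```

The idea is that rst_eth_out replaces rst_eth at the PHY reset input, so a PHY that ignores the first reset gets another one after the timeout instead of leaving rst125 stuck.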

mkrivda commented 2 weeks ago

I have recompiled the test_logic fw using VIVADO 2023.2 and the problem has disappeared. So I need to use the new version of VIVADO for the production fw as well.

mkrivda commented 1 week ago

All types of fw are ok after recompilation with VIVADO 2023.2.

ipbus / ipbus-firmware

RARP sometimes doesn't work after power-up or reboot of FPGA #227