Pi4 gisb stalls when using genet ethernet

cinaplenrek commented 4 years ago

I'm working on plan9 arm64 kernel support for the raspberry pi 4.

I'm observing gisb arbiter errors when operating the ethernet controller on the raspberry pi 4. In general, ethernet works fine on light traffic but heavy traffic causes sporadic 42 second long bus stalls. That is, any core accessing mmio registers on the gisb (genet, pcie) hangs and then continues. Even accesses to the gisb arbiter itself hang.

After such a stall, when i poll the gisb arbiter (as i don't know the INTID for the arbiter), the capture status register (0x7c4007f4) reads 0x3D and the bus address reported in the capture address registers ([0x7c4007ec] | [0x7c4007e8]<<32) reads as strange 12 bit bus addresses like: 0x2a0, 0xfe0, 0xea0, 0xee0, 0x6a0... (they all satisfy (x-32)%64 == 0)
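To make the address pattern above concrete, here is a minimal sketch of how the two capture words combine and how the (x-32)%64 == 0 observation can be checked. The function names are hypothetical and only illustrate the arithmetic in this comment; they are not taken from the Plan 9 or Linux sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the two 32-bit capture address words as described in the report:
 * low word read from 0x7c4007ec, high word read from 0x7c4007e8. */
static uint64_t gisb_capture_addr(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* True when addr lies exactly 32 bytes into a 64-byte block,
 * which is the same condition as (addr - 32) % 64 == 0. */
static bool is_offset32_in_64(uint64_t addr)
{
    return (addr & 63) == 32;
}
```

All five addresses listed above pass this check, while a deliberately invalid address such as 0x7dfffff0 does not.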

Normally, when the arm accesses invalid mmio registers on the bus i get an SError interrupt and the arbiter capture address registers contain a proper bus address above 0x7c000000. This is not the case here.

Is it possible for the arm to issue bus access to such addresses? And if so, how? If not, who could initiate such bus transactions?

Can someone tell me the INTID for the gisb arb error interrupt and how the interrupt can be enabled besides enabling it in the GIC? Maybe polling the arbiter results in these bogus addresses?

What i could figure out so far:

- stalls happen for both read and write accesses, and it doesn't matter which register is accessed
- hanging mmio write accesses seem to complete fine after the stall: i tested reading back the registers i write in the ethernet driver, and the new value had been applied
- the 42 second stall time is unrelated to the arbiter timeout value in the arbiter timer register 0x7c400008
- serializing all genet register accesses and placing barriers before and after has no effect
- linux works fine, and i made a trace of all mmio register writes to check for differences in genet initialization, but they match: http://felloff.net/usr/cinap_lenrek/pi4iodump.txt

Speculation:

the stall time of 42 seconds matches the time a 32 bit counter takes to wrap at 100MHz (2^32 / 100MHz is about 42.9 seconds)
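The wrap-time arithmetic behind this speculation can be sketched as follows. The helper name is hypothetical, and the 100MHz rate is only the guess made above, not a documented clock of the arbiter.

```c
#include <stdint.h>

/* Seconds for a free-running counter of the given width to wrap
 * at the given tick rate. Only valid for bits < 64. */
static double wrap_seconds(unsigned bits, double hz)
{
    return (double)(1ULL << bits) / hz;
}
```

For 32 bits at 100MHz this gives 2^32 / 10^8, which is just under 43 seconds.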

popcornmix commented 4 years ago

@P33M any ideas?

P33M commented 4 years ago

What size of accesses are you using to read/write GISB registers?

cinaplenrek commented 4 years ago

all 32 bit, naturally aligned.

-- cinap

P33M commented 4 years ago

Decoding the error capture status register (0x7c4007f4) - the error was not caused by a slave response timeout, the error was not caused by a slave response error, and the bus cycle was a read. Oddly, none of the 4 byte strobes in [5:2] are asserted (1 => not asserted). How can we have a read cycle with no byte strobes?

Does the status register ever change (i.e. is it the same for both read and write)?

Is Plan 9 using the firmware clock setup or have any modifications been made to any of the clock generators?

Edit: also, can you capture the GISB master source register at 0x7c4007f8? It's a bitmask of who generated the address that generated the fault.

cinaplenrek commented 4 years ago

the status register always reads 0x3D for reads.

when i deliberately do a mmio write of zero to 0x7dfffff0, the gisb status register changes to 0x3F and the proper bus address is reported.

the clock manager registers have not been touched. however, we issue firmware request SetClkSpd (0x00038002) with the value returned by firmware request GetClkMax (0x00030004) with clock id 3 (ClkArm) on boot.

in timer initialization, we write 0 to 0x40000000 to switch to the osc clock and set up the prescaler for 1MHz by writing ((1MHz<<32) / 54MHz) & ~1 == 0x4bda12e to register 0x40000008.
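As a check on the prescaler value quoted above, here is a small sketch of the same calculation. The function name is hypothetical; the formula is the one given in this comment.

```c
#include <stdint.h>

/* Prescaler value as described above: the ratio target/source scaled
 * by 2^32, with the lowest bit cleared. */
static uint32_t timer_prescaler(uint64_t target_hz, uint64_t source_hz)
{
    return (uint32_t)(((target_hz << 32) / source_hz) & ~1ULL);
}
```

With a 1MHz target and the 54MHz oscillator, this reproduces 0x4bda12e.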

cinaplenrek commented 4 years ago

the gisb master source register 0x7c4007f8 always reads 1. sorry for not mentioning it.

cinaplenrek commented 4 years ago

is there anything else i can try to rule out potential problem sources? the clock generators were mentioned... i have core_freq=250 in config.txt for the mini uart console to work. are there any config.txt properties i can try to change to rule out clock or power issues?

pelwell commented 4 years ago

Try with core_freq=500 and core_freq_min=500 - 250 is possibly too low.

cinaplenrek commented 4 years ago

with core_freq=500 and core_freq_min=500 in config.txt, the mini uart breaks as expected. so i enabled pcie and xhci to use usb keyboard to type commands into the machine. but the gisb errors persist. i also removed the setclkrate firmware request so the arm now runs at its initial 600MHz.

any other ideas?

pelwell commented 4 years ago

I suppose suggesting using Linux is not helpful?

cinaplenrek commented 4 years ago

done some experiments. it seems the byte strobe bits [5:2] in the status register are just not inverted?

read 8 bit addr=0x7dfffff0 status=0x05 strobe=0b0001
read 8 bit addr=0x7dfffff1 status=0x09 strobe=0b0010
read 8 bit addr=0x7dfffff2 status=0x11 strobe=0b0100
read 8 bit addr=0x7dfffff3 status=0x21 strobe=0b1000
read 16 bit addr=0x7dfffff0 status=0x0d strobe=0b0011
read 16 bit addr=0x7dfffff2 status=0x31 strobe=0b1100
read 32 bit addr=0x7dfffff0 status=0x3d strobe=0b1111

also, interestingly, reading bus address 0x7dfffff0 with the dma controller yields master source 0x40 and status 0x103d.

so master source 0x1 is the arm and 0x40 is dma controller?
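The experiments above suggest a simple decoding of the capture status register, sketched below. It only uses what this thread shows: bit 1 distinguishes writes (0x3F) from reads (0x3D), and bits [5:2] are active-high byte strobes. The struct and function names are hypothetical, and the meaning of the remaining bits (bit 0, and bit 12 seen in the dma case) is not established here, so they are left undecoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of the capture status register that the experiments establish. */
struct gisb_capture {
    bool write;        /* bit 1: set for writes, clear for reads */
    unsigned strobes;  /* bits [5:2]: byte strobes, 1 = asserted */
};

static struct gisb_capture gisb_decode_status(uint32_t status)
{
    struct gisb_capture c;
    c.write = (status >> 1) & 1;
    c.strobes = (status >> 2) & 0xf;
    return c;
}
```

Under this decoding, the 0x3D seen during the stalls is a plain 32 bit read with all four byte strobes asserted.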

P33M commented 4 years ago

The only other thing I can think of would be the cacheability of the address space in question - what page protection bits are being used?

cinaplenrek commented 4 years ago

spot on.

the PTEs for the mmio regions were missing the XN bits.

apparently the chip was doing speculative instruction fetches from the device mappings...

case closed.
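For readers hitting the same problem, here is a minimal sketch of the fix described above: marking device mappings execute-never so the core does not fetch instructions from them. The bit positions are the PXN (bit 53) and UXN/XN (bit 54) attributes of AArch64 stage 1 page and block descriptors; the macro and function names are hypothetical and this is not the actual Plan 9 change.

```c
#include <stdint.h>

/* Execute-never attributes in an AArch64 stage 1 page/block descriptor. */
#define PTE_PXN (1ULL << 53)  /* privileged execute-never */
#define PTE_UXN (1ULL << 54)  /* unprivileged execute-never (XN) */

/* Return the descriptor with both execute-never bits set,
 * as wanted for any mmio (device memory) mapping. */
static uint64_t pte_make_noexec(uint64_t pte)
{
    return pte | PTE_PXN | PTE_UXN;
}

/* True when the descriptor forbids instruction fetch at all exception levels. */
static int pte_is_noexec(uint64_t pte)
{
    return (pte & (PTE_PXN | PTE_UXN)) == (PTE_PXN | PTE_UXN);
}
```

A kernel could apply something like pte_make_noexec() wherever it builds descriptors for device regions, and assert pte_is_noexec() on them in debug builds.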

raspberrypi / firmware

Pi4 gisb stalls when using genet ethernet #1219