uwcms / cms-calo-layer1

Xilinx Microblaze Projects for RCT Upgrade
2 stars 8 forks source link

ASIO reported a Timeout in TCP callback #61

Open isobelojalvo opened 10 years ago

isobelojalvo commented 10 years ago

Hi guys (@jtikalsky @mathiasblake @ekfriis @tsarangi @dabelknap), I spent a few hours debugging this today and have come up with no solutions. I am following twiki instructions and previous steps for building the backend server binary and starting a new server appear to be working successfully. I have tried reverting to an old version of previously working bitfiles/code that tapas had saved in a tarball to no success. Any suggestions on other things I can do to debug would be fantastic. I have read through most of the issue-logs on github at this point... and nothing jumped out at me to check. (Error message below.)

(from twiki) /usr/bin/python cli.py --connection ctp6_connections.xml --verbose status 1 5 12

And get this error: /usr/bin/python cli.py --connection ctp6_connections.xml --verbose status 1 5 12 INFO:main:Setting up connection: file://ctp6_connections.xml 19-02-14 15:28:58.372356 [0x6002290] ERROR - ASIO reported a Timeout in TCP callback Traceback (most recent call last): File "cli.py", line 170, in ? commands[args.command](hw, args) File "cli.py", line 49, in do_status status_flags = api.status(hw, args.links) File "/afs/hep.wisc.edu/cms/ojalvo/tapasCTP6/onsoft_23Oct2013/ctp6commander/api.py", line 156, in status hw.dispatch() RuntimeError: St9exception

For the record, ctp6_connections.xml is: <?xml version="1.0" encoding="UTF-8"?>

ekfriis commented 10 years ago

Hi Isobel,

We saw something similar this summer, see here [1]. The issue was the CTP FE was running out memory and crashing, so it would never send the response packet back to the BE, which would wait forever for the expected response. The packet would never get back to the PC, since the BE was still waiting for it, hence the time out you see. You can probably see in the console output on the BE petalinux that the softipbus-forward is waiting for a response from the FE.

I'm not sure if this is the problem you have. I suggest to trace the program flow on the FE using XMD debugger. Austin knows how to do this. If it goes off into crazy address space land you know something bad has happened and the uBlaze has crashed.

Good luck! Evan

[1] https://github.com/uwcms/cms-calo-layer1/issues/26#issuecomment-26906699

isobelojalvo commented 10 years ago

Hi @ekfriis, Thanks for getting back to me so quickly. Yes, I do see that the BE petalinux is waiting for a response. Since Austin is probably asleep is this the manual on the XMD debugger? http://www.ece.unm.edu/fpga/carla/Lab_microprocessors/XMD%20commands%20REFERENCE%20V.0.2.pdf

Was this the final solution?: reducing size of the circular buffers (softipbus/include/circular_buffer.h) from 512 to 256. I think we may have had an out-of-memory problem on the CTP6 FE.

ekfriis commented 10 years ago

Yes, that's the manual.

In a nutshell - look at the .S file (the annotated assembly. it has the PC (program counter location)). You can run the front end using XMD, the stp it, then it will tell you the PC. You can check in the disassembled source to see if you are in a weird place (i.e. not in a normal loop in the main program).

Yes, that was the final solution. You could try reducing it ever further.

On Wed, Feb 19, 2014 at 4:26 PM, isobelojalvo notifications@github.comwrote:

Hi @ekfriis https://github.com/ekfriis, Thanks for getting back to me so quickly. Yes, I do see that the BE petalinux is waiting for a response. Since Austin is probably asleep is this the manual on the XMD debugger? http://www.ece.unm.edu/fpga/carla/Lab_microprocessors/XMD%20commands%20REFERENCE%20V.0.2.pdf

Was this the final solution?: reducing size of the circular buffers (softipbus/include/circular_buffer.h) from 512 to 256. I think we may have had an out-of-memory problem on the CTP6 FE.

Reply to this email directly or view it on GitHubhttps://github.com/uwcms/cms-calo-layer1/issues/61#issuecomment-35569145 .

dabelknap commented 10 years ago

I had IO_BUFFER_SIZE set to 128 when I tried it.

isobelojalvo commented 10 years ago

After changing the port number I have had some limited success:

/usr/bin/python cli.py --connection ctp6_connections.xml --verbose status 1 5 12

INFO:main:Setting up connection: file://ctp6_connections.xml

Flag 1 5 12

LossSync E E E PLLk OK E E E Overflow * * * ErrDetect E E E Underflow * * E INFO:main:done.

I used commit **, and changed all the port variables to 60003. For the record, I also changed the buffer size to 128, I will switch back to 256 tomorrow and see what happens. It seems 60001 on ayinger was in use by Jes as of yesterday. That doesn't explain why we couldn't get these this output over the last few months.

@jtikalsky: I'm lost and maybe this is due to my poor understanding of the Server-Client relationship but the local address on the CTP6 is set to use port 60001 why would that be an issue from ayinger?

Note to future users: You must change the port number in softipbus, run make clean, re run all make processes and then rerun 'make upload2ctp' and 'make forwardonctp', you must also change ctp6_connections.xml to 60003. Also to check if the port is busy or free ssh using root on the the ctp linux and run: 'netstat -tulpn'; hitting Ctrl+C to end the cli.py causes the port to remain open meaning that you can't use it later. To kill it in a mean way you can find the PID and kill the process. Make sure the process you are killing is the one you really want to kill!!

**c9500425da3dd90782896b297e8cad114f34a2a7

jtikalsky commented 10 years ago

@jtikalsky https://github.com/jtikalsky: I'm lost and maybe this is due to my poor understanding of the Server-Client relationship but the local address on the CTP6 is set to use port 60001 why would that be an issue from ayinger?

This is due to the way you were connecting to it. I believe your system is set up to forward the port, so you can connect from your local system to the CTP6 by forwarding data through ayinger. This method might be using the 60001 port on ayinger. I havent read the code and don't know the details of the exact configuration however. If you ran this on ayinger, the forward would be a local loopback, in that case, and would occupy the port for no real gain.
dabelknap commented 10 years ago

@isobelojalvo When you changed the cbuffersize to 128, did you just rerun make? You need to run make clean first. Otherwise, make doesn't recognize that include/circular_buffer.h has changed. The change causes a unittest failure in tests/circular_buffer.c, since it assumes that the buffer can hold at least 200. Also, it causes the server to print

DEBUG: Processing full transaction
DEBUG: Waiting for result

over and over again when cli.py is run.

isobelojalvo commented 10 years ago

@dabelknap Hmm, I thought so but now I can't remember, if this is the error you're seeing then it's possible I didn't run make clean. In any case, run with the buffer size at 256.

dabelknap commented 10 years ago

I tried both reseating the 904 CTP6 and moving it to a new slot in the crate. I tried using ports 60001 and 60003 in both cases. I still get the same cli.py error as before

23-02-14 19:18:51.133612 [0x3d59e30] WARNING - Closing socket for URI "ipbustcp-2.0://0.0.0.0:60003" since deadline has passed
23-02-14 19:18:51.133761 [0x3d59e30] ERROR - Timeout (1000 milliseconds) occurred for receive from TCP server with URI: ipbustcp-2.0://0.0.0.0:60003
23-02-14 19:18:51.133869 [0x3d37010] WARNING - Closing Socket since exception detected.

I have returned the CTP6 to its original slot, and reset it.

@isobelojalvo It turns out I didn't need to change the IP address when I moved the card to a new slot.

isobelojalvo commented 10 years ago

Ok, well it's late there so maybe you won't have time today but the next thing I'd try is re-uploading the bitfile (using twiki instructions). Watch for any funny outputs. If I find any time today I will post what the outputs "should" look like.

If that still doesn't work I'd try checking out the head of the trigger code and trying again... I will try to test your tag 7b9cca31fc2e7a926997b1b813ac70f9008434ac in Chamberlin today as well.