uwcms / cms-calo-layer1

Xilinx Microblaze Projects for RCT Upgrade

Ensure we can read out link status from the CTP6 FE at Chamberlin #4

Closed ekfriis closed 11 years ago

ekfriis commented 11 years ago

You need to:

  1. Compile the softipbus TCP server for Petalinux. See the README here [1] for details.
  2. Upload the softipbus-forward server to /tmp on the CTP6 backend. Get the appropriate IP address and ssh password (user is root) from Jes.
  3. SSH in and run /tmp/softipbus-forward on the BE (leave it running)
  4. Compile ctp6_fe_uart_ipbus and upload it (make upload in its directory)
  5. Use ctp6commander [2] CLI to check the link status
  6. Document these instructions in the README.md

[1] https://svnweb.cern.ch/trac/cactus/browser/trunk/cactuscore/softipbus/README.md
[2] https://github.com/uwcms/ctp6commander

nwoods commented 11 years ago

Are you sure that makefile is actually correct? I changed it to not look like that and I haven't pushed my changes upstream yet.

On Tue, Oct 15, 2013 at 3:45 PM, Evan K. Friis notifications@github.com wrote:

explain the difference between softipbus and softipbus-forward

softipbus and softipbus-forward are separate programs.

softipbus runs a TCP server which reads out the local memory - so this is what needs to run to read memory from the backend, if we want to. We have not needed this as a use-case to date. Sidebar: once we get the ZYNQ chip, we will only need this program, since Mathias will do magic to make the front end appear as back-end memory.

softipbus-forward runs a TCP server (on the backend), and then forwards the packets across the UART, to the front end. The UART devices forwarded over are defined in the softipbus makefile [1] - the current defaults should be correct.

Note that you also need to run the standalone program (ctp6_fe_uart_ipbus) on the front end [2], which reads the data sent on the UART, parses it, and sends a response back along the UART. The backend then sends this back over TCP.
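To make the data path concrete, here is a minimal Python sketch of one request/response round trip as described above. It is only an illustration under stated assumptions: the real softipbus-forward is not written in Python, the UART is stood in for by a local socket pair, and `forward_once` and `fake_front_end` are hypothetical names, not functions from the softipbus sources.

```python
import socket
import threading


def forward_once(tcp_conn, uart, bufsize=4096):
    """Relay one request from the TCP client to the UART and the reply back.

    Returns the number of reply bytes sent to the client (0 if the client
    closed the connection before sending anything).
    """
    request = tcp_conn.recv(bufsize)
    if not request:
        return 0
    uart.sendall(request)        # backend -> front end
    reply = uart.recv(bufsize)   # front end -> backend
    tcp_conn.sendall(reply)      # backend -> TCP client
    return len(reply)


def fake_front_end(uart, bufsize=4096):
    """Stand-in for the front-end program: answer with the request reversed."""
    request = uart.recv(bufsize)
    uart.sendall(request[::-1])


def demo():
    # Socket pairs stand in for the TCP connection and the UART link.
    client, backend_tcp = socket.socketpair()
    backend_uart, fe_uart = socket.socketpair()
    fe = threading.Thread(target=fake_front_end, args=(fe_uart,))
    fe.start()
    client.sendall(b"ipbus-request")
    sent = forward_once(backend_tcp, backend_uart)
    fe.join()
    reply = client.recv(4096)
    for s in (client, backend_tcp, backend_uart, fe_uart):
        s.close()
    return sent, reply


if __name__ == "__main__":
    print(demo())
```

Note that `forward_once` blocks on `uart.recv` until the front end answers, which is the same kind of wedge discussed later in this thread when one end is not running.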

if we want to include these in the petalinux flash, how do we do that?

ask Jes. at some point she added it, but I'm not sure what happened to that.

[1] https://svnweb.cern.ch/trac/cactus/browser/trunk/cactuscore/softipbus/Makefile#L35
[2] https://github.com/uwcms/cms-calo-layer1/tree/master/ctp6_fe_uart_ipbus

nwoods commented 11 years ago

Mathias guessed that the problem we were having was due to a bad FPGA configuration and reflashed with the newest available. Unfortunately, around the same time he did that (but before, so we know the flash isn't responsible), JTAG mysteriously stopped working from Ayinger. The light on the JTAG box is green, so the box thinks it's connected, and it worked fine from Mathias's laptop, so the problem isn't on the card, so it must be an issue with Ayinger. We tried to find Jes to see if she could figure anything out, but she wasn't in. Any idea what sort of thing could cause this?

nwoods commented 11 years ago

We got JTAG working on Tapas's laptop (it still needs to be fixed on Ayinger) and tried to read the links. When we run read_uart and payload.elf, xmd now spams the error message

| DEBUG | Partial transaction

a few times a second. On the other hand, the stop command now actually stops it with no problems. cli.py fails in the same way it always did. softipbus-forward no longer produces an error message when we try to run cli.py.

ekfriis commented 11 years ago

@jtikalsky is the only one who knows the black art of fixing the JTAG; I asked her to document it in #23.

If you are getting any partial transaction on the FE, that tells me you are getting something transmitted across the UART successfully. Can you try reducing the LOG_LEVEL to 2 (INFO only) and running again? It may be that it times out because writing the DEBUG output to the XMD console is too slow (the bane of debugging on these devices).

I don't get your comment about the Makefile [1], can you clarify?

If the system is not in a "clean" state, things can get wedged - i.e. if softipbus-forward is still waiting for a response it will never get. (Eventually we should add some type of timeout). Can you try:

  1. stop softipbus-forward with Ctrl-C
  2. stop,rst,run the FE using XMD
  3. restart softipbus-forward
  4. try cli.py again?

If it fails, please post the console output of each of the three elements.

[1] https://github.com/uwcms/cms-calo-layer1/issues/4#issuecomment-26376140

ekfriis commented 11 years ago

I got our cable working, I put the steps that I followed in #23

ekfriis commented 11 years ago

Hi @nwoods, can you make sure any changes you have made are committed and pull-requested, and then document the most-correct steps to date? Pam can try doing it at 904.

nwoods commented 11 years ago

I'm going to attempt to fix the blocking error in the FE code. Is the loop in question in read_uart or in payload?

ekfriis commented 11 years ago

Which blocking error? And what is read_uart or payload? read_uart is an XMD command - payload is the name of the generated file, right?

nwoods commented 11 years ago

Ah, that answers my question. I thought that read_uart was something that you wrote. By blocking error, I mean the (possible) problem that the FE software is getting stuck in a loop waiting for the end of a fake/broken/erroneous packet. Mathias said he talked to you about this?

ekfriis commented 11 years ago

Yes - but don't fix this now! This only shows up when something is not started correctly or one end crashes, which means you have to restart all the pieces [see this comment https://github.com/uwcms/cms-calo-layer1/issues/4#issuecomment-26398597] before things will work again.

You should be able to make it work at least once (i.e. by getting everything going so no bad packets are sent). The timeout in the loop is a fix for long-term stability. Do not try to add this feature until you figure out why the current code is broken - we know the whole chain worked in August.

ekfriis commented 11 years ago

(That being said, adding this feature will be an excellent improvement and a very good opportunity to get your hands dirty, but let's get it working first)
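For when that improvement does get picked up, the idea of a timeout in the read loop can be sketched as follows. This is a Python illustration only, not the FE code (which is a standalone program running on the front end); `read_exact` and `PacketTimeout` are hypothetical names, and a socket pair stands in for the UART.

```python
import socket
import time


class PacketTimeout(Exception):
    """Raised when a full packet does not arrive before the deadline."""


def read_exact(sock, nbytes, timeout_s):
    """Read exactly nbytes from sock, giving up after timeout_s seconds.

    A truncated or bogus packet then produces an error the caller can
    recover from, instead of leaving the reader waiting forever.
    """
    deadline = time.monotonic() + timeout_s
    buf = bytearray()
    while len(buf) < nbytes:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise PacketTimeout("got %d of %d bytes" % (len(buf), nbytes))
        sock.settimeout(remaining)
        try:
            data = sock.recv(nbytes - len(buf))
        except socket.timeout:
            raise PacketTimeout("got %d of %d bytes" % (len(buf), nbytes)) from None
        if not data:
            raise ConnectionError("peer closed the link")
        buf += data
    return bytes(buf)
```

On a timeout the caller would typically discard the partial data and go back to waiting for the start of the next packet, so a single bad packet no longer requires restarting every piece of the chain.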

nwoods commented 11 years ago

With the most recent versions of everything, the link test fails with the message

nwoods@ayinger /afs/hep.wisc.edu/cms/nwoods/ctp6commander  master$ python cli.py status
Traceback (most recent call last):
  File "cli.py", line 167, in <module>
    commands[args.command](hw, args)
  File "cli.py", line 48, in do_status
    status_flags = api.status(hw, args.links)
  File "/afs/hep.wisc.edu/cms/nwoods/ctp6commander/api.py", line 156, in status
    hw.dispatch()
uhal._core.exception: 
I'm terribly sorry to have to tell you this, but it appears that there was an exception:
 * Exception type: uhal::exception::TcpConnectionFailure
 * Description: Exception class to handle the case where the TCP connection was refused or aborted.
 * Additional Information:
   > ASIO reported an error: Connection refused
 * Exception occured in the same thread as that in which it was caught (0x137dce50)
 * Exception constructed at time:              2013-10-16 12:39:22.113391
 * Exception's what() function called at time: 2013-10-16 12:39:22.113639

softipbus-forward says nothing, even though it's compiled with log level 3.

Meanwhile, when payload.elf is run, it gives the message

1970-01-01 00:00:00 | DEBUG | Setup interrupts okay
1970-01-01 00:00:00 | INFO  | Serving memory.
1970-01-01 00:01:44 | INFO  | Start size: 0
1970-01-01 00:00:16 | DEBUG | Partial transaction
1970-01-01 00:00:16 | DEBUG | Partial transaction
1970-01-01 00:00:16 | DEBUG | Partial transaction
[... forever]

It spams that partial transaction message forever regardless of whether or not softipbus is running.
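Since the traceback ends in "Connection refused", one quick check is whether anything accepts TCP connections at the address the client is pointed to. The sketch below assumes that address is 127.0.0.1 port 60002 (the port used in the ssh forwarding command quoted in this thread); `port_is_open` is a hypothetical helper, not part of ctp6commander or uHAL.

```python
import socket


def port_is_open(host, port, timeout_s=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address; adjust to whatever the client is configured to use.
    print(port_is_open("127.0.0.1", 60002))
```

If this prints False, the failure happens before any IPbus traffic is exchanged, which would also be consistent with softipbus-forward logging nothing.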

ekfriis commented 11 years ago

softipbus-forward should say something with log-level three - please see if you can figure out why it isn't.

nwoods commented 11 years ago

Looking at the Makefile for softipbus (which is the only thing in there I ever changed, IIRC), it looks like we need separate versions for 904 and Chamberlin, so unless you think my version will work properly there, I'm going to hold off on committing those changes for now.

ekfriis commented 11 years ago

That's fine. What you could do (just for completeness) is copy and paste the output of svn diff in softipbus.

nwoods commented 11 years ago

I'm not 100% sure, but there may be a cleaner fix to the python version issue contained in the file cactus/trunk/cactuscore/uhal/pycohal/MANIFEST.in

ekfriis commented 11 years ago

what python version issue?

nwoods commented 11 years ago

The 2.4 vs 2.6 issues, that had us bending over backwards to install new versions of python, set mysterious python path variables, etc.

ekfriis commented 11 years ago

Ah cool, please post another issue/PR with this. In any case, as long as it works, don't worry too much about installing a good python, it will be obviated when we upgrade to SLC6.

jtikalsky commented 11 years ago

Apologies, I've been out of the office most of the week.

0.0.0.0 is not an IP you can connect to. You need to use something else. Perhaps you meant 127.0.0.1?

On 10/14/2013 05:04 PM, nwoods wrote:

Maybe the -vvv option on ssh gets us somewhere. When I did that, the error message became

~ # /tmp/softipbus-forward
debug1: Connection to port 60002 forwarding to 0.0.0.0 port 60002 requested.
debug2: fd 9 setting TCP_NODELAY
debug2: fd 9 setting O_NONBLOCK
debug3: fd 9 is O_NONBLOCK
debug1: channel 3: new [direct-tcpip]
channel 3: open failed: connect failed:
debug1: channel 3: free: direct-tcpip: listening port 60002 for 0.0.0.0 port 60002, connect from 127.0.0.1 port 46888, nchannels 4
debug3: channel 3: status: The following connections are open:
  2 client-session (t4 r0 i0/0 o0/0 fd 6/7 cfd -1)
  3 direct-tcpip: listening port 60002 for 0.0.0.0 port 60002, connect from 127.0.0.1 port 46888 (t3 r-1 i0/0 o0/0 fd 9/9 cfd -1)
debug3: channel 3: close_fds r 9 w 9 e -1 c -1

I don't know what that means, but I'm sure someone does...

I connected to the CTP with the command

ssh -vvv -L 60002:0.0.0.0:60002 root@192.168.1.31

jtikalsky commented 11 years ago

"Simple"... I suppose it depends on what you consider simple. You'd need to take the system.xml and follow the board bringup guide, at least as far as producing the petalinux_bsp (you shouldn't need to produce fs-boot itself, though it won't hurt).

There are some.. oddities in that process, issues that need to be corrected along the way. So I think I'd actually say it's not particularly simple. I would offer to do it for you but I'm not sure I can get an X-forwarded connection through to 904 properly.

Unfortunately I'd have to go through it again fully, locally, in order to produce proper instructions.

On 10/14/2013 07:42 AM, Evan K. Friis wrote:

@nwoods

cc @dabelnap @jtikalsky

Hi, we are now working on this same task at 904. @nwoods, can you send us the modifications to the makefile to make it work? @jtikalsky, is there a simple way to build the petalinux config for microblaze so the "correct" way of building works as well?

nwoods commented 11 years ago

We've been using 0.0.0.0 for quite a while (always from Ayinger, which might change it) and it's seemed to work for ssh, etc. The TCP parts of this seem to be working, so I doubt that's the problem. Happy to be wrong if it gets fixed, of course...

jtikalsky commented 11 years ago

I can't say you're WRONG, but connecting to 0.0.0.0 is really Not something that should work.

0.0.0.0 is the special case address for 'listen on all addresses'. It's not an address a system should normally be expected to respond to.

If you want to forward port 60002 on your system to 60002 on the card, 127.0.0.1 is the proper IP address for this. "-L60002:127.0.0.1:60002" translates to "Take connections on 60002 locally, forward them to the remote endpoint, then connect them to 127.0.0.1 (localhost), port 60002". I've no idea why 0.0.0.0 worked for you, honestly.
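The distinction between listening on 0.0.0.0 and connecting to 127.0.0.1 can be shown with a small Python sketch. The function names are hypothetical and the port is chosen by the OS; nothing here is taken from softipbus-forward.

```python
import socket


def listen_on_all_interfaces(port=0):
    """Bind a TCP server to 0.0.0.0, i.e. every local address."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv


def demo():
    srv = listen_on_all_interfaces()
    port = srv.getsockname()[1]
    # Clients connect to a concrete address such as loopback, not 0.0.0.0.
    client = socket.create_connection(("127.0.0.1", port), timeout=2.0)
    conn, peer = srv.accept()
    bound_addr = srv.getsockname()[0]
    for s in (client, conn, srv):
        s.close()
    return bound_addr, peer[0]


if __name__ == "__main__":
    print(demo())
```

Some systems, Linux among them, treat a connect to 0.0.0.0 as a connect to the local host, which might be why it appeared to work; that is an assumption and has not been checked on the card.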

ekfriis commented 11 years ago

Hmm, maybe this is the root of all our problems :) @nwoods, maybe 127.0.0.1 will help. Although I think that this probably isn't the type of thing to fail half-way.

Jes, we can compile correctly with a non-specific install of petalinux now. We just hardcode the path to mb-gcc. I think this is fine, even for the long term, so no painful X-forwarded-over-the-atlantic :).

Thanks

Evan

tsarangi commented 11 years ago

@ekfriis 0.0.0.0 isn't the issue. Changing it to 127.0.0.1 produces the same error we have been discussing.

ekfriis commented 11 years ago

I think we are reaching a conclusion in the monster successor of this monster thread, see https://github.com/uwcms/cms-calo-layer1/issues/26#issuecomment-26906699