Closed ekfriis closed 11 years ago
Are you sure that makefile is actually correct? I changed it to not look like that and I haven't pushed my changes upstream yet.
On Tue, Oct 15, 2013 at 3:45 PM, Evan K. Friis notifications@github.com wrote:
explain the difference between softipbus and softipbus-forward
softipbus and softipbus-forward are separate programs.
softipbus runs a TCP server which reads out the local memory - so this is what needs to run to read memory from the backend, if we want to. We have not needed this as a use-case to date. Sidebar: once we get the ZYNQ chip, we will only need this program, since Mathias will do magic to make the front end appear as back-end memory.
softipbus-forward runs a TCP server (on the backend), and then forwards the packets across the UART, to the front end. The UART devices forwarded over are defined in the softipbus makefile [1] - the current defaults should be correct.
Note that you also need to run the standalone program (ctp6_fe_uart_ipbus) on the front end [2], which reads the data sent on the UART, parses it, and sends a response back along the UART. The backend then sends this back over TCP.
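As a rough illustration of the forwarding pattern described above (not the actual softipbus-forward code, which is C), here is a minimal Python sketch: a TCP server accepts a request, hands the bytes to a callable standing in for the UART round trip, and writes the reply back to the TCP client. The names serve_once, fake_uart_exchange and demo_roundtrip are hypothetical, and the stub simply reverses the bytes rather than speaking IPbus.

```python
import socket
import threading


def fake_uart_exchange(request: bytes) -> bytes:
    """Hypothetical stand-in for the UART round trip to the front end."""
    return request[::-1]


def serve_once(server_sock: socket.socket, exchange=fake_uart_exchange) -> None:
    """Accept one TCP client, forward one request, send back one reply."""
    conn, _addr = server_sock.accept()
    with conn:
        request = conn.recv(4096)
        if request:
            conn.sendall(exchange(request))


def demo_roundtrip(payload: bytes) -> bytes:
    """Run the forwarder on an ephemeral local port and send one request."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(payload)
            reply = client.recv(4096)
    finally:
        worker.join(timeout=5)
        server.close()
    return reply
```

In the real chain the exchange step is where the backend writes to the UART and waits for the front end's response, which is also where things can hang if the other side never answers.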
if we want to include these in the petalinux flash, how do we do that?
Ask Jes. At some point she added it, but I'm not sure what happened to that.
[1] https://svnweb.cern.ch/trac/cactus/browser/trunk/cactuscore/softipbus/Makefile#L35
[2] https://github.com/uwcms/cms-calo-layer1/tree/master/ctp6_fe_uart_ipbus
Mathias guessed that the problem we were having was due to a bad FPGA configuration and reflashed with the newest available. Unfortunately, around the same time he did that (but before, so we know the flash isn't responsible), JTAG mysteriously stopped working from Ayinger. The light on the JTAG box is green, so the box thinks it's connected, and it worked fine from Mathias's laptop, so the problem isn't on the card, so it must be an issue with Ayinger. We tried to find Jes to see if she could figure anything out, but she wasn't in. Any idea what sort of thing could cause this?
We got JTAG working on Tapas's laptop (it still needs to be fixed on Ayinger), and tried to read the links. Now when we run read_uart and payload.elf, xmd spams the error message
| DEBUG | Partial transaction
a few times a second. On the other hand, the stop command now actually stops it with no problems. cli.py fails in the same way it always did. softipbus-forward no longer produces an error message when we try to run cli.py.
@jtikalsky is the only one who knows the black art of fixing the JTAG; I asked her to document it in #23.
If you are getting any partial transaction on the FE, that tells me you are getting something transmitted across the UART successfully. Can you try reducing the LOG_LEVEL to 2 (INFO only) and running again? It may be that it times out because writing the DEBUG output to the XMD console is too slow (the bane of debugging on these devices).
I don't get your comment about the Makefile [1], can you clarify?
If the system is not in a "clean" state, things can get wedged - i.e. if softipbus-forward is still waiting for a response it will never get. (Eventually we should add some type of timeout.) Can you try:
If it fails, please post the console output of each of the three elements.
[1] https://github.com/uwcms/cms-calo-layer1/issues/4#issuecomment-26376140
I got our cable working, I put the steps that I followed in #23
Hi @nwoods, can you make sure any changes you have made are commited and pull-requested, and then document the most-correct steps to date? Pam can try doing it at 904.
I'm going to attempt to fix the blocking error in the FE code. Is the loop in question in read_uart or in payload?
Which blocking error? And what is read_uart or payload? read_uart is an XMD command - payload is the name of the generated file, right?
Ah, that answers my question. I thought that read_uart was something that you wrote. By blocking error, I mean the (possible) problem that the FE software is getting stuck in a loop waiting for the end of a fake/broken/erroneous packet. Mathias said he talked to you about this?
Yes - but don't fix this now! The symptom of this is things not being started correctly or one end crashing, which means you have to restart all the pieces [see this comment https://github.com/uwcms/cms-calo-layer1/issues/4#issuecomment-26398597] before things will work again.
You should be able to make it work at least once (i.e. by getting everything going so no bad packets are sent). The fix of adding a timeout to the loop is for long-term stability. Do not try to add this feature until you figure out why the current code is broken - we know the whole chain worked in August.
(That being said, adding this feature will be an excellent improvement and a very good opportunity to get your hands dirty, but let's get it working first.)
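For whenever that long-term fix is tackled, here is a minimal Python sketch of the idea: a receive loop that gives up after a deadline instead of waiting forever for the end of a packet. It is only an illustration. The real FE code is a standalone C program, and read_packet, read_byte and the fixed packet_len framing are assumptions made for the example.

```python
import time


class PacketTimeout(Exception):
    """Raised when a complete packet does not arrive before the deadline."""


def read_packet(read_byte, packet_len, timeout_s=1.0, clock=time.monotonic):
    """Collect packet_len bytes, giving up after timeout_s seconds.

    read_byte is a hypothetical non-blocking reader that returns an int
    (0-255) when a byte is available and None otherwise.
    """
    deadline = clock() + timeout_s
    buf = bytearray()
    while len(buf) < packet_len:
        if clock() > deadline:
            # Drop the partial packet so the caller can resynchronise.
            raise PacketTimeout(f"got {len(buf)} of {packet_len} bytes")
        byte = read_byte()
        if byte is None:
            time.sleep(0.001)  # nothing available yet; avoid a hot spin
            continue
        buf.append(byte)
    return bytes(buf)
```

The point of raising rather than returning a short buffer is that the caller can discard the broken packet and go back to waiting for a fresh one, instead of every piece of the chain having to be restarted.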
With the most recent versions of everything, the link test fails with the message
nwoods@ayinger /afs/hep.wisc.edu/cms/nwoods/ctp6commander master$ python cli.py status
Traceback (most recent call last):
  File "cli.py", line 167, in <module>
    commands[args.command](hw, args)
  File "cli.py", line 48, in do_status
    status_flags = api.status(hw, args.links)
  File "/afs/hep.wisc.edu/cms/nwoods/ctp6commander/api.py", line 156, in status
    hw.dispatch()
uhal._core.exception:
I'm terribly sorry to have to tell you this, but it appears that there was an exception:
* Exception type: uhal::exception::TcpConnectionFailure
* Description: Exception class to handle the case where the TCP connection was refused or aborted.
* Additional Information:
> ASIO reported an error: Connection refused
* Exception occured in the same thread as that in which it was caught (0x137dce50)
* Exception constructed at time: 2013-10-16 12:39:22.113391
* Exception's what() function called at time: 2013-10-16 12:39:22.113639
softipbus-forward says nothing, even though it's compiled with log level 3.
Meanwhile, when payload.elf is run, it gives the message
1970-01-01 00:00:00 | DEBUG | Setup interrupts okay
1970-01-01 00:00:00 | INFO | Serving memory.
1970-01-01 00:01:44 | INFO | Start size: 0
1970-01-01 00:00:16 | DEBUG | Partial transaction
1970-01-01 00:00:16 | DEBUG | Partial transaction
1970-01-01 00:00:16 | DEBUG | Partial transaction
[... forever]
It spams that partial transaction message forever regardless of whether or not softipbus is running.
softipbus-forward should say something with log-level three - please see if you can figure out why it isn't.
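One quick check that may help alongside the logs: "Connection refused" usually means nothing was accepting connections at the address and port the client tried. A small Python sketch like the following can confirm whether anything is listening. The helper name port_is_open is hypothetical, and the host and port to pass are whatever your uHAL connection settings point at (60002 is the port used in the ssh forwarding elsewhere in this thread).

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("127.0.0.1", 60002) returning False while softipbus-forward is supposedly running would suggest the server never started listening or the tunnel is not reaching it.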
Looking at the Makefile for softipbus (which is the only thing in there I ever changed, IIRC), it looks like we need separate versions for 904 and Chamberlin, so unless you think my version will work properly there, I'm going to hold off on committing those changes for now.
That's fine. What you could do (just for completeness) is copy and paste the output of svn diff in softipbus.
I'm not 100% sure, but there may be a cleaner fix to the python version issue contained in the file cactus/trunk/cactuscore/uhal/pycohal/MANIFEST.in
what python version issue?
The 2.4 vs 2.6 issues, that had us bending over backwards to install new versions of python, set mysterious python path variables, etc.
Ah cool, please post another issue/PR with this. In any case, as long as it works, don't worry too much about installing a good python, it will be obviated when we upgrade to SLC6.
Apologies, I've been out of the office most of the week.
0.0.0.0 is not an IP you can connect to. You need to use something else. Perhaps you meant 127.0.0.1?
On 10/14/2013 05:04 PM, nwoods wrote:
Maybe the -vvv option on ssh gets us somewhere. When I did that, the error message became
~ # /tmp/softipbus-forward
debug1: Connection to port 60002 forwarding to 0.0.0.0 port 60002 requested.
debug2: fd 9 setting TCP_NODELAY
debug2: fd 9 setting O_NONBLOCK
debug3: fd 9 is O_NONBLOCK
debug1: channel 3: new [direct-tcpip]
channel 3: open failed: connect failed:
debug1: channel 3: free: direct-tcpip: listening port 60002 for 0.0.0.0 port 60002, connect from 127.0.0.1 port 46888, nchannels 4
debug3: channel 3: status: The following connections are open:
2 client-session (t4 r0 i0/0 o0/0 fd 6/7 cfd -1)
3 direct-tcpip: listening port 60002 for 0.0.0.0 port 60002, connect from 127.0.0.1 port 46888 (t3 r-1 i0/0 o0/0 fd 9/9 cfd -1)
debug3: channel 3: close_fds r 9 w 9 e -1 c -1
I don't know what that means, but I'm sure someone does...
I connected to the CTP with the command
ssh -vvv -L 60002:0.0.0.0:60002 root@192.168.1.31
"Simple"... I suppose it depends on what you consider simple. You'd need to take the system.xml and follow the board bringup guide, at least as far as producing the petalinux_bsp (you shouldn't need to produce fs-boot itself, though it won't hurt).
There are some.. oddities in that process, issues that need to be corrected along the way. So I think I'd actually say it's not particularly simple. I would offer to do it for you but I'm not sure I can get an X-forwarded connection through to 904 properly.
Unfortunately I'd have to go through it again fully, locally, in order to produce proper instructions.
On 10/14/2013 07:42 AM, Evan K. Friis wrote:
@nwoods
cc @dabelnap @jtikalsky
Hi, we are now working on this same task at 904. @nwoods, can you send us the modifications to the makefile to make it work? @jtikalsky, is there a simple way to build the petalinux config for microblaze so the "correct" way of building works as well?
We've been using 0.0.0.0 for quite a while (always from Ayinger, which might change it) and it's seemed to work for ssh, etc. The TCP parts of this seem to be working, so I doubt that's the problem. Happy to be wrong if it gets fixed, of course...
I can't say you're WRONG, but connecting to 0.0.0.0 is really not something that should work.
0.0.0.0 is the special-case address for 'listen on all addresses'. It's not an address a system should normally be expected to respond to.
If you want to forward port 60002 on your system to 60002 on the card, 127.0.0.1 is the proper IP address for this. "-L60002:127.0.0.1:60002" translates to "Take connections on 60002 locally, forward them to the remote endpoint, then connect them to 127.0.0.1 (localhost), port 60002". I've no idea why 0.0.0.0 worked for you, honestly.
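To spell out how the three-part argument to -L breaks down, here is a small Python sketch. It is not part of ssh or of this repository: parse_local_forward and describe are hypothetical helpers, and they only handle the simple port:host:hostport form (no bind address, no IPv6 brackets).

```python
from typing import NamedTuple


class LocalForward(NamedTuple):
    listen_port: int  # port opened on the machine where ssh is run
    target_host: str  # host the remote end connects to, as seen from the remote side
    target_port: int  # port on target_host


def parse_local_forward(spec: str) -> LocalForward:
    """Parse the simple 'port:host:hostport' form of an ssh -L argument."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected port:host:hostport, got {spec!r}")
    listen_port, target_host, target_port = parts
    return LocalForward(int(listen_port), target_host, int(target_port))


def describe(spec: str) -> str:
    """Render the forward in words, mirroring the explanation above."""
    fwd = parse_local_forward(spec)
    return (f"listen on local port {fwd.listen_port}; the remote end "
            f"connects to {fwd.target_host}:{fwd.target_port}")
```

The key point is that the middle field is interpreted on the remote side of the tunnel, so 127.0.0.1 there means the card itself, not the machine you ran ssh from.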
Hmm, maybe this is the root of all our problems :) @nwoods, maybe 127.0.0.1 will help. Although I think that this probably isn't the type of thing to fail half-way.
Jes, we can compile correctly with a non-specific install of petalinux now. We just hardcode the path to mb-gcc in. I think this is fine, even for the long term, so no painful X-forwarded-over-the-Atlantic :).
Thanks
Evan
@ekfriis 0.0.0.0 isn't the issue. Changing it to 127.0.0.1 produces the same error as we have been discussing.
I think we are reaching a conclusion in the monster successor of this monster thread, see https://github.com/uwcms/cms-calo-layer1/issues/26#issuecomment-26906699
You need to:
make upload
in its directory)
[1] https://svnweb.cern.ch/trac/cactus/browser/trunk/cactuscore/softipbus/README.md
[2] https://github.com/uwcms/ctp6commander