Open RaymiiOrg opened 3 years ago
Wow, you got quite far with DECwindows - my attempts always ended up crashing OpenVMS.
About the issue - I don't see much of a chance of progressing with component bugs before they can be isolated in unit tests, which requires refactoring the code (especially getting rid of global variable use in classes), so I'll try to get that started.
One semi-random thought: as a workaround for #24, which could be the root cause of this issue, you can try setting the delay (sleep) in CDEC21143::run() to a lower value.
Yesterday evening I tried a few things:
Even without the lost packet error, X11 forwarding is still slow and crashes; most often the programs now just hang and then crash with an error:
XIO: fatal IO error 65535 (network partner disconnected logical link) on X server "_WSA1:"
after 685 requests (614 known processed) with 108 events remaining.
%XLIB-F-IOERROR, xlib io error
It does seem that both Xephyr and Xnest get slower over time. If OpenVMS has just booted, it goes well for a few minutes. But the longer it runs, the slower it responds, hangs, etc.
I'm going to see if I can get NetBSD running and try X forwarding there; maybe it makes a difference or helps narrow down the issue.
Without the sleep (commented out) in the thread, the SRM console gave other errors:
Testing the EW* Network
*** Error (ewa0), Mop loop message timed out from: 08-00-2b-3b-42-fd
*** List index: 7 received count: 0 expected count 2
Networking did work inside OpenVMS, however. I'm also going to see if this issue (slowness and crashes) happens on actual hardware (outside of VirtualBox) with 2 interfaces.
> I'm going to see if I can get NetBSD running and try X forwarding there; maybe it makes a difference or helps narrow down the issue.
Saw a report on Twitter stating that X and SSH work with NetBSD but are "slow", depending on the hardware used.
I still have to test with actual hardware; I will report back later.
Got the same problem, just when copying files (with DECnet via TCPIP) to the AXPbox machine. I'm running AXPbox on a Fedora 33 machine with more than one NIC. What I noticed is that the packet lost message shows a src MAC address that I do not know and that cannot be traced on our network.
Probably related: when I leave AXPbox running overnight with OpenVMS booted, somewhere in the night it starts giving the packet loss message every second(?), with some MAC addresses which I do not know as src and FF-FF-FF-FF-FF-FF as dst.
> Networking did work inside OpenVMS, however. I'm also going to see if this issue (slowness and crashes) happens on actual hardware (outside of VirtualBox) with 2 interfaces.
Can confirm this issue also happens without VirtualBox. Two (gigabit) NICs, one for AXPbox (OpenVMS) and one for the PC; networking does work, but X11 has the same slowness. Lost packet messages also appear (but I suspect that is due to gigabit).
Looks like DECnet is much more stable than TCPIP. I have been running 4 X11 applications (ICO, DECW$CLOCK, DECW$MAIL and VUE$MASTER) for more than 1.5 hours and have them displaying on a "real" Alpha running OpenVMS.
I see the same instability when copying files to AXPbox (by "decnet" or "decnet via TCPIP"):
copy .c 19.10"user passw"::[] works OK
copy .com 10.9.9.9"user passw"::[] hangs after a few files.
Very interesting. Once I have a while to do work on AXPbox I'll try to look into this - all this information will definitely help.
> the packet loss message with some MAC addresses, which I do not know as src and FF-FF-FF-FF-FF-FF as dst.
This looks suspiciously like ARP broadcast messages (who-has xxx tell yyy). It might be another device OpenVMS is trying to communicate with. Are they bigger than the 1514 size? That seems large for ARP requests...
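One way to check the ARP theory would be to classify the frames that trigger the message. The sketch below is not AXPbox code: `is_arp_broadcast` is a made-up helper, and it assumes the buffer holds a plain Ethernet II frame in wire order (destination address first, then source, then the big-endian EtherType, where 0x0806 means ARP). A normal ARP request is tiny (a 14-byte header plus a 28-byte payload, padded up to the 60-byte minimum), so an oversized frame that passes this test would be surprising.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of AXPbox: returns true if the raw Ethernet
// frame is an ARP packet sent to the broadcast address FF-FF-FF-FF-FF-FF.
// Assumes Ethernet II wire order: dst(6) + src(6) + ethertype(2) + payload.
bool is_arp_broadcast(const std::uint8_t* frame, std::size_t len) {
    constexpr std::size_t kHeaderLen = 14;
    if (frame == nullptr || len < kHeaderLen) return false;

    // The destination address must be all ones for a broadcast.
    for (std::size_t i = 0; i < 6; ++i) {
        if (frame[i] != 0xFF) return false;
    }

    // EtherType is stored big-endian; 0x0806 identifies ARP.
    const std::uint16_t ethertype =
        static_cast<std::uint16_t>((frame[12] << 8) | frame[13]);
    return ethertype == 0x0806;
}
```

Logging the frame length next to the result of this check would show directly whether the "too large" packets are really ARP or something else that merely uses the broadcast address.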
> Looks like DECnet is much more stable than TCPIP. I have been running 4 X11 applications (ICO, DECW$CLOCK, DECW$MAIL and VUE$MASTER) for more than 1.5 hours and have them displaying on a "real" Alpha running OpenVMS.
Out of personal interest, could you maybe share screenshots of vue, clock and mail? I only get those halfway rendering...
The experiments I reported yesterday were with a modified version of AXPbox: I raised the 1514 to 9000 in both DEC21143.cpp and Ethernet.cpp. I have to do more tests with this.
I'm wondering why ETH_MAX_PACKET_RAW is defined in Ethernet.h but is never used. I think the hard coded 1514 in the .cpp files should be replaced by this one.
1514 is around the maximum Ethernet protocol frame length (the exact length depends on the protocol type). As you see in Ethernet.h, a length of 1514/1518 is used (the first being without CRC, the second one with it), corresponding to this format:
#define ETH_MAX_PACKET_RAW 1514
#define ETH_MAX_PACKET_CRC 1518
struct eth_frame { // ethernet (wire) frame
u8 src[6]; // source address
u8 dst[6]; // destination address
u8 protocol[2]; // protocol
u8 data[1500]; // data: variable 46-1500 bytes
u8 crc_fill[4]; // space for max packet crc
};
struct eth_packet { // ethernet packet
int len; // size of packet
int used; // bytes used (consumed)
u8 frame[ETH_MAX_PACKET_CRC]; // ethernet frame
};
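As a quick sanity check, both constants follow directly from the field sizes in the struct above. This is only a sketch for illustration: the `k...` names and `fits_standard_frame` are made up and do not exist in Ethernet.h.

```cpp
#include <cstddef>

// Sizes of the Ethernet II fields, in bytes (illustrative names only).
constexpr std::size_t kAddrLen    = 6;     // one MAC address
constexpr std::size_t kTypeLen    = 2;     // protocol / EtherType
constexpr std::size_t kMaxPayload = 1500;  // standard (non-jumbo) payload
constexpr std::size_t kCrcLen     = 4;     // frame check sequence

// Two addresses + protocol + payload, first without and then with the CRC.
constexpr std::size_t kMaxRaw = 2 * kAddrLen + kTypeLen + kMaxPayload;
constexpr std::size_t kMaxCrc = kMaxRaw + kCrcLen;

static_assert(kMaxRaw == 1514, "same value as ETH_MAX_PACKET_RAW");
static_assert(kMaxCrc == 1518, "same value as ETH_MAX_PACKET_CRC");

// Hypothetical helper: does a captured frame of `len` bytes (CRC already
// stripped) fit into a standard Ethernet frame?
constexpr bool fits_standard_frame(std::size_t len) {
    return len <= kMaxRaw;
}
```

Anything for which `fits_standard_frame` returns false cannot be a normal frame from the wire and must come from jumbo frames or some form of offloading on the host.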
I'll check both the Ethernet and DEC21143 implementation and try to find any bugs, also doing some small refactoring in the process (like replacing the constants with macros as you mentioned).
So the large packets causing the warning are read from pcap. Setting pcap's snaplen to ETH_MAX_PACKET_CRC removes the warning, but the issue with network instability after some time persists. This makes sense, because it truncates the packets that are too long instead of fragmenting them.
Is this related: https://github.com/the-tcpdump-group/tcpdump/issues/389 - or is there an option in pcap to (re-)assemble packets for us? Back when I worked at an ISP we often had "issues" relating to https://en.wikipedia.org/wiki/Large_send_offload - nowadays there is even OpenStack documentation on it: https://docs.openstack.org/developer/performance-docs/test_plans/hardware_features/hardware_offloads/plan.html
The tcpdump issue is a different one - it concerns packets over 64 kB. Here the problem is that packets larger than ETH_MAX_PACKET_CRC (1518 B) are captured by pcap, likely due to large send offload as you say (libpcap doesn't fragment the packets, see https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/).
I'm not sure, however, whether this is related to the networking slowing down, which could be a problem in the emulated NIC itself.
I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112
Can I help you in any way with testing specific things?
> I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112
Interesting. This does the opposite to what we want here, though. A fragmentation function will have to be added to Ethernet.cpp to support the large packets generated by the Linux networking stack.
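To give an idea of what such a function would have to do, here is a rough sketch of its first step only: deciding where to cut an oversized payload. It is not a patch. `Segment` and `plan_segments` are hypothetical names, and the hard part (rebuilding the IP and TCP headers, sequence numbers and checksums for every piece) is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One planned piece of an oversized payload (hypothetical type).
struct Segment {
    std::size_t offset;  // where the piece starts within the original payload
    std::size_t length;  // how many payload bytes it carries
};

// Splits `payload_len` bytes into consecutive pieces of at most `max_segment`
// bytes each. Only the cut points are computed; no headers are produced.
std::vector<Segment> plan_segments(std::size_t payload_len,
                                   std::size_t max_segment) {
    assert(max_segment > 0);
    std::vector<Segment> out;
    for (std::size_t off = 0; off < payload_len; off += max_segment) {
        const std::size_t remaining = payload_len - off;
        out.push_back({off, remaining < max_segment ? remaining : max_segment});
    }
    return out;
}
```

For TCP over IPv4 without options, `max_segment` would be 1460 (1500 minus 20 bytes of IP header and 20 bytes of TCP header), so a 4000-byte payload would be planned as three pieces of 1460, 1460 and 1080 bytes.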
> Can I help you in any way with testing specific things?
Currently no patch exists, so there's nothing to test. I'll let you know once I get to something.
Judging by the simh pcap networking implementation, LSO is solved there. Maybe porting the entire network emulation from simh would be a reasonable choice.
Sure, 1500 is the normal frame size, but not when some interfaces are set to "jumbo frames"; then the limit is, I think, just under 9000.
With the packet size set to 9000, the machine survived the weekend. However:
- the VMS clock stopped ticking on Friday night at 18.45h (sh tim always gives the same time)
- the console (PuTTY session) hangs; the last message is from Friday 18.37h
I did some debugging and came up with a patch (attached), which makes the network more stable. I was able to start the CDE environment (see screenshot).
For personal interest, could you maybe explain a bit what the patch does?
What happened (and what the patch fixes) is that only partial frames were written to the pcap filter, because the second buffer was not considered when collecting the Ethernet frames in dec21143_tx. It can happen that both buffers to which tdes2 and tdes3 point contain data. When this happens, the data of both buffers has to be combined to get a valid Ethernet frame and hence a valid IP packet. The patch simply checks whether buf2_size is greater than 0 and, if so, appends the data from the buffer pointed to by tdes3 to the current frame.
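In simplified form, the idea looks roughly like the sketch below. This is not the actual patch: `TxBuffers` and `append_descriptor_data` are made-up names, the guest addresses from tdes2 and tdes3 are assumed to be already resolved to host pointers, and tdes3 is assumed to be used as a second data buffer rather than as a pointer to the next descriptor.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one transmit descriptor's data buffers.
struct TxBuffers {
    const std::uint8_t* buf1;  // data at the address held in tdes2
    std::size_t buf1_size;     // number of valid bytes in buf1
    const std::uint8_t* buf2;  // data at the address held in tdes3
    std::size_t buf2_size;     // number of valid bytes in buf2
};

// Appends the data of one descriptor to the frame being collected.
// The second `if` is the essence of the fix: without it, anything stored
// in the second buffer is silently dropped and the frame goes out short.
void append_descriptor_data(std::vector<std::uint8_t>& frame,
                            const TxBuffers& d) {
    if (d.buf1 != nullptr && d.buf1_size > 0) {
        frame.insert(frame.end(), d.buf1, d.buf1 + d.buf1_size);
    }
    if (d.buf2 != nullptr && d.buf2_size > 0) {
        frame.insert(frame.end(), d.buf2, d.buf2 + d.buf2_size);
    }
}
```

A truncated frame like that would presumably be rejected by the receiver and retransmitted, which fits the slowdowns and disconnects reported earlier in this thread.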
Thank you for explaining! I'm going to try it as well.
What is your networking setup? The screenshot looks like os x, do you use a virtual machine?
I do use a virtual machine running on ESXi and yes, I do connect from an OS X machine.
Yes, I'm using FreeBSD virtual machine on ESXi. On this virtual machine I run AXPbox. And yes, I connect from OS X to AXPbox.
With the patch enabled I get new (error) messages in the SRM prompt:
Testing the System
Testing the Network
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 0 received count: 3 expected count 4
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 1 received count: 3 expected count 4
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 2 received count: 2 expected count 4
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 3 received count: 2 expected count 4
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 4 received count: 2 expected count 4
*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef
*** List index: 5 received count: 2 expected count 4
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
It took me a while to get up and running because I forgot the SYSTEM password. Fixed that: https://gist.github.com/RaymiiOrg/d70258c698857659f4fadfa282556ae8 - now able to test the patch in OpenVMS.
This is the branch I'm testing with: https://github.com/RaymiiOrg/axpbox/tree/combine_tdes2_tdes3_buffer_for_valid_ethernet_frame - If you don't want to create a pull request I could do that for you as well, for Tomáš to review.
I can confirm that most of my tests from the first post now run much better (mail, vue, clock):
mcr decw$clock
EDIT/TPU/DISPLAY=DECWINDOWS
Trying a CDE session (run sys$system:decw$startlogin.exe) does take a while to load, but it loads!
Lots of looking at the hourglass cursor. The CPacketQueue(rx_queue):add() errors are gone though.
Looks promising! Doesn't get any further than the blue screen, but still, specific applications do work quite well:
For my own reference:
RUN SYS$SYSTEM:DECW$CALC
RUN SYS$SYSTEM:DECW$CALENDAR
RUN SYS$SYSTEM:DECW$CARDFILER
RUN SYS$SYSTEM:DECW$CLOCK
VIEW/INTERFACE=DECWINDOWS filename
RUN SYS$SYSTEM:DECSOUND
CREATE/TERMINAL=DECTERM
EDIT/TPU/DISPLAY=DECWINDOWS
RUN SYS$SYSTEM:VUE$MASTER
RUN SYS$SYSTEM:DECW$MAIL
RUN SYS$SYSTEM:DECW$MESSAGEPANEL
RUN SYS$SYSTEM:DECW$NOTEPAD
RUN SYS$SYSTEM:DECW$PRINTSCREEN
RUN SYS$SYSTEM:DECW$PAINT
RUN SYS$SYSTEM:DECW$PUZZLE
RUN SYS$SYSTEM:DECW$BOOKREADER
Via: https://vmssoftware.com/products/decwindows-motif/ - Using DECwindows Motif for OpenVMS
I'm glad that it works for you as well. The new error messages you're seeing happen sometimes - and from what I've observed have nothing to do with the patch. I just started AXPbox and I didn't see the errors. The CPacketQueue(rx_queue):add() error isn't entirely fixed. When there's heavy network use it can happen again. The patch improves the overall network stability because fewer retransmits are sent to the network. I'll try to find a way to improve the CPacketQueue(rx_queue):add() error situation, though.
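For anyone wondering why the message comes back under load: my assumption (I have not checked the real CPacketQueue code) is that it is printed when a received packet has to be dropped because the receive queue is full. A minimal sketch of that behaviour, with the made-up name `BoundedPacketQueue`:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity packet queue, NOT the real CPacketQueue.
// When the consumer (the emulated NIC) falls behind, add() starts failing
// and the packet is lost, which is where a "packet lost" message would go.
class BoundedPacketQueue {
public:
    explicit BoundedPacketQueue(std::size_t capacity) : capacity_(capacity) {}

    // Returns false and counts a drop if the queue is already full.
    bool add(std::vector<std::uint8_t> packet) {
        if (queue_.size() >= capacity_) {
            ++dropped_;
            return false;
        }
        queue_.push_back(std::move(packet));
        return true;
    }

    // Moves the oldest packet into `out`; returns false if the queue is empty.
    bool get(std::vector<std::uint8_t>& out) {
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

    std::size_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    std::size_t dropped_ = 0;
    std::deque<std::vector<std::uint8_t>> queue_;
};
```

With a design like this, a larger capacity only delays the drops. They stop only when the consumer keeps up, or when less traffic arrives in the first place, for example because fewer retransmits are triggered.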
If you don't mind, could you please do the pull request for me - Thanks a million
Did that here: #60 .
Could you tell me a bit more about your setup? What are the VM specs in ESXi, what NICs are emulated and what OpenVMS version are you running (8.3 or 8.4 VSI)? I can't get the CDE session to start; after login and the blue starting screen, nothing happens.
The VM I use to run AXPBox, along with a VAX on simh and a SPARC station on qemu has 4 CPUs, 8GB RAM and two NICs. Adapter type of the NICs is E1000. I use one NIC exclusively for AXPBox. The VAX and the SPARC station share the other NIC.
I use openvms 8.4 vsi.
I connected via Xnest :1 -ac -query from my Mac using Xquartz as the Xserver. I don’t get DECW$STARTLOGIN.EXE as the login manager. Some ancient stripped down login manager is used ( just a white box with username/passwd fields in it). When I login I don’t see the blue window you’re getting.
Thanks for the information. My vm has way less resources so that is something to look at.
How do you start the desktop environment if not via startlogin.exe?
Xdmcp. When you start the XServer with the -query option you should get a login window. After you login dtsession starts which then loads CDE.
I'm starting Xnest with the following command:
Xnest :13 -ac -full -query AXP-VIE-01
After some time I get this login screen:
Hmm, I'm not getting any further. These are my commands. Local X server:
Xephyr -screen 1024x786 -ac -query 0.0.0.0 :1
On OpenVMS:
$ set display/create/node=x.x.x.x/transport=tcpip/server=1
$ show log cde$sessionmain
"CDE$SESSIONMAIN" = "mcr cde$system_defaults:[bin]dtsession" (LNM$SYSTEM_TABLE)
$ mcr cde$system_defaults:[bin]dtsession
Then I get a different screen in the Xserver, cursor seems to work:
After a while that just goes back to black (also no cursor):
This error popped up in the OpenVMS console:
-> -SYSTEM-F-LINKDISCON, network partner disconnected logical link
The local X server also logs an error:
XDM: too many retransmissions, declaring session dead
The -query parameter needs the IP address or name of your AXP box, not your local X server. Like so:
Xephyr -screen 1024x786 -ac -query <IP address of AXP box> :1
That gives me the same behaviour: first the other coloured background, then an "XDM: too many retransmissions, declaring session dead" and the black screen. Both nodes are in the same network and can ping one another:
Are the dtsession startup commands correct?
Using these commands I was able to login into CDE:
linux $ Xephyr -screen 1024x768 :1 -ac -listen tcp
vms $ set disp/creat/node=192.168.122.1/trans=tcpip/server=1/exec
vms $ run sys$system:decw$startlogin
I had my VirtualBox set to emulate a PCNET 10/100 Mbit network card. When I changed that to an Intel gigabit desktop one, I was able to start the GUI as well with the above commands. The packet lost errors are back, however, but the UI does work. So cool, very exciting!
Version 7.3 on Alpha seems a bit more responsive on the GUI side:
Even gaming is possible and quite responsive 😀
Tetris from here: https://www.digiater.nl/openvms/freeware/v40/tetris312/
You can then perhaps try to play Doom on OpenVMS :) I had tons of fun doing this a while ago https://astr0baby.wordpress.com/2019/03/07/compiling-prboom-on-openvms-8-4-alpha/
Time to test AXPbox on the Nvidia Jetson Nano some more - this is exciting news :)
Same problem - no progress :( Downloaded and compiled current version and can't get DECW* anything as it always times-out :(
Saw the below error including unstable / slow network while testing DECwindows via X forwarding.
On a Debian machine in the same network (no firewalls in between), I started an X server:
Set up the remote display in OpenVMS:
(10.0.2.15 is the Debian VM, 1 is the X display (:1))
Started an application on OpenVMS:
Or multiple, mail & file manager:
Most often it works speedily:
The ICO program moves smoothly, but cannot show that on a screenshot:
But after a few minutes, it became quite slow, even crashing:
This was the X server's output:
On the AXPbox command line window: