Slow networking - GitHub issues

RaymiiOrg commented 3 years ago

Saw the error below, along with an unstable and slow network, while testing DECwindows via X forwarding.

On a Debian machine on the same network (no firewalls in between), I started an X server:

Xephyr -screen 1024x786 -ac -query 0.0.0.0 :1

Set up the remote display in OpenVMS:

set display/create/node=10.0.2.15/transport=tcpip/server=1

(10.0.2.15 is the Debian VM, 1 is the X display, :1)

Started an application on OpenVMS:

RUN DECW$EXAMPLES:ICO.EXE

Or multiple, mail & file manager:

SPAWN/NOWAIT/INPUT=NL: RUN SYS$SYSTEM:DECW$MAIL.EXE
SPAWN/NOWAIT/INPUT=NL: RUN SYS$SYSTEM:VUE$MASTER

image

Most often it works quickly:

image

The ICO program moves smoothly, but I cannot show that in a screenshot:

image

But after a few minutes it became quite slow, and even crashed:

image

This was the X server's output:

image

On the AXPbox command line window:


CPacketQueue(rx_queue):add() packet lost! Size = 4314.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02

lenticularis39 commented 3 years ago

Wow, you got quite far with DECwindows - my attempts always ended up crashing OpenVMS.

About the issue - I don't see much of a chance of progressing with component bugs before they can be isolated in unit tests, which requires refactoring the code (especially getting rid of global variable use in classes), so I'll try to get that started.

lenticularis39 commented 3 years ago

One semi-random thought: As a workaround for #24, which could be the root cause of this issue, you can try setting the delay (sleep) in CDEC21143::run() to a lower value.

RaymiiOrg commented 3 years ago

Yesterday evening I tried a few things:

A lower delay (every item from 10 to 1) does not make the network go faster, nor does it make the error go away.
The packet size is checked in Ethernet.cpp::add_tail (it can't be more than 1514), and it seems the packet is simply too large. I tried to change a few OpenVMS parameters related to MTU, without success. That Ethernet code also checks a config parameter (queue); I set that to 1024 in es40.cfg, which didn't help (I didn't expect it to, since the problem is the packet size).
What made the packet queue "packet lost" error go away was changing the network adapter in VirtualBox to a 100 Mbit model instead of a gigabit adapter.
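To illustrate the size check described above, here is a minimal sketch of a bounded receive queue that rejects frames over 1514 bytes. The `RxQueue` class, its `add` method and the capacity argument are hypothetical names for illustration, not the actual CPacketQueue code in Ethernet.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for the emulator's receive queue; not AXPbox code.
constexpr std::size_t kMaxFrameRaw = 1514;  // frame size limit without CRC

class RxQueue {
public:
  explicit RxQueue(std::size_t capacity) : capacity_(capacity) {}

  // Returns false (packet lost) when the frame is oversized or the queue is full.
  bool add(const std::vector<std::uint8_t>& frame) {
    if (frame.size() > kMaxFrameRaw || frames_.size() >= capacity_) {
      ++lost_;
      return false;
    }
    frames_.push_back(frame);
    return true;
  }

  std::size_t size() const { return frames_.size(); }
  std::size_t lost() const { return lost_; }

private:
  std::size_t capacity_;
  std::size_t lost_ = 0;
  std::deque<std::vector<std::uint8_t>> frames_;
};
```

With a check like this, a larger queue capacity cannot help a 2894-byte or 4314-byte frame: it is rejected on size before the capacity is even relevant, which would be consistent with the queue setting making no difference.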

Without the lost packet error, there is still slowness and there are crashes when X11 forwarding. Most often the programs now just hang and crash with an error:

XIO:  fatal IO error 65535 (network partner disconnected logical link) on X server "_WSA1:"
      after 685 requests (614 known processed) with 108 events remaining. 
%XLIB-F-IOERROR, xlib io error

It does seem that both Xephyr and Xnest get slower over time. If OpenVMS has just booted, it goes well for a few minutes. But the longer it runs, the slower it responds, until it hangs.

I'm going to see if I can get netbsd running and try x forwarding there, maybe it makes a difference or help narrow down issues.

RaymiiOrg commented 3 years ago

Without the sleep (commented out) in the thread, the SRM console gave other errors:

Testing the EW* Network
*** Error (ewa0), Mop loop message timed out from: 08-00-2b-3b-42-fd
*** List index: 7 received count: 0 expected count 2

Networking did work however inside OpenVMS. I'm also going to see if this issue (slowness and crashes) happen on actual hardware (outside of virtualbox) with 2 interfaces.

RaymiiOrg commented 3 years ago

I'm going to see if I can get netbsd running and try x forwarding there, maybe it makes a difference or help narrow down issues.

Saw a report on Twitter stating that X and ssh work with NetBSD but are "slow", depending on the hardware used: (image)

I still have to test with actual hardware, and will report back later.

joukj commented 3 years ago

Got the same problem, just when copying files (with DECnet via TCPIP) to the axpbox machine. I'm running axpbox on a Fedora 33 machine with more than one NIC. What I noticed is that the packet lost message shows a MAC address as src that I do not know and that cannot be traced on our network.

Probably related: when I leave axpbox running overnight with OpenVMS booted, somewhere in the night it starts giving the packet loss message every second(?), with some MAC addresses I do not know as src and FF-FF-FF-FF-FF-FF as dst.

RaymiiOrg commented 3 years ago

Networking did work however inside OpenVMS. I'm also going to see if this issue (slowness and crashes) happen on actual hardware (outside of virtualbox) with 2 interfaces.

Can confirm this issue also happens without VirtualBox. With two (gigabit) NICs, one for AXPbox (OpenVMS) and one for the PC, networking does work, but X11 has the same slowness. Lost packet messages also appear (but I suspect that is due to gigabit).

joukj commented 3 years ago

Looks like DECnet is much more stable than TCPIP. I have been running 4 X11 applications (ICO, DecW$clock, Decw$mail and vue$master) for more than 1.5 hours already, and have them displaying on a "real" Alpha running OpenVMS.

I see the same instability when copying files to axpbox (by "decnet" or "decnet via TCPIP"):

copy *.c 19.10"user passw"::[] works OK
copy *.com 10.9.9.9"user passw"::[] hangs after a few files.

lenticularis39 commented 3 years ago

Very interesting. Once I have a while to do work on AXPbox I'll try to look into this - all this information will definitely help.

RaymiiOrg commented 3 years ago

the packet loss message every second(?), with some MAC addresses I do not know as src and FF-FF-FF-FF-FF-FF as dst.

This looks suspiciously like ARP broadcast messages (who-has xxx tell yyy). It might be another device openvms is trying to communicate with. Are they bigger than the 1514 size? That seems large for ARP requests....
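One way to check this guess would be to classify the dropped frames by destination and EtherType. A minimal sketch, assuming the raw frame bytes are available in wire order (destination address first); the `looks_like_arp_broadcast` helper is a hypothetical name, not something in AXPbox.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if the frame is addressed to FF-FF-FF-FF-FF-FF
// and carries EtherType 0x0806 (ARP).
bool looks_like_arp_broadcast(const std::vector<std::uint8_t>& frame) {
  if (frame.size() < 14) return false;  // shorter than an Ethernet header
  for (std::size_t i = 0; i < 6; ++i) {
    if (frame[i] != 0xFF) return false;  // destination is not broadcast
  }
  return frame[12] == 0x08 && frame[13] == 0x06;
}
```

An ARP request over Ethernet and IPv4 is 42 bytes before padding, so a broadcast frame that trips the 1514-byte limit would have to be something other than plain ARP.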

RaymiiOrg commented 3 years ago

Looks like DECnet is much more stable than TCPIP. I have been running 4 X11 applications (ICO, DecW$clock, Decw$mail and vue$master) for more than 1.5 hours already, and have them displaying on a "real" Alpha running OpenVMS.

Out of personal interest, could you maybe share screenshots of vue, clock and mail? I only get those halfway rendering...

joukj commented 3 years ago

The experiments I reported yesterday were with a modified version of axpbox: I raised the 1514 to 9000 in both DEC21143.cpp and Ethernet.cpp. I have to do more tests with this.

I'm wondering why ETH_MAX_PACKET_RAW is defined in Ethernet.h but is never used. I think the hard coded 1514 in the .cpp files should be replaced by this one.

lenticularis39 commented 3 years ago

I'm wondering why ETH_MAX_PACKET_RAW is defined in Ethernet.h but is never used. I think the hard coded 1514 in the .cpp files should be replaced by this one.

1514 is around the maximum Ethernet frame length (the exact length depends on the protocol type). As you can see in Ethernet.h, lengths of 1514 and 1518 are used (the first without the CRC, the second with it), corresponding to this format:

image

#define ETH_MAX_PACKET_RAW 1514
#define ETH_MAX_PACKET_CRC 1518

struct eth_frame { // ethernet (wire) frame
  u8 src[6];       // source address
  u8 dst[6];       // destination address
  u8 protocol[2];  // protocol
  u8 data[1500];   // data: variable 46-1500 bytes
  u8 crc_fill[4];  // space for max packet crc
};

struct eth_packet {             // ethernet packet
  int len;                      // size of packet
  int used;                     // bytes used (consumed)
  u8 frame[ETH_MAX_PACKET_CRC]; // ethernet frame
};
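As a quick check of where the two constants come from, the field sizes in the struct above can be added up at compile time. This is only an illustrative sketch; the `k...` names are made up here and do not appear in Ethernet.h.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAddrLen = 6;        // one MAC address
constexpr std::size_t kTypeLen = 2;        // protocol field
constexpr std::size_t kMaxPayload = 1500;  // largest data field
constexpr std::size_t kCrcLen = 4;         // trailing CRC

// Two addresses + protocol + data, then the same with the CRC appended.
constexpr std::size_t kMaxRaw = 2 * kAddrLen + kTypeLen + kMaxPayload;
constexpr std::size_t kMaxCrc = kMaxRaw + kCrcLen;

static_assert(kMaxRaw == 1514, "matches ETH_MAX_PACKET_RAW");
static_assert(kMaxCrc == 1518, "matches ETH_MAX_PACKET_CRC");
```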

I'll check both the Ethernet and DEC21143 implementation and try to find any bugs, also doing some small refactoring in the process (like replacing the constants with macros as you mentioned).

lenticularis39 commented 3 years ago

So the large packets causing the warning are read from pcap. Setting pcap's snaplen to ETH_MAX_PACKET_CRC removes the warning, but the issue with network instability after some time persists. This makes sense, because it truncates the packets that are too long instead of fragmenting them.

RaymiiOrg commented 3 years ago

Is this related: https://github.com/the-tcpdump-group/tcpdump/issues/389 - or is there an option in pcap to (re-)assemble packets for us? Back when I worked at an ISP we often had "issues" relating to https://en.wikipedia.org/wiki/Large_send_offload - nowadays there is even OpenStack documentation on it: https://docs.openstack.org/developer/performance-docs/test_plans/hardware_features/hardware_offloads/plan.html

lenticularis39 commented 3 years ago

The tcpdump issue is a different one - it concerns packets over 64 kB. Here the problem is that packets larger than ETH_MAX_PACKET_CRC (1518 B) are captured by pcap, likely due to large send offload as you say (libpcap doesn't fragment the packets, see https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/).

lenticularis39 commented 3 years ago

However, I'm not sure whether this is related to the networking slowing down, which could be a problem in the emulated NIC itself.

RaymiiOrg commented 3 years ago

The tcpdump issue is a different one - it concerns packets over 64 kB. Here the problem is that packets larger than ETH_MAX_PACKET_CRC (1518 B) are captured by pcap, likely due to large send offload as you say (libpcap doesn't fragment the packets, see https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/).

I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112

Can I help you in any way with testing specific things?

lenticularis39 commented 3 years ago

I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112

Interesting. This does the opposite to what we want here, though. A fragmentation function will have to be added to Ethernet.cpp to support the large packets generated by the Linux networking stack.

Can I help you in any way with testing specific things?

Currently no patch exists, so there's nothing to test. I'll let you know once I get to something.
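The fragmentation function mentioned above could start from something like the following. This is only a rough sketch under a strong simplifying assumption: it splits the bytes after the 14-byte Ethernet header into chunks of at most 1500 bytes and repeats the header on each chunk. It does not rewrite IP or TCP headers (lengths, sequence numbers, checksums), which a real fix for large send offload would need to do. `split_oversized` and `Frame` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Frame = std::vector<std::uint8_t>;

constexpr std::size_t kHeaderLen = 14;     // two addresses + protocol
constexpr std::size_t kMaxPayload = 1500;  // data bytes per frame

// Hypothetical helper: splits one oversized frame into frames that each fit
// in 1514 bytes by repeating the Ethernet header. It does NOT fix up IP or
// TCP headers, so the output is not yet valid traffic.
std::vector<Frame> split_oversized(const Frame& big) {
  std::vector<Frame> out;
  if (big.size() <= kHeaderLen + kMaxPayload) {
    out.push_back(big);  // already fits, pass through unchanged
    return out;
  }
  for (std::size_t off = kHeaderLen; off < big.size(); off += kMaxPayload) {
    const std::size_t n = std::min(kMaxPayload, big.size() - off);
    Frame part(big.data(), big.data() + kHeaderLen);  // copy the header
    part.insert(part.end(), big.data() + off, big.data() + off + n);
    out.push_back(part);
  }
  return out;
}
```

For the 2894-byte frames in the log, this yields two frames of 1514 and 1394 bytes; the 4314-byte frame becomes three.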

lenticularis39 commented 3 years ago

Judging by the simh pcap networking implementation, LSO is handled there. Maybe porting the entire network emulation from simh would be a reasonable choice.

joukj commented 3 years ago

Sure, 1500 is the normal frame size, but not when some interfaces are set to "jumbo frames"; then the limit is, I think, just under 9000.

joukj commented 3 years ago

The machine, with the packet size set to 9000, survived the weekend. However:
- the VMS clock stopped ticking on Friday night at 18.45h (sh tim always gives the same time)
- the console (PuTTY session) hangs; the last message is from Friday 18.37h

dmzettl commented 3 years ago

I did some debugging and came up with a patch (attached), which makes the network more stable. I was able to start the CDE environment (see screenshot).

DEC21143.patch.txt

RaymiiOrg commented 3 years ago

I did some debugging and came up with a patch (attached), which makes the network more stable. I was able to start the CDE environment (see screenshot).

DEC21143.patch.txt

For personal interest, could you maybe explain a bit what the patch does?

dmzettl commented 3 years ago

For personal interest, could you maybe explain a bit what the patch does?

What happened (and what the patch fixes) is that only partial frames were written to the pcap filter, because the second buffer was not considered when collecting the Ethernet frames in dec21143_tx. It can happen that both buffers to which tdes2 and tdes3 point contain data. When this happens, the data of both buffers has to be combined to get a valid Ethernet frame and hence IP packet. The patch simply checks whether buf2_size is greater than 0 and, if so, appends the data from the buffer pointed to by tdes3 to the current frame.
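The idea can be sketched roughly as follows. This is not the attached patch: the `TxDescriptor` struct, the `append` and `collect` helpers and the `memory` vector standing in for guest RAM are all hypothetical. The bit positions for the two buffer sizes (bits 10:0 and 21:11 of the control word) are an assumption about the 21143 descriptor layout, and chained-descriptor mode is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified transmit descriptor; not the AXPbox source.
struct TxDescriptor {
  std::uint32_t tdes1;  // control word holding both buffer sizes
  std::uint32_t tdes2;  // address of buffer 1
  std::uint32_t tdes3;  // address of buffer 2
};

// Assumed layout: buffer 1 size in bits 10:0, buffer 2 size in bits 21:11.
std::size_t buf1_size(const TxDescriptor& d) { return d.tdes1 & 0x7FF; }
std::size_t buf2_size(const TxDescriptor& d) { return (d.tdes1 >> 11) & 0x7FF; }

// Copies len bytes at addr from the emulated memory onto the end of frame.
bool append(const std::vector<std::uint8_t>& memory, std::size_t addr,
            std::size_t len, std::vector<std::uint8_t>& frame) {
  if (addr > memory.size() || len > memory.size() - addr) return false;
  frame.insert(frame.end(), memory.data() + addr, memory.data() + addr + len);
  return true;
}

// Collects the data of one descriptor into the frame being assembled.
bool collect(const TxDescriptor& d, const std::vector<std::uint8_t>& memory,
             std::vector<std::uint8_t>& frame) {
  if (!append(memory, d.tdes2, buf1_size(d), frame)) return false;
  const std::size_t n2 = buf2_size(d);
  if (n2 > 0) {  // the step described above: do not ignore the second buffer
    return append(memory, d.tdes3, n2, frame);
  }
  return true;
}
```

Without the `n2 > 0` branch, a descriptor that spreads a frame over both buffers would produce only the first part, which matches the partial frames described above.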

RaymiiOrg commented 3 years ago

For personal interest, could you maybe explain a bit what the patch does?

What happened (and what the patch fixes) is that only partial frames were written to the pcap filter, because the second buffer was not considered when collecting the Ethernet frames in dec21143_tx. It can happen that both buffers to which tdes2 and tdes3 point contain data. When this happens, the data of both buffers has to be combined to get a valid Ethernet frame and hence IP packet. The patch simply checks whether buf2_size is greater than 0 and, if so, appends the data from the buffer pointed to by tdes3 to the current frame.

Thank you for explaining! I'm going to try it as well.

What is your networking setup? The screenshot looks like OS X; do you use a virtual machine?

dmzettl commented 3 years ago

I do use a virtual machine running on ESXi and yes, I do connect from an OS X machine.

For personal interest, could you maybe explain a bit what the patch does?

What happened (and what the patch fixes) is that only partial frames were written to the pcap filter, because the second buffer was not considered when collecting the Ethernet frames in dec21143_tx. It can happen that both buffers to which tdes2 and tdes3 point contain data. When this happens, the data of both buffers has to be combined to get a valid Ethernet frame and hence IP packet. The patch simply checks whether buf2_size is greater than 0 and, if so, appends the data from the buffer pointed to by tdes3 to the current frame.

Thank you for explaining! I'm going to try it as well.

What is your networking setup? The screenshot looks like OS X; do you use a virtual machine?

Yes, I'm using a FreeBSD virtual machine on ESXi. On this virtual machine I run AXPbox. And yes, I connect from OS X to AXPbox.

RaymiiOrg commented 3 years ago

With the patch enabled I get new (error) messages in the SRM prompt:

Testing the System
Testing the Network

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 0 received count: 3 expected count 4

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 1 received count: 3 expected count 4

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 2 received count: 2 expected count 4

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 3 received count: 2 expected count 4

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 4 received count: 2 expected count 4

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 5 received count: 2 expected count 4

Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef

It took me a while to get up and running because I forgot the SYSTEM password. Fixed that: https://gist.github.com/RaymiiOrg/d70258c698857659f4fadfa282556ae8 - now able to test the patch in OpenVMS.

This is the branch I'm testing with: https://github.com/RaymiiOrg/axpbox/tree/combine_tdes2_tdes3_buffer_for_valid_ethernet_frame - If you don't want to create a pull request I could do that for you as well, for Tomáš to review.

I can confirm that most of my tests from the first post now run much better (mail, vue, clock):

mcr decw$clock

image

EDIT/TPU/DISPLAY=DECWINDOWS

image

Trying a CDE session (run sys$system:decw$startlogin.exe) does take a while to load, but it loads!

image

Lots of looking at the hourglass cursor. The CPacketQueue(rx_queue):add() messages are gone though.

Looks promising! It doesn't get any further than the blue screen, but still, specific applications do work quite well.

For my own reference:

Calculator: RUN SYS$SYSTEM:DECW$CALC
Calendar: RUN SYS$SYSTEM:DECW$CALENDAR
Cardfiler: RUN SYS$SYSTEM:DECW$CARDFILER
Clock: RUN SYS$SYSTEM:DECW$CLOCK
CDA Viewer: VIEW/INTERFACE=DECWINDOWS filename
DECsound: RUN SYS$SYSTEM:DECSOUND
DECterm: CREATE/TERMINAL=DECTERM
EVE: EDIT/TPU/DISPLAY=DECWINDOWS
FileView: RUN SYS$SYSTEM:VUE$MASTER
Mail: RUN SYS$SYSTEM:DECW$MAIL
Message Panel: RUN SYS$SYSTEM:DECW$MESSAGEPANEL
Notepad: RUN SYS$SYSTEM:DECW$NOTEPAD
Print Screen: RUN SYS$SYSTEM:DECW$PRINTSCREEN
Paint: RUN SYS$SYSTEM:DECW$PAINT
Puzzle: RUN SYS$SYSTEM:DECW$PUZZLE
Bookreader: RUN SYS$SYSTEM:DECW$BOOKREADER

Via: https://vmssoftware.com/products/decwindows-motif/ - Using DECwindows Motif for OpenVMS

dmzettl commented 3 years ago

I'm glad that it works for you as well. The new error messages you're seeing happen sometimes - and from what I've observed have nothing to do with the patch. I just started AXPbox and I didn't see the errors. The CPacketQueue(rx_queue):add() error isn't entirely fixed. When there's heavy network use it can happen again. The patch improves the overall network stability because fewer retransmits are sent to the network. I'll try to find a way to improve the CPacketQueue(rx_queue):add() error situation, though.

dmzettl commented 3 years ago

If you don't mind, could you please do the pull request for me - Thanks a million

RaymiiOrg commented 3 years ago

If you don't mind, could you please do the pull request for me - Thanks a million

Did that here: #60 .

Could you tell me a bit more on your setup? What are the VM specs in esxi, what nics are emulated and what openvms version are you running? (8.3 or 8.4 vsi). I can't get the cde session to start, after login and the blue starting screen , nothing happens.

dmzettl commented 3 years ago

The VM I use to run AXPBox, along with a VAX on simh and a SPARC station on qemu has 4 CPUs, 8GB RAM and two NICs. Adapter type of the NICs is E1000. I use one NIC exclusively for AXPBox. The VAX and the SPARC station share the other NIC.

I use openvms 8.4 vsi.

I connected via Xnest :1 -ac -query from my Mac using Xquartz as the Xserver. I don’t get DECW$STARTLOGIN.EXE as the login manager. Some ancient stripped down login manager is used ( just a white box with username/passwd fields in it). When I login I don’t see the blue window you’re getting.

RaymiiOrg commented 3 years ago

I connected via Xnest :1 -ac -query from my Mac using Xquartz as the Xserver. I don’t get DECW$STARTLOGIN.EXE as the login manager. Some ancient stripped down login manager is used ( just a white box with username/passwd fields in it). When I login I don’t see the blue window you’re getting.

Thanks for the information. My vm has way less resources so that is something to look at.

How do you start the desktop environment if not via startlogin.exe?

dmzettl commented 3 years ago

Xdmcp. When you start the XServer with the -query option you should get a login window. After you login dtsession starts which then loads CDE.

dmzettl commented 3 years ago

I'm starting Xnest with the following command:

Xnest :13 -ac -full -query AXP-VIE-01

After some time I get this login screen:

Screenshot 2021-01-23 at 13 07 03

RaymiiOrg commented 3 years ago

Hmm, I'm not getting any further. These are my commands. Local X server:

 Xephyr -screen 1024x786 -ac -query 0.0.0.0 :1

On OpenVMS:

$ set display/create/node=x.x.x.x/transport=tcpip/server=1

$ show log cde$sessionmain
   "CDE$SESSIONMAIN" = "mcr cde$system_defaults:[bin]dtsession" (LNM$SYSTEM_TABLE)

$ mcr cde$system_defaults:[bin]dtsession

Then I get a different screen in the X server, and the cursor seems to work:

image

After a while that just goes back to black (also no cursor):

image

This error popped up in the OpenVMS console:

-> -SYSTEM-F-LINKDISCON, network partner disconnected logical link

The local X server also logs an error:

XDM: too many retransmissions, declaring session dead

dmzettl commented 3 years ago

The -query parameter needs the IP address or name of your AXP box, not your local X server. Like so:

Xephyr -screen 1024x786 -ac -query **<IP address of AXP box>** :1

RaymiiOrg commented 3 years ago

The -query parameter needs the IP address or name of your AXP box, not your local X server. Like so:

Xephyr -screen 1024x786 -ac -query **<IP address of AXP box>** :1

That gives me the same behaviour: first the differently coloured background, then "XDM: too many retransmissions, declaring session dead" and the black screen. Both nodes are in the same network and can ping one another:

image

Are the dtsession startup commands correct?

lenticularis39 commented 3 years ago

Using these commands I was able to log in to CDE:

linux $ Xephyr -screen 1024x768 :1 -ac -listen tcp
vms $ set disp/creat/node=192.168.122.1/trans=tcpip/server=1/exec
vms $ run sys$system:decw$startlogin

Screenshot from 2021-01-24 14-06-51 Screenshot from 2021-01-24 14-09-35 Screenshot from 2021-01-24 14-18-41

RaymiiOrg commented 3 years ago

I had VirtualBox set to emulate a PCNET 10/100 Mbit network card. When I changed that to an Intel gigabit desktop one, I was able to start the GUI as well with the above commands. The packet lost errors are back then, however, but the UI does work. So cool, very exciting!

Version 7.3 on Alpha seems a bit more responsive on the GUI side:

image