simh / simh

The Computer History Simulation Project
http://simh.trailing-edge.com

VAXstation3100m38: LANCE Ethernet large packet retransmission #698

Open agn453 opened 5 years ago

agn453 commented 5 years ago

Hello Mark (and Matt).

This is going to be a bit of a story...

I'm trying to nail down an issue with "spurious dots for displayed space characters" and corrupted display titles from the GPX graphics console using the vaxstation3100m38 emulator (OpenVMS VAX V7.3 with DECwindows MOTIF V1.2-5 and Multinet V5.5).

To convince myself that it's a GPX issue and not something with the DECwindows MOTIF installation, I thought I'd try connecting through Multinet V5.5's XDM session manager from Xcursion V7.2.177 running under Windows XP in a VirtualBox emulation. I've previously used this environment to access DECwindows MOTIF as a remote display from a SIMH VAX running on a Raspberry Pi.

I am unable to connect to the DECwindows MOTIF login screen on the vaxstation3100m38 emulator from Xcursion. It partially displays the login screen (without the "COMPAQ" logo from DECwindows MOTIF V1.2-5), delays a bit then disconnects.

I have compared the Multinet V5.5 configuration with a working system that I use with a SIMH VAXserver 3900 emulation and it all seems configured correctly.

Examining the output of Multinet V5.5's "multinet show/conn=proc/send=name", I see a large number of unacknowledged SndQ items to the X11 TCP/IP port on the emulated VAXstation. This usually indicates a connectivity issue for TCP.

I fired up Wireshark on my Mac and captured the packets between the Xcursion system and the emulated VAX. The trace shows TCP duplicate ACKs, TCP Out-of-Order segments and retransmitted packets (when the packet size is 1498 bytes) to the TCP/IP X11 port (6000). Small packets and other protocols (e.g. DECnet and LAT) don't seem to be affected.

I've built the vaxstation3100m38 emulator under Microsoft Visual Studio 2015 (Community version) and am running it under a debugging instance on a PC running Windows 10 1903. However, the "set xs debug" capability in the vaxstation3100m38 emulator doesn't seem to be implemented, so I'm stuck trying to compare what should have been sent to the Ethernet with the Wireshark trace.

I realise Matt Burke probably has much more on his to-do list for the new emulators - so I don't expect this issue to be remedied soon. I can offer help with testing and debugging and can provide the Wireshark trace if you need to see it too. My level of expertise with the LANCE hardware and emulation is close to nil though!

sim> show version
VAXstation 3100 M38/GPX (KA42-B) simulator V4.0-0 Current
    Simulator Framework Capabilities:
        64b data
        64b addresses
        Threaded Ethernet Packet transports:PCAP:NAT:UDP
        Idle/Throttling support is available
        Virtual Hard Disk (VHD) support
        RAW disk and CD/DVD ROM support
        Asynchronous I/O support (Lock free asynchronous event queue)
        Asynchronous Clock support
        FrontPanel API Version 12
    Host Platform:
        Compiler: Microsoft Visual C++ 19.00.24215.01
        Simulator Compiled as C arch: x86 (Debug Build) on May  6 2019 at 09:23:46
        Memory Access: Little Endian
        Memory Pointer Size: 32 bits
        Large File (>2GB) support
        SDL Video support: SDL Version 2.0.8
        PCRE RegEx (Version 8.36 2014-09-26) support for EXPECT commands
        OS clock resolution: 1ms
        Time taken by msleep(1): 1ms
        OS: Microsoft Windows [Version 10.0.18362.86]
        Architecture: x86 on AMD64, Processors: 8
        Processor Id: Intel64 Family 6 Model 30 Stepping 5, GenuineIntel, Level: 6, Revision: 1e05
        git commit id: 287655da
        git commit time: 2019-05-04T14:12:42-07:00

Debug build under Microsoft Visual Studio Community 2015

9track commented 5 years ago

I managed to replicate this issue with a slightly different setup. I'm running on Debian 9.5 and using Xephyr for the X server. Under VMS I'm using TCP/IP Services rather than Multinet. I seem to be able to get further, because I can log in to DECwindows and launch applications. It all appears to be running normally, but when I check Wireshark I can see the duplicate ACKs and retransmissions.

Looking through vax_xs.c I found a number of obvious problems, particularly with chaining of larger packets, however having fixed these it doesn't seem to have made any difference. The debug for the LANCE should work (although I found problems there too). Did you remember to specify the destination for the debug messages?

sim> set console debug=xs_debug.log
sim> set xs debug

or you can even use

sim> set console debug=stdout

agn453 commented 5 years ago

Thanks Matt.

I looked into this during the weekend too and found similar symptoms when using ftp with large files - so it's not just X11 traffic.

What I think is happening is that a burst of large received packets arrives faster than the emulation can process it and overruns the simulated receive buffering. I've only just started comparing the code logic in vax_xs.c with that in the pdp11_xu.c (DEUNA/DELUA) and pdp11_xq.c (DEQNA/DELQA) modules, since these don't cause similar symptoms with the VAXserver 3900 and VAX-11/780 emulators. Perhaps some of the changes in the latter need to be implemented in the vax_xs module too. Hopefully I'll be able to look more into this by the end of the week.

Regarding debugging - I was trying "set xs debug" on its own, as I've not done debugging of this kind with SIMH in a very long time (I was expecting output via stderr to the command console by default). Then I discovered the "set debug debug.log" command and it all sprang to life!

Also, thanks for the pointers you provided on the simh mailing list to the LANCE Am7990 datasheets and the VAXstation 2000 Technical Manual.

markpizz commented 5 years ago

sim> set console debug=stdout

Wow. I haven't seen or used that command syntax in forever. "set console debug=XXXX" is the earlier form of the equivalent "SET DEBUG XXXX" command.

The earlier syntax still works, but it is deprecated.

Looking through the vax_xs code, I see the potential for incoming packet corruption and/or packet loss. Specifically, the loop which fills buffer descriptors in system memory stops filling them when all of them are full. This is good. What is not good is that if only part of a packet has made it into some buffer descriptors, but the whole packet hasn't been completed, the received packet which had partial data is discarded and the partial packet is left in the receive buffer descriptors. How the OS would make sense of the partial packet is one question, and the dropped packet would certainly require TCP to recover it.

Unlike the hardware in early Ethernet devices, there is no rush to empty received buffers from the sim_ether layer. Real hardware could easily miss incoming back-to-back packets quite often. The sim_ether layer will not lose data unless it isn't serviced for at least several seconds.

Once a packet has been received via eth_read(), care should be taken to make sure that all of that data makes it into the simulated OS memory without discarding any packets or any part of a packet. If the OS driver isn't capable of picking up part of a packet, realizing that the receive descriptor list has been completely consumed, and then waiting for the rest of the partial packet, then the logic which fills receive buffer descriptors needs to look ahead at the available receive buffers to make sure that sufficient space is available before it inserts any part of a packet.
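As a rough illustration of the look-ahead described above, here is a minimal C sketch. The rx_desc_t layout and the rx_descs_needed() helper are hypothetical and simplified; they are not taken from vax_xs.c, and the real LANCE descriptor format differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified receive descriptor; NOT the layout in vax_xs.c. */
typedef struct {
    int      owned_by_device;   /* nonzero: the device may fill this buffer */
    uint32_t buf_len;           /* capacity of the attached buffer in bytes */
} rx_desc_t;

/*
 * Return how many descriptors, starting at ring index 'start', are needed
 * to hold pkt_len bytes, or 0 if the device-owned descriptors run out
 * before the whole packet fits. Nothing is modified, so the caller can
 * decide whether to copy the packet before touching any buffer.
 */
size_t rx_descs_needed(const rx_desc_t *ring, size_t ring_size,
                       size_t start, uint32_t pkt_len)
{
    assert(ring != NULL && ring_size > 0);

    uint32_t room = 0;
    for (size_t n = 0; n < ring_size; n++) {
        const rx_desc_t *d = &ring[(start + n) % ring_size];
        if (!d->owned_by_device)
            return 0;           /* reached a host-owned descriptor */
        room += d->buf_len;
        if (room >= pkt_len)
            return n + 1;       /* packet fits in n + 1 descriptors */
    }
    return 0;                   /* the whole ring is too small */
}
```

With a check like this, a packet returned by eth_read() would simply be held until the function reports a nonzero count, rather than being partially written and then dropped.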

In order to keep packets flowing while they're really available, the scheduling of xs_svc() should be either 1) short duration waits when receive data is available and things are just waiting for a receive buffer descriptor, or 2) clock synchronized polling when data hasn't recently been flowing.
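The two scheduling cases could be expressed roughly as follows. This is only a sketch of the decision itself: the names and the delay values are invented for illustration, and how the result would be handed to the SIMH event queue is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical delay values; real ones would need tuning in the simulator. */
#define RX_RETRY_WAIT  100      /* short wait while a packet is being held */
#define RX_IDLE_POLL   10000    /* stand-in for a clock-synchronized poll */

typedef enum { POLL_SHORT_WAIT, POLL_CLOCK_SYNC } poll_mode_t;

/* Case 1: a received packet is held because no receive descriptor is free.
 * Case 2: nothing is pending, so fall back to clock-synchronized polling. */
poll_mode_t choose_poll_mode(int packet_held)
{
    return packet_held ? POLL_SHORT_WAIT : POLL_CLOCK_SYNC;
}

/* Map the chosen mode to the delay before the next service call. */
int32_t poll_delay(poll_mode_t mode)
{
    if (mode == POLL_SHORT_WAIT)
        return RX_RETRY_WAIT;
    assert(mode == POLL_CLOCK_SYNC);
    return RX_IDLE_POLL;
}
```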

agn453 commented 5 years ago

Unfortunately I haven't had time to look at this until now.

I can confirm packet corruption on transmit from the simulator by using ftp to transfer a text file from the simulated machine (I compared MULTINET TCPDUMP output on the simulator with Wireshark's capture of the transfer). Transferring files into the simulator using ftp seems to succeed though.

The file I'm transferring is a text file of 80-character lines, each containing its line number and filled out with spaces. You can see the line count in the second 1498 byte packet jump from line 23 to 207 in the attached screenshot.

[Screenshot: Screen Shot 2019-05-30 at 11 17 28 am]
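For anyone wanting to repeat the test, a numbered-line file of this kind and a check for jumps in the numbering could be sketched in C as below. The exact line format (number left-justified, space-padded to 80 columns, newline terminated) is an assumption, and make_line() and first_gap() are hypothetical helpers, not part of SIMH or Multinet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_WIDTH 80

/* Write line number n, left-justified and space-padded to 80 columns,
 * followed by '\n'. 'out' must have room for LINE_WIDTH + 2 bytes. */
void make_line(char *out, unsigned n)
{
    assert(out != NULL);
    snprintf(out, LINE_WIDTH + 2, "%-*u\n", LINE_WIDTH, n);
}

/* Scan a buffer of such lines. Return the line number that precedes the
 * first break in the sequence, or 0 if the numbering is consecutive. */
unsigned first_gap(const char *data, size_t len)
{
    unsigned prev = 0;
    for (size_t off = 0; off + LINE_WIDTH + 1 <= len; off += LINE_WIDTH + 1) {
        unsigned n = (unsigned)strtoul(data + off, NULL, 10);
        if (prev != 0 && n != prev + 1)
            return prev;        /* numbering jumped after line 'prev' */
        prev = n;
    }
    return 0;
}
```

Run over the received copy of the file, first_gap() would point at the last good line before a jump such as the one from 23 to 207.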

I'll look at the code in vax_xs.c some more - in particular the transmit ring buffer processing.

9track commented 5 years ago

Similarly, I haven't looked at this for about a week but should be able to get back to it soon. I've been looking closely at the receive side and, as far as I can tell from comparing the debug output against Wireshark traces, no packets are being dropped and they are all being transferred to VAX memory. I haven't looked so closely at the transmit side yet, so maybe there are some problems there.

agn453 commented 5 years ago

I've gone through the receive side too and not seen anything amiss (and I've not seen the 32 entry receive ring buffer become full).

The transmit ring buffer is only 8 entries - and as far as I can see it doesn't become full either. My suspicions are around the TXR_OWN bit in the status register and how the VMS Ethernet driver determines it can transmit another frame. Matt - maybe you can investigate this from the VMS sources. Is the transmit happening too fast? Perhaps the write callback routine is where the TXR_OWN bit should be toggled (after a successful frame write).
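To make the last suggestion concrete, here is a small C sketch of deferring the ownership hand-back until the write has completed. Everything in it is hypothetical: the structures, the function names and the TXR_OWN value are chosen for illustration and do not reflect how vax_xs.c or the sim_ether callbacks are actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXR_OWN 0x8000u     /* ownership bit; value assumed for this sketch */

typedef struct {
    uint16_t status;        /* hypothetical descriptor status word */
} tx_desc_t;

typedef struct {
    tx_desc_t *pending;     /* descriptor whose frame is still in flight */
} tx_state_t;

/* Begin transmitting the frame described by d. The OWN bit is left set,
 * so the host driver still sees the descriptor as busy.
 * Returns 1 if the transmit was started, 0 if it could not be. */
int tx_start(tx_state_t *st, tx_desc_t *d)
{
    assert(st != NULL && d != NULL);
    if (st->pending != NULL || !(d->status & TXR_OWN))
        return 0;           /* a frame is in flight, or not ours to send */
    st->pending = d;
    return 1;
}

/* Write-complete callback: only now is the descriptor given back. */
void tx_write_done(tx_state_t *st)
{
    assert(st != NULL);
    if (st->pending == NULL)
        return;
    st->pending->status &= (uint16_t)~TXR_OWN;
    st->pending = NULL;
}
```

Whether this ordering matches what the VMS driver expects is exactly the open question; the sketch only shows where the bit would be cleared.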