pkoning2 / decstuff

Various bits of data and code related to Digital Equipment Corporation
BSD 2-Clause "Simplified" License

Multiple problems with NTP broadcast client #1

Open Terry-Kennedy opened 1 month ago

Terry-Kennedy commented 1 month ago

I copied your NTP.TSK file to [0,123] on RSTS/E 10.1. I then copied the New_York timezone file to [0,123]tz.dat and added "$ @[0,123]START.COM" to my system START.COM. Also, all of your other patch .CMD files were applied to the system, including the XE/XH ones. The NTP broadcast server is a Cisco router running IOS 15.7.3(M8) and has the correct time.

Shortly after system startup was complete, OPCOM displayed the following (greater-than/less-than signs changed to dashes, because GitHub interprets a greater-than sign as a quote and indents it):

--------------- OMS V10.1-A 30-Aug-24 11:50 PM ---------------
Message 61 from NTP, user [1,2], Detached, job 2
Time updated to 30-Aug-2024 11:50:16.98 pm (0:00), stratum 3, source dKj|W^EPW^Cz

There are three problems:

1) Despite my New_York timezone being configured, it changed the clock from the expected 7:50 PM to 11:50 PM. That's GMT.

2) There's gibberish at the end of the message. If I rebuild from sources with PDP-11 C 1.2, it is at the beginning: (j|PlZ^Fated to 30-Aug-2024 11:25:15.53 pm (0:00), stratum 3, source L

3) After the time change OPCOM message, I started getting a series of:

Event type 5.15, Receive failed
Occurred 17-Aug-37 22:58:46.9 on node 35.119 (PIDP11)
Circuit UNA-0
Failure reason =User buffer unavailable
Ethernet header = FFFFFFFFFFFF ECF4BBECCC20 0800

on the console, until they ended with:

Event type 0.0, Event records lost
Occurred 17-Aug-37 22:58:46.9 on node 35.119 (PIDP11)

Ethernet type 0800 is IPv4, and since NTP is the only thing that is trying to do IP on this system, I suspect the two are related. The Ethernet source addresses in the messages are valid.

These "Receive failed" messages happen regardless of whether I use your pre-built NTP.TSK or the one I built.

Remote access to the system is available if you need to diagnose this.

Terry-Kennedy commented 1 month ago

It has been a very long time (35+ years) since I did any serious work with RSTS/E, so some of these may be my error.

Since the README.md points out that the timezone file should be transferred in binary mode, I realized that the New_York timezone file I was working with was in ASCII, and maybe this NTP utility wanted the binary version (found in /etc/localtime on modern *BSD systems). I FTP'd it from the FreeBSD box to a VMS system (in binary mode), copied it to the RSTS/E system over DECnet, and installed it in [0,123]. Trying to start NTP then results in an "open: task did not link in support for requested i/o operation" error, because anything you get from a VMS system is going to have RMS attributes, which are optional on RSTS/E, IIRC.

So, I added a /RMS to the end of the link line in BUILD.COM to drag in the RMS libraries. Unfortunately, after detaching, the task dies with:

??Memory management trap 000000 071102 001742 102220 000011 160000 001666 071120 174000

That apparently isn't the solution for working with the timezone file.

A general question of mine is "How do you get a file from VMS to RSTS/E over DECnet without adding RMS attributes?" or "How do I strip off the RMS attributes of a file already on RSTS/E?"

The next question is exactly what kind of timezone file NTP.TSK wants and where it can be found.

At this point I'd be happy with a logical that defines the offset from GMT and ditching the timezone support completely if it moves things along.
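
For reference, the arithmetic a fixed offset from GMT would involve is small. The following is a minimal host-side sketch in C, not code from NTP.TSK: the function name `ntp_to_local_string` and its parameters are hypothetical, and it assumes a C99 compiler with a 64-bit `time_t` rather than PDP-11 C. It shifts the seconds field of an NTP timestamp from the 1900 epoch to the 1970 epoch, applies the offset, and formats the result without any timezone file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01). */
#define NTP_TO_UNIX_SECONDS INT64_C(2208988800)

/*
 * Format the seconds field of an NTP timestamp as local time, given a fixed
 * offset from GMT in seconds. Hypothetical helper for illustration only.
 * Returns the number of characters written, or 0 on failure.
 */
size_t ntp_to_local_string(uint32_t ntp_seconds, long gmt_offset_seconds,
                           char *out, size_t out_len)
{
    time_t shifted;
    struct tm *fields;

    assert(out != NULL && out_len > 0);

    /* Move to the Unix epoch, then apply the fixed offset. */
    shifted = (time_t)((int64_t)ntp_seconds - NTP_TO_UNIX_SECONDS
                       + (int64_t)gmt_offset_seconds);

    /* gmtime() performs no zone lookup, so the offset above is the only one applied. */
    fields = gmtime(&shifted);
    if (fields == NULL)
        return 0;

    return strftime(out, out_len, "%Y-%m-%d %H:%M:%S", fields);
}
```

Called with the transmit timestamp seconds from the packet capture later in this thread (3934566842) and an offset of -4 hours (`-4L * 3600L`), this yields `2024-09-05 19:14:02`. With an offset of 0 the same call gives 23:14:02, the same four-hour gap as in problem 1 above. A fixed offset ignores daylight saving transitions, which is the part the timezone file handles.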

pkoning2 commented 3 weeks ago

Sorry for the delayed answer. A couple of comments.

  1. The "gibberish" is the value sent by the NTP server as the "source" field. That's supposed to be a text string as far as I can tell. Apparently Cisco doesn't know that.
  2. The timezone file is a binary file. The text file you're referring to may be a "zic" source file, which is a textual listing of the history of time zone change rules for a given location. That's run through a compiler and the binary output is what NTP.TSK, and for that matter the Unix time handling library (like "localtime"), use to find the current offset. On a Linux system (and probably other Unices) you'll find the entire set of known timezone files in /usr/share/zoneinfo, and /etc/localtime is a symlink to one of those.
  3. I don't remember how I copied mine. Perhaps using "rstsflx" to transfer the file from my Linux host system directly to the emulated disk container file my RSTS system boots from. That (in block mode) will certainly do the job.

I'm thinking about dropping the C version since DEC C is way too much annoying baggage, and rewriting this utility in Forth.

Terry-Kennedy commented 3 weeks ago

Answering the above in the same order:

1) Apparently "Unix" NTP does the same thing:

(1:117) ns0:/usr/ports# ntpdc
ntpdc> version
ntpdc 4.2.8p18-a (1)

Here is the message from NTP.TSK:

>>>>>>>>>>>>>>>  OMS V10.1-A  05-Sep-24 07:14 PM  <<<<<<<<<<<<<<<
Message 3 from NTP, user [1,2], Detached, job 7
(j8\aupdated to  5-Sep-2024  7:14:02.40 pm EDT (-4:00), stratum 3, source L
2

And here is a tcpdump of the packet that triggered it:

tcpdump -s 1500 -X -vv port ntp and host ns0.ispnet.net
tcpdump: listening on ix0, link-type EN10MB (Ethernet), capture size 1500 bytes
19:14:02.392376 IP (tos 0x0, ttl 64, id 36156, offset 0, flags [none], proto UDP (17), length 76)
    ns0.ispnet.net.ntp > 204.141.35.255.ntp: [udp sum ok] NTPv2, length 48
        Broadcast, Leap indicator:  (0), Stratum 3 (secondary reference), poll 6 (64s), precision -24
        Root Delay: 0.043075, Root dispersion: 0.037429, Reference-ID: ns3.ispnet.net
          Reference Timestamp:  3934566620.381112366 (2024/09/05 19:10:20)
          Originator Timestamp: 0.000000000
          Receive Timestamp:    0.000000000
          Transmit Timestamp:   3934566842.392973423 (2024/09/05 19:14:02)
            Originator - Receive Timestamp:  0.000000000
            Originator - Transmit Timestamp: 3934566842.392973423 (2024/09/05 19:14:02)
        0x0000:  4500 004c 8d3c 0000 4011 0cc8 cc8d 2383  E..L.<..@.....#.
        0x0010:  cc8d 23ff 007b 007b 0038 6e06 1503 06e8  ..#..{.{.8n.....
        0x0020:  0000 0b07 0000 0995 cc8d 2887 ea84 b8dc  ..........(.....
        0x0030:  6190 94b2 0000 0000 0000 0000 0000 0000  a...............
        0x0040:  0000 0000 ea84 b9ba 6499 e7ca            ........d...

2) I used /etc/localtime from a FreeBSD 13.4 system, which is a timezone file zic'd for my local time zone. This now works, and the system time is being set to the correct time for my timezone.

3) I ended up building and using rstsflx to do this on a FreeBSD 13.4 system with an ancient gcc34 I've been dragging around for compiling old code. rstsflx does not build successfully on a Raspberry Pi using Debian Bullseye (bug filed against rstsflx over there).

x) (Not addressed in your response) I still get lots of reported DECnet "Receive failed" for broadcast IP packets after NTP is started:

Event type 5.15, Receive failed
Occurred 05-Sep-24 19:22:27.9 on node 35.119 (PIDP11)
Circuit UNA-0
Failure reason =User buffer unavailable
Ethernet header = FFFFFFFFFFFF ECF4BBECCC20 0800

Until the system finally gives up with:

Event type 0.0, Event records lost
Occurred 05-Sep-24 19:22:27.9 on node 35.119 (PIDP11)

This doesn't seem to be caused by the task exiting (which should remove the filter anyway):

$ attach 7
Attaching to job 7 
7 PIDP11::KB0    NTP+...RSX    SL           17(64)K+0K      0.2(+0.2) -4

pkoning2 commented 3 weeks ago

User buffer unavailable would happen if there are more broadcast messages on the wire than my NTP can receive and dispose of. Do you have a mad broadcaster? If not, it would have to be something else, but I don't see it here. Will keep digging.

Thanks for the wire capture, I'll see what that tells me. The message parsing may be too primitive.

On rstsflx, you said you filed a bug "there" -- where is that? I need to sort out the versions; I think some old ones are hanging around. I also have a Python one that is better in most ways and needs to get published. For now you can find it here: svn://akdesign.dyndns.org/flx/trunk but I should probably move that to GitHub, open-simh/simtools most likely.

pkoning2 commented 3 weeks ago

Might there be broadcasts that are longer than 568 or so bytes? That would be an "oversized packet" to RSTS, I think, and while those are ignored they cause another pass through the top loop so the handling may not be quite right.
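
As a rough illustration of what "receive and dispose of" means per frame, here is a sketch in C of a filter that decides whether a received broadcast is worth parsing. It is not the NTP.TSK receive loop. The name `is_ntp_broadcast` is hypothetical, and the sketch assumes the buffer starts with the 14-byte Ethernet header, which may not match what the RSTS/E Ethernet driver actually hands to a task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HEADER_LEN 14u   /* destination, source, type */
#define IP_MIN_HEADER  20u   /* IPv4 header without options */
#define UDP_HEADER_LEN  8u
#define NTP_MIN_LEN    48u   /* NTP header without extensions */
#define NTP_PORT      123
#define NTP_MODE_BROADCAST 5

/* Return 1 if the frame looks like an NTP broadcast, 0 if it can be dropped. */
int is_ntp_broadcast(const uint8_t *frame, size_t len)
{
    const uint8_t *ip, *udp;
    size_t ip_header_len;

    assert(frame != NULL);

    if (len < ETH_HEADER_LEN + IP_MIN_HEADER + UDP_HEADER_LEN + NTP_MIN_LEN)
        return 0;                       /* too short to hold an NTP packet */
    if (frame[12] != 0x08 || frame[13] != 0x00)
        return 0;                       /* Ethernet type is not 0800 */

    ip = frame + ETH_HEADER_LEN;
    if ((ip[0] >> 4) != 4 || ip[9] != 17)
        return 0;                       /* not IPv4 carrying UDP */
    if (((ip[6] & 0x1f) | ip[7]) != 0)
        return 0;                       /* not the first fragment */

    ip_header_len = (size_t)(ip[0] & 0x0f) * 4u;
    if (ip_header_len < IP_MIN_HEADER ||
        len < ETH_HEADER_LEN + ip_header_len + UDP_HEADER_LEN + NTP_MIN_LEN)
        return 0;                       /* bad or truncated IP header */

    udp = ip + ip_header_len;
    if (((udp[2] << 8) | udp[3]) != NTP_PORT)
        return 0;                       /* wrong UDP destination port */

    /* The low three bits of the first NTP octet hold the mode. */
    return (udp[UDP_HEADER_LEN] & 0x07) == NTP_MODE_BROADCAST;
}
```

The cheap checks come first so that unrelated broadcasts are rejected after a handful of byte comparisons. Whether that is enough to keep a receive buffer posted at all times depends on how the receive loop is structured, which this sketch does not address.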

Terry-Kennedy commented 3 weeks ago

I wouldn't say there's a mad broadcaster, but there is a legitimate amount of IP broadcast traffic as this is a /24 which is pretty full of IP hosts doing ARP, etc. A "tcpdump -vv ether dst ff:ff:ff:ff:ff:ff" shows that almost all of it is expected ARP traffic, with occasional UDP broadcasts from some clients. Too sensitive to include here, but email me out-of-band if you need to see it.

"There" being what I thought was the official repository: https://github.com/simh/simtools/tree/master/extracters/rstsflx

Terry-Kennedy commented 3 weeks ago

No, the ARP requests seem to be all length 46.

pkoning2 commented 3 weeks ago

Nothing in the "simh" repository is official; that's Mark's repository. The one in open-simh/simtools is its open source replacement. What's there is actually a snapshot; right now my current code is in Subversion on a server I keep, as I mentioned above for the Python version. The V2.6 code is in the same place, but in branches/V2.6 instead of "trunk". It's time to move them to GitHub, though.

What, roughly, would you say the rate of broadcasts is? 100 per second? Less than one per second?

I found the "garbage" issue. The "refid" field is a string for stratum 1, which is what my time server is (I have a GPS source). But for stratum 3 it's an IP address, and I missed that detail. Will fix. As I mentioned, my inclination is to drop this C version and do a Forth version instead, now that my Forth has been refreshed.
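
The distinction described above can be sketched in C as follows. This is an illustration, not the actual fix to NTP.TSK: the name `format_refid` and its signature are hypothetical, and it assumes a C99 host compiler. For stratum 0 and 1 the four refid bytes are treated as zero-padded ASCII; for stratum 2 and above they are printed as an IPv4 address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render the 4-byte NTP reference identifier as text.
 * Hypothetical helper for illustration only.
 */
void format_refid(unsigned stratum, const uint8_t refid[4],
                  char *out, size_t out_len)
{
    assert(refid != NULL && out != NULL && out_len > 0);

    if (stratum <= 1) {
        size_t i, n = 0;

        /* Copy up to four characters, stopping at the zero padding and
           replacing anything unprintable with '?'. */
        for (i = 0; i < 4 && refid[i] != 0 && n + 1 < out_len; i++)
            out[n++] = (refid[i] >= 0x20 && refid[i] < 0x7f)
                           ? (char)refid[i] : '?';
        out[n] = '\0';
    } else {
        /* Secondary servers: the bytes are the upstream server's IPv4 address. */
        snprintf(out, out_len, "%u.%u.%u.%u",
                 (unsigned)refid[0], (unsigned)refid[1],
                 (unsigned)refid[2], (unsigned)refid[3]);
    }
}
```

With the refid bytes from the capture earlier in the thread (cc 8d 28 87) and stratum 3, this produces `204.141.40.135`, which tcpdump displays as a resolved host name. Printing those same bytes as characters is what produces the unreadable "source" text in the OPCOM message.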

Terry-Kennedy commented 3 weeks ago

Yeah, there's the "Will the real simh please stand up?" problem... Anyway, they seem to be the same except for a fix in the simh repo but not in the open-simh one.

Broadcast counts per second, over a 10-second period: 24, 14, 21, 29, 15, 13, 24, 27, 26, 19. Almost entirely ARP requests.

Would it be possible to fix the garbage problem in the C version? Do you think performance (ability to receive and dispose of uninteresting broadcast packets) would be better or worse in Forth?
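
For scale, the per-second counts quoted above can be reduced to an average rate. The sketch below is plain arithmetic in C; the names `sample_counts` and `mean_per_second` are hypothetical and only the ten counts come from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Broadcast frames seen in each of ten consecutive seconds (from the comment above). */
const unsigned sample_counts[10] = {24, 14, 21, 29, 15, 13, 24, 27, 26, 19};

/* Mean number of frames per second over n one-second samples. */
double mean_per_second(const unsigned *counts, size_t n)
{
    unsigned long total = 0;
    size_t i;

    assert(counts != NULL && n > 0);

    for (i = 0; i < n; i++)
        total += counts[i];
    return (double)total / (double)n;
}
```

The ten samples sum to 212, so the mean is 21.2 frames per second, or roughly 47 ms between frames on average, with a peak of 29 in one second. These are averages only; they say nothing about how closely spaced individual frames are within a burst.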