PDP10: (KS10) no longer builds ITS

larsbrinkhoff commented 3 years ago

The scripted ITS build (https://github.com/PDP-10/its) no longer works with the PDP10 simulator (KS10). I'm testing commits from the version history since the last known working revision. I will provide more details below as I come across them.

larsbrinkhoff commented 3 years ago

This table describes my progress, and will be continuously edited.

Date	Commit	Status	Error
2018-09-11	e1db7de	good
2019-01-22	efd52d9	good
2019-05-02	758bfb7	good
2019-05-28	48db109	good
2019-05-29	6a131ec	good
2019-05-29	45d8c90	BAD	Crash in MARK <- CULPRIT
2019-05-30	564ce2b	BAD	Crash in MARK
2019-06-04	53ad66f	BAD	Crash in MARK
2019-06-23	9c2c621	BAD	Crash in MARK
2019-08-12	d6f9c58	BAD	Crash in MARK
2019-08-25	9539b62	BAD	Undefined reference to pcap_lib_version
2019-09-03	7398e63	BAD	Crash in MARK
2020-03-01	f070150	BAD	Crash in MARK
2020-07-09	e2d0095	BAD	Crash in MARK
2021-01-18	9f064db	BAD	Hangs after NITS boot

larsbrinkhoff commented 3 years ago

"Crash in MARK" looks like this. The simulator is booted off tape and runs a program to format an RP06 disk pack.

This problem was introduced in May 2019.


sim> b tu1
ITS MTBOOT.176
MARK$G'
Format pack on unit #0
Drive 0 offline
Disk 0 (unit 0) is losing.
RH-11 disk ctl status:  I = 0; Current drive is #0, which is virtual unit #0
Disk Attention Summary: 0000000000000001
Drive Status = (140400)DIFF LT 64  DRIVE-PRESENT  ERR  ATTENTION  
Ctrl & Status 1 = (104222)READY  DRIVE-AVAILABLE  SPECIAL-CONDITION  
Ctrl & Status 2 = (300)SILO-INPUT-READY  SILO-OUTPUT-READY  
Offset = (0)        
Error 1 = (40001)  ILL-FUNC  UNSAFE  
Error 2 = (0)       
Error 3 = (0)       
Word Count = 0., Unibus Address = 0
Desired Cyl = 0., Current Cyl = 0.
Desired Addr = 0  (Track 0.  Sector 0.)
Do you want to see all the UNIBUS registers  (Y or N)

markpizz commented 3 years ago

I don't know what you're doing with that table. A git bisect walks through ranges of commits to find the point where a change broke something. The only interesting result is the specific commit that caused the first failure. Several of the commits you've listed merely change comments or do things that couldn't possibly influence simulator behavior.
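For reference, the bisect idea can be sketched as a binary search over an ordered commit list. This is only an illustration, not how git implements it: `first_bad`, `is_good_fn`, and `table_is_good` are hypothetical names, and the predicate simply encodes the 15 table rows above (rows 0 through 4 good, 45d8c90 at row 5 the first bad one). It assumes that once a commit is bad, every later commit is bad too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "build this commit and run the ITS build".
 * Returns nonzero when the commit at index i is good. */
typedef int (*is_good_fn)(size_t i);

/* Illustration only: encodes the table above, where rows 0..4 are good
 * and row 5 (45d8c90) is the first bad commit. */
int table_is_good(size_t i)
{
    return i < 5;
}

/* Return the index of the first bad commit among n ordered commits.
 * Assumes commit 0 is good, commit n-1 is bad, and that every commit
 * after the first bad one is also bad. */
size_t first_bad(size_t n, is_good_fn is_good)
{
    size_t good = 0;      /* highest index known to be good */
    size_t bad = n - 1;   /* lowest index known to be bad */

    assert(n >= 2);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_good(mid))
            good = mid;   /* the break happened after mid */
        else
            bad = mid;    /* the break happened at or before mid */
    }
    return bad;
}
```

Calling `first_bad(15, table_is_good)` tests four commits rather than all fifteen and lands on row 5.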

larsbrinkhoff commented 3 years ago

What I'm doing with the table is to post notes as I make progress.

The commit 45d8c908ba1f153522d9575e373599f1ff19bae8 introduced the "Crash in MARK" problem. Since it's related to simulation timing, I will check if a build script delay fixes the problem.

I will also continue to bisect in the later range of the table since I saw a different problem there.

markpizz commented 3 years ago

It will always be hard, if not impossible, for an external program (expect in this case) to interact precisely and consistently with a simulator. This is why SCP provides the EXPECT and SEND commands since they are precisely timed within the simulator using instruction/cycle progress. I'll be glad to add more functionality to SCP to extend this in other ways.

larsbrinkhoff commented 3 years ago

I don't think the exact timing of input is important here. Or at least it shouldn't be.

I have found that by putting in a delay where the "Crash in MARK" problem would usually appear, the error will not be triggered. However, there will still be a problem later in the process when ITS is running. So the delay is only a bandaid at best and possibly only papers over the underlying problem temporarily. Still, maybe it provides some clue.

I don't fully understand the commit 45d8c90, or why it impacts ITS and its tooling. I will work on making a small self-contained test case and post it here.

larsbrinkhoff commented 3 years ago

Here is a zip file with some inputs. Run the pdp10 simulator with the configuration error.simh. It will attach a tape to tu1 and boot from it. I have used SIMH expect/send commands to mimic the TCL expect script I use. test.zip

Now, on one of my machines this doesn't display the expected error with the latest commit 9f064db, but it does with 45d8c90. Here is a Travis CI run with 9f064db:

https://travis-ci.org/github/PDP-10/its/jobs/755157715#L1117

So there seems to be a random and/or machine-dependent component to this behavior.

markpizz commented 3 years ago

That "error.simh" script doesn't run without further inputs (questions answered and/or additional other attached files).

PDP-10 simulator V4.0-0 Current        git commit id: 9f064db5+uncommitted-changes
sim> do error.simh
/mnt/c/Users/Mark/Downloads/error.simh-3> at tu1 salv.tape
TU1: Tape Image 'salv.tape' scanned as SIMH format
/mnt/c/Users/Mark/Downloads/error.simh-5> at rp0 rp0.dsk
RP0: creating new file: rp0.dsk
ITS MTBOOT.176
MARK$G'
Format pack on unit #0
Are you sure you want to format pack on drive # 0 (Y or N) Y
Pack no ?

So, comparing that to the travis output seems like apples and oranges...

larsbrinkhoff commented 3 years ago

Do you mean rp0.dsk? It's not needed. What you see here is the same as my machine when running the 9f064db pdp10. Could you also try 45d8c90, please? Even though it doesn't reproduce the problem with the latest master commit, it could be interesting information. If your machine behaves the same as mine, maybe I can continue to search for a test case.

The Travis output is for the full TCL expect script, so yes it's apples and oranges. But for 45d8c90 the error message is the same on my machine.

larsbrinkhoff commented 3 years ago

To clarify, "Are you sure you want to format pack on drive # 0 (Y or N)" means you did not get the error.

markpizz commented 3 years ago

I get the same results on both the latest commit and 45d8c90.

markpizz commented 3 years ago

Oops. Git protected me from myself (I had some changes in the working directory and thus it didn't let me switch to 45d8c90).

It fails with 45d8c90. However, since it DOES NOT fail with the latest commit, I see little value in determining why/when it got fixed. This falls into the "if it ain't broke, don't fix (or screw with) it" category.

larsbrinkhoff commented 3 years ago

Thanks for testing.

However, it's not a case of "if it ain't broke, don't fix (or screw with) it". It does break, just not on your computer. Mine behaves differently, and some other machines behave in yet another way.

larsbrinkhoff commented 3 years ago

And also, the latest commit breaks the TCL script in another place, on all machines that I have seen.

markpizz commented 3 years ago

Like I've said previously, I don't believe that an external script can reliably work. Such a script will be more reliable the slower it interacts with the simulated environment. The "distance" within the host OS and the separate processes not being in lock step will always be problematic.

larsbrinkhoff commented 3 years ago

Is it your conclusion that the build breaks because the external script is unreliable? I don't think so. The script just automates user input, and has delays in the input to simulate typing. The script worked reliably before 45d8c90, and it still works reliably on two other simulators: Cornwell's KA10/KL10, and Harrenstien's KLH10.

markpizz commented 3 years ago

Given that there is more than one process, with no hard synchronization between them, on systems which may have a single processor or multiple ones, precise interprocess behavior can't possibly produce the same results everywhere and every time. Things will vary based on numerous factors which may come and go on the host system.

larsbrinkhoff commented 3 years ago

Is it your opinion that the problem can't be fixed, and this issue should be closed?

markpizz commented 3 years ago

If you can demonstrate failure with SCP commands, I'll dig in and fix what can be demonstrated; otherwise the TCL code is your problem to adjust wherever needed to get it to work.

Before you try and go down that path, try this change to the fei_svc routine in pdp10_fe.c and see if that changes the behavior:

t_stat fei_svc (UNIT *uptr)
{
int32 temp;

sim_clock_coschedule (uptr, tmxr_poll);                 /* continue poll */

if (M[FE_CTYIN] & FE_CVALID)                            /* previous character still pending? */
    return SCPE_OK;                                     /* wait until it gets digested */

temp = sim_poll_kbd ();                                 /* get possible char or error? */
if (temp < SCPE_KFLAG)                                  /* no char or error? */
    return temp;
if (temp & SCPE_BREAK)                                  /* ignore break */
    return SCPE_OK;
uptr->buf = temp & 0177;
uptr->pos = uptr->pos + 1;
M[FE_CTYIN] = uptr->buf | FE_CVALID;                    /* put char in mem */
apr_flg = apr_flg | APRF_CON;                           /* interrupt KS10 */
return SCPE_OK;
}

larsbrinkhoff commented 3 years ago

I tried to make an SCP script to mimic the TCL script that demonstrates the "other" (I don't know if it's the same or something else) problem. But I had some problems making EXPECT/SEND do what I wanted. I can post my script here later and maybe you can see what's wrong.

Thanks, I'll try the fei_svc code.

markpizz commented 3 years ago

My suggested code change was made without precise knowledge of how the various PDP10 operating systems actually flag that the prior received character has been received and processed. If this works better, then it proves my point that driving the simulator with TCL from a separate process will be unreliable due to lack of synchronization. This change adds a degree of synchronization for just this activity in just this simulator, and thus the problem remains for every other potential synchronization activity and simulator.

Your suggestion that the TCL script seemed to work reliably for some other simulators suggested that I think about how they were different, and so I looked at Rich's console port simulation. He had adopted a model which I had added to several other simulators for their console I/O.

Bob's original console I/O behaved as real hardware did, which presented console port data into the simulator exactly when it was received. On a real serial port, characters arrived no faster than the console port speed. That gave the OS plenty of time to process each one without any special need for synchronization, and if new data ever arrived before the prior data was processed, it merely overwrote the prior data (maybe, or maybe not, indicating an overrun).

When input to the session is arriving over some modern data path (often a socket), the network layer will endeavor to coalesce multiple characters into the same network packet. This can easily cause new data to overrun previously arrived data. The SCP EXPECT and SEND activities carefully synchronize data arrival and thus explicitly avoid this problem.

Meanwhile, you might wonder what the motivation to hold off accepting input in some simulator console I/O actually was. Well, with simulator consoles in telnet sessions OR in the context of the SCP process session, users today are more than used to copying things between windows with cut and paste. Supporting cut-and-paste is the motivation to synchronize the console input with the simulated OS.
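To make the difference concrete, here is a minimal sketch of the two delivery models for a one-character console input cell. It is not the simulator's code: `struct console`, `CHAR_VALID`, and the function names are made up for illustration, and the flag value is an assumption rather than the real FE_CVALID.

```c
#include <assert.h>
#include <stddef.h>

#define CHAR_VALID 0400   /* hypothetical "character pending" flag bit */

/* One-character console input cell; all names here are illustrative. */
struct console {
    int cell;       /* pending character | CHAR_VALID, or 0 when empty */
    int overruns;   /* characters lost because the cell was still full */
};

/* Real-hardware style: deliver immediately, overwriting anything pending. */
void deliver_overwrite(struct console *c, int ch)
{
    assert(c != NULL);
    if (c->cell & CHAR_VALID)
        c->overruns++;                  /* the previous character is lost */
    c->cell = (ch & 0177) | CHAR_VALID;
}

/* Hold-off style: refuse delivery while the previous character is pending.
 * Returns 1 if accepted, 0 if the caller must retry on a later poll. */
int deliver_holdoff(struct console *c, int ch)
{
    assert(c != NULL);
    if (c->cell & CHAR_VALID)
        return 0;
    c->cell = (ch & 0177) | CHAR_VALID;
    return 1;
}

/* The simulated OS consumes the pending character; returns -1 when empty. */
int os_read(struct console *c)
{
    int ch;

    assert(c != NULL);
    if (!(c->cell & CHAR_VALID))
        return -1;
    ch = c->cell & 0177;
    c->cell = 0;
    return ch;
}
```

With two characters arriving back to back and no read in between, `deliver_overwrite` leaves only the second one in the cell, while `deliver_holdoff` keeps the first and makes the sender wait, which is what pasted input needs.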

So, did it work??

rcornwell commented 3 years ago

On the PDP10-KA and KI, sim_poll_kbd is called at intervals whenever the PDP10 has no pending characters. On the KL things get rather complex, depending on the protocol being used, but there is a ring buffer that is 256 characters deep. It is still possible for SIMH SEND to overrun this buffer if the FE is in secondary (or boot) protocol. SIMH SEND can also overwhelm KA/KI input if the monitor is in ONCE code.
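As a rough picture of how a fixed-depth buffer gets overrun, here is a sketch of a 256-entry ring. It is not the KL front-end code: the struct layout and names are invented, and dropping the newest character when full is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 256   /* depth mentioned above; the layout is an assumption */

struct ring {
    unsigned char data[RING_SIZE];
    size_t head;      /* next slot to write */
    size_t tail;      /* next slot to read */
    size_t count;     /* characters currently buffered */
    size_t dropped;   /* characters rejected because the ring was full */
};

/* Queue one character. Returns 1 if stored, 0 if the ring was full
 * (this sketch drops the new character in that case). */
int ring_put(struct ring *r, unsigned char ch)
{
    assert(r != NULL);
    if (r->count == RING_SIZE) {
        r->dropped++;
        return 0;
    }
    r->data[r->head] = ch;
    r->head = (r->head + 1) % RING_SIZE;
    r->count++;
    return 1;
}

/* Dequeue the oldest character, or return -1 if the ring is empty. */
int ring_get(struct ring *r)
{
    int ch;

    assert(r != NULL);
    if (r->count == 0)
        return -1;
    ch = r->data[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    r->count--;
    return ch;
}
```

If a sender pushes 300 characters before the consumer reads any, the first 256 are buffered and the remaining 44 are counted in `dropped`.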

Now Bob's KS10 code polls the input regardless of whether the CPU is ready to receive any input. This could be the problem. However, this is probably what would have happened on real hardware, since I do not believe that the 8080 console UART had any FIFO.

Now SCP SEND is far from perfect, it is very easy to overrun input even when they are not polling. Also characters can be dropped.

What I would like to see, if you plan to enhance SCP Expect/Send, is something closer to Expect/Send syntax, where you either send a chunk and then allow several expects, operations.

larsbrinkhoff commented 3 years ago

@markpizz, I tested your fei_svc but it didn't make any noticeable difference.

I don't think the problem has anything to do with console I/O. I increased the typing delays in the TCL quite a lot, but still no difference.

markpizz commented 3 years ago

Now SCP SEND is far from perfect, it is very easy to overrun input even when they are not polling. Also characters can be dropped.

See HELP SEND DELAY and HELP SEND AFTER. These set the delay (in instructions/cycles) between delivered characters and the number of instructions/cycles before the first sent character. The default for both is 1000, but each SEND command can have arbitrary values as needed by what is going on, and the default can be easily changed. This behavior is precise. A simulator executing instructions quickly, or slowly due to potential disk I/O waits, will still deliver data precisely into the simulated system.
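As a simplified model of that timing (not SCP's actual code; the function names are hypothetical), the earliest instruction count at which each character of a SEND string is delivered can be written as the AFTER value plus one DELAY per preceding character:

```c
#include <assert.h>
#include <stddef.h>

/* Earliest instruction count at which character `index` (0-based) is
 * delivered, for a SEND issued at instruction count `start`.
 * Simplified model: AFTER before the first character, DELAY between
 * each subsequent pair of characters. */
unsigned long send_deliver_at(unsigned long start, unsigned long after,
                              unsigned long delay, size_t index)
{
    return start + after + delay * (unsigned long)index;
}

/* Instructions from issuing the SEND until the last of `len`
 * characters has been delivered, under the same model. */
unsigned long send_total(unsigned long after, unsigned long delay, size_t len)
{
    assert(len > 0);
    return after + delay * (unsigned long)(len - 1);
}
```

With the defaults of 1000 for both values, a three-character string would be delivered at 1000, 2000, and 3000 instructions after the SEND, independent of how much wall-clock time those instructions take on the host.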

I don't think the problem has anything to do with console I/O. I increased the typing delays in the TCL quite a lot, but still no difference.

As long as this change still works, I'll commit it since it adds cut-and-paste support to this simulator. Right?

larsbrinkhoff commented 3 years ago

Yes, your updated fei_svc works as far as I could see.

larsbrinkhoff commented 3 years ago

The TCL script is careful to limit the rate of console input. I tried to increase the delay between characters to a rather high value, but the failure still happened. This is why I don't think the problem is related to console I/O.

When I experimented with a SIMH script I noticed the same SEND DELAY parameter resulted in different speeds in the "old" working commit compared to the latest commit. That makes me wonder if there has been a change globally in timing? (Since the old version is from May 2019 I don't expect this to be on the top of anyone's mind.)

If timing is slightly different, it could possibly affect disk I/O. And that could certainly affect the ITS build.

larsbrinkhoff commented 3 years ago

Or magtape I/O for that matter.

The current state is that the build goes through the bootstrapping cycle. It boots off a tape, creates a file system on disk, transfers some files from tape to disk, boots ITS, and assembles a few programs including ITS itself. It then tries to reboot and start the new ITS. This is where the build stops, because the new ITS doesn't come up fully.

I'm trying to narrow it down to a smaller reproducible test case without TCL involved, but I haven't succeeded so far.

rcornwell commented 3 years ago

I do use long delays and long after times to get past these things. However, there is a timing issue in SCP; I pointed this out to you months ago. It started with your fix to handle negative time values. This even affects simulators that do not use any timer interrupt. Currently i7090 will not run IBSYS, due to events not occurring when they should.

markpizz commented 3 years ago

What I would like to see, if you plan to enhance SCP Expect/Send, is something closer to Expect/Send syntax, where you either send a chunk and then allow several expects, operations.

The syntax is never going to look like the TCL Expect/Send syntax for various reasons related to legacy requirements for parsing SCP commands. Meanwhile, functionality can be changed or extended and/or command variants can be added to address generally useful functionality. Please describe functionality that you would specifically find useful.

simh / simh

PDP10: (KS10) no longer builds ITS #999