Open larsbrinkhoff opened 3 years ago
This table describes my progress, and will be continuously edited.
Date | Commit | Status | Error |
---|---|---|---|
2018-09-11 | e1db7de | good | |
2019-01-22 | efd52d9 | good | |
2019-05-02 | 758bfb7 | good | |
2019-05-28 | 48db109 | good | |
2019-05-29 | 6a131ec | good | |
2019-05-29 | 45d8c90 | BAD | Crash in MARK <- CULPRIT |
2019-05-30 | 564ce2b | BAD | Crash in MARK |
2019-06-04 | 53ad66f | BAD | Crash in MARK |
2019-06-23 | 9c2c621 | BAD | Crash in MARK |
2019-08-12 | d6f9c58 | BAD | Crash in MARK |
2019-08-25 | 9539b62 | BAD | Undefined reference to pcap_lib_version |
2019-09-03 | 7398e63 | BAD | Crash in MARK |
2020-03-01 | f070150 | BAD | Crash in MARK |
2020-07-09 | e2d0095 | BAD | Crash in MARK |
2021-01-18 | 9f064db | BAD | Hangs after NITS boot |
"Crash in MARK" looks like this. The simulator is booted off tape and runs a program to format an RP06 disk pack.
This problem was introduced in May 2019.
sim> b tu1
ITS MTBOOT.176
MARK$G'
Format pack on unit #0
Drive 0 offline
Disk 0 (unit 0) is losing.
RH-11 disk ctl status: I = 0; Current drive is #0, which is virtual unit #0
Disk Attention Summary: 0000000000000001
Drive Status = (140400)DIFF LT 64 DRIVE-PRESENT ERR ATTENTION
Ctrl & Status 1 = (104222)READY DRIVE-AVAILABLE SPECIAL-CONDITION
Ctrl & Status 2 = (300)SILO-INPUT-READY SILO-OUTPUT-READY
Offset = (0)
Error 1 = (40001) ILL-FUNC UNSAFE
Error 2 = (0)
Error 3 = (0)
Word Count = 0., Unibus Address = 0
Desired Cyl = 0., Current Cyl = 0.
Desired Addr = 0 (Track 0. Sector 0.)
Do you want to see all the UNIBUS registers (Y or N)
I don't know what you're doing with that table. A git bisect walks through ranges of commits to find the point where a change broke something. The only interesting result is the specific commit that caused the first failure. Several of the commits you've listed merely change comments or do things that couldn't possibly influence simulator behavior.
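To make the bisect idea concrete, here is a minimal sketch in C that binary-searches the rows of the table above for the first bad one. The names `commit_result`, `history` and `first_bad` are hypothetical, the rows are treated as if they were the complete linear history (they are only the tested subset), and every kind of failure is lumped together as "bad", which is an assumption that git bisect also relies on.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the progress table: abbreviated commit id and test result. */
struct commit_result {
    const char *id;  /* abbreviated hash as listed in the table */
    int bad;         /* 0 = good, 1 = BAD (any failure) */
};

/* Rows copied from the table, oldest first. */
static const struct commit_result history[] = {
    {"e1db7de", 0}, {"efd52d9", 0}, {"758bfb7", 0}, {"48db109", 0},
    {"6a131ec", 0}, {"45d8c90", 1}, {"564ce2b", 1}, {"53ad66f", 1},
    {"9c2c621", 1}, {"d6f9c58", 1}, {"9539b62", 1}, {"7398e63", 1},
    {"f070150", 1}, {"e2d0095", 1}, {"9f064db", 1},
};

static const size_t history_len = sizeof history / sizeof history[0];

/* Return the index of the first bad row.
 * Precondition: the first row is good, the last row is bad, and
 * no good row follows a bad one. */
size_t first_bad(const struct commit_result *rows, size_t n)
{
    size_t good = 0, bad = n - 1;

    assert(n >= 2 && !rows[good].bad && rows[bad].bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (rows[mid].bad)
            bad = mid;   /* failure is at mid or earlier */
        else
            good = mid;  /* failure was introduced after mid */
    }
    return bad;
}
```

Calling `first_bad(history, history_len)` on these rows lands on the `45d8c90` entry after four probes, which matches the row marked CULPRIT.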
What I'm doing with the table is to post notes as I make progress.
The commit 45d8c908ba1f153522d9575e373599f1ff19bae8 introduced the "Crash in MARK" problem. Since it's related to simulation timing, I will check if a build script delay fixes the problem.
I will also continue to bisect in the later range of the table since I saw a different problem there.
It will always be hard, if not impossible, for an external program (expect in this case) to interact precisely and consistently with a simulator. This is why SCP provides the EXPECT and SEND commands since they are precisely timed within the simulator using instruction/cycle progress. I'll be glad to add more functionality to SCP to extend this in other ways.
I don't think the exact timing of input is important here. Or at least it shouldn't be.
I have found that by putting in a delay where the "Crash in MARK" problem would usually appear, the error will not be triggered. However, there will still be a problem later in the process when ITS is running. So the delay is only a bandaid at best and possibly only papers over the underlying problem temporarily. Still, maybe it provides some clue.
I don't fully understand the commit 45d8c90, or why it impacts ITS and its tooling. I will work on making a small self-contained test case and post it here.
Here is a zip file with some inputs. Run the pdp10 simulator with the configuration error.simh. It will attach a tape to tu1 and boot from it. I have used SIMH expect/send commands to mimic the TCL expect script I use.
test.zip
Now, on one of my machines this doesn't display the expected error with the latest commit 9f064db, but it does with 45d8c90. Here is a Travis CI run with 9f064db:
https://travis-ci.org/github/PDP-10/its/jobs/755157715#L1117
So there seems to be a random and/or machine-dependent component to this behavior.
That "error.simh" script doesn't run without further inputs (questions answered and/or additional other attached files).
PDP-10 simulator V4.0-0 Current git commit id: 9f064db5+uncommitted-changes
sim> do error.simh
/mnt/c/Users/Mark/Downloads/error.simh-3> at tu1 salv.tape
TU1: Tape Image 'salv.tape' scanned as SIMH format
/mnt/c/Users/Mark/Downloads/error.simh-5> at rp0 rp0.dsk
RP0: creating new file: rp0.dsk
ITS MTBOOT.176
MARK$G'
Format pack on unit #0
Are you sure you want to format pack on drive # 0 (Y or N) Y
Pack no ?
So, comparing that to the travis output seems like apples and oranges...
Do you mean rp0.dsk? It's not needed. What you see here is the same as my machine when running the 9f064db pdp10. Could you also try 45d8c90, please? Even though it doesn't reproduce the problem with the latest master commit, it could be interesting information. If your machine behaves the same as mine, maybe I can continue to search for a test case.
The Travis output is for the full TCL expect script, so yes it's apples and oranges. But for 45d8c90 the error message is the same on my machine.
To clarify, "Are you sure you want to format pack on drive # 0 (Y or N)" means you did not get the error.
I get the same results on both the latest commit and 45d8c90.
Oops. Git protected me from myself (I had some changes in the working directory and thus it didn't let me switch to 45d8c90).
It fails with 45d8c90. However, since it DOES NOT fail with the latest commit, I see little value in determining why/when it got fixed. This falls into the "if it ain't broke, don't fix (or screw with) it" category.
Thanks for testing.
However, it's not a case of "if it ain't broke, don't fix (or screw with) it". It does break, just not on your computer. Mine behaves differently, and some other machines behave in yet another way.
And also, the latest commit breaks the TCL script in another place, on all machines that I have seen.
Like I've said previously, I don't believe that an external script can reliably work. Such a script will be more reliable the slower it interacts with the simulated environment. The "distance" within the host OS and the separate processes not being in lock step will always be problematic.
Is it your conclusion that the build breaks because the external script is unreliable? I don't think so. The script just automates user input, and has delays in the input to simulate typing. The script worked reliably before 45d8c90, and it still works reliably on two other simulators: Cornwell's KA10/KL10, and Harrenstien's KLH10.
Because there is more than one process, with no hard synchronization between them, on systems which may have a single processor or multiple ones, precise interprocess behavior can't possibly produce the same results everywhere and every time. Things will vary based on numerous factors which may come and go on the host system.
Is it your opinion that the problem can't be fixed, and this issue should be closed?
If you can demonstrate a failure with SCP commands, I'll dig in and fix what can be demonstrated; otherwise the TCL code is your problem to adjust wherever needed to get it to work.
Before you try and go down that path, try this change to the fei_svc routine in pdp10_fe.c and see if that changes the behavior:
t_stat fei_svc (UNIT *uptr)
{
    int32 temp;

    sim_clock_coschedule (uptr, tmxr_poll);     /* continue poll */
    if (M[FE_CTYIN] & FE_CVALID)                /* previous character still pending? */
        return SCPE_OK;                         /* wait until it gets digested */
    temp = sim_poll_kbd ();                     /* get possible char or error? */
    if (temp < SCPE_KFLAG)                      /* no char or error? */
        return temp;
    if (temp & SCPE_BREAK)                      /* ignore break */
        return SCPE_OK;
    uptr->buf = temp & 0177;
    uptr->pos = uptr->pos + 1;
    M[FE_CTYIN] = uptr->buf | FE_CVALID;        /* put char in mem */
    apr_flg = apr_flg | APRF_CON;               /* interrupt KS10 */
    return SCPE_OK;
}
I tried to make an SCP script to mimic the TCL script that demonstrates the "other" (I don't know if it's the same or something else) problem. But I had some problems making EXPECT/SEND do what I wanted. I can post my script here later and maybe you can see what's wrong.
Thanks, I'll try the fei_svc code.
My suggested code change was made without precise knowledge of how the various PDP10 operating systems actually flag that the prior received character has been received and processed. If this works better, then it proves my point that driving the simulator from TCL in a separate process will be unreliable due to lack of synchronization. This change adds a degree of synchronization for just this activity in just this simulator, and thus the problem remains for every other potential synchronization activity and simulator.
Your observation that the TCL script seemed to work reliably with some other simulators prompted me to think about how they differ, so I looked at Rich's console port simulation. He had adopted a model which I had added to several other simulators for their console I/O.
Bob's original console I/O behaved as real hardware did: it presented console port data to the simulator exactly when it was received. On a real serial port, characters arrive no faster than the console port speed. That gave the OS plenty of time to process each one without any special need for synchronization, and if new data ever arrived before the prior data was processed, it merely overwrote the prior data (possibly, but not necessarily, indicating an overrun). When input to the session arrives over some modern data path (often a socket), the network layer will endeavor to coalesce multiple characters into the same network packet. This can easily cause new data to overrun previously arrived data. The SCP EXPECT and SEND activities carefully synchronize data arrival and thus explicitly avoid this problem.
Meanwhile, you might wonder what actually motivated holding off input in some simulators' console I/O. With simulator consoles in telnet sessions, or in the context of the SCP process session, users today are used to copying things between windows with cut and paste. Supporting cut-and-paste is the motivation for synchronizing console input with the simulated OS.
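The difference between the two console models can be sketched as a one-character slot with a "pending" flag. This is only an illustration under stated assumptions: `console_slot`, `SLOT_VALID` and the three functions are hypothetical names, not SIMH code, and the flag merely plays the role that FE_CVALID plays in the fei_svc suggestion above.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_VALID 0400u  /* hypothetical "character pending" flag */

/* Hypothetical one-word console input slot shared with the guest OS. */
struct console_slot {
    unsigned word;      /* character | SLOT_VALID while unread */
    unsigned overruns;  /* unread characters lost by being overwritten */
};

/* Hardware-like delivery: store at once, overwriting any unread character. */
void deliver_unsynchronized(struct console_slot *s, unsigned char ch)
{
    assert(s != NULL);
    if (s->word & SLOT_VALID)
        s->overruns++;                   /* previous character is lost */
    s->word = (ch & 0177u) | SLOT_VALID;
}

/* Synchronized delivery: refuse while the previous character is pending.
 * Returns 1 if stored, 0 if the caller must retry on a later poll. */
int deliver_synchronized(struct console_slot *s, unsigned char ch)
{
    assert(s != NULL);
    if (s->word & SLOT_VALID)
        return 0;
    s->word = (ch & 0177u) | SLOT_VALID;
    return 1;
}

/* Guest side: take the pending character and clear the slot.
 * Returns -1 if nothing is pending. */
int guest_read(struct console_slot *s)
{
    int ch;

    assert(s != NULL);
    if (!(s->word & SLOT_VALID))
        return -1;
    ch = (int)(s->word & 0177u);
    s->word = 0;
    return ch;
}
```

With a pasted burst such as "AB" and no guest read in between, the unsynchronized path leaves only 'B' in the slot and counts one overrun, while the synchronized path holds 'B' back until 'A' has been read.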
So, did it work??
On the PDP10-KA and KI, sim_poll_kbd is called at intervals whenever the PDP10 has no pending characters. On the KL things get rather complex, depending on the protocol being used, but there is a ring buffer that is 256 characters deep. It is still possible for SIMH SEND to overrun this buffer if the FE is in secondary (or boot) protocol. SIMH SEND can also overwhelm KA/KI input if the monitor is in ONCE code.
Bob's KS10 code, on the other hand, polls the input regardless of whether the CPU is ready to receive any. This could be the problem. However, this is probably what would have happened on real hardware, since I do not believe that the 8080 console UART had any FIFO.
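As a rough illustration of how a 256-character ring buffer can still be overrun, here is a minimal sketch. The names `ring`, `ring_put` and `ring_get` are hypothetical, and the choice to discard new characters when the buffer is full is an assumption; the actual KL front-end code may behave differently.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 256  /* depth mentioned above */

/* Hypothetical input FIFO between the console and the guest. */
struct ring {
    unsigned char data[RING_SIZE];
    size_t head;     /* next slot to write */
    size_t count;    /* characters currently buffered */
    size_t dropped;  /* characters discarded because the ring was full */
};

/* Producer side: returns 1 if stored, 0 if the ring was full.
 * Assumption: a full ring discards the new character. */
int ring_put(struct ring *r, unsigned char ch)
{
    assert(r != NULL);
    if (r->count == RING_SIZE) {
        r->dropped++;
        return 0;
    }
    r->data[r->head] = ch;
    r->head = (r->head + 1) % RING_SIZE;
    r->count++;
    return 1;
}

/* Consumer side: returns the oldest character, or -1 if the ring is empty. */
int ring_get(struct ring *r)
{
    size_t tail;

    assert(r != NULL);
    if (r->count == 0)
        return -1;
    tail = (r->head + RING_SIZE - r->count) % RING_SIZE;
    r->count--;
    return r->data[tail];
}
```

If the sender pushes 300 characters before the guest reads any, the last 44 are lost in this model, however carefully each individual character was timed.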
Now SCP SEND is far from perfect, it is very easy to overrun input even when they are not polling. Also characters can be dropped.
What I would like to see, if you plan to enhance SCP EXPECT/SEND, is something closer to the Expect/Send syntax, where you send a chunk and then allow several expects or other operations.
@markpizz, I tested your fei_svc but it didn't make any noticeable difference.
I don't think the problem has anything to do with console I/O. I increased the typing delays in the TCL quite a lot, but still no difference.
> Now SCP SEND is far from perfect, it is very easy to overrun input even when they are not polling. Also characters can be dropped.
See HELP SEND DELAY and HELP SEND AFTER. These set the delay (in instructions/cycles) between delivered characters and the number of instructions/cycles before the first sent character, respectively. The default for both is 1000, but each SEND command can specify arbitrary values as needed by what is going on, and the defaults can easily be changed. This behavior is precise: a simulator executing instructions quickly, or slowly due to potential disk I/O waits, will still deliver data precisely into the simulated system.
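The AFTER and DELAY behavior described above can be sketched as simple arithmetic on the simulated instruction count. This is an illustration of the description only, not SCP's implementation; `send_timing` and `delivery_point` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two SEND timing parameters. */
struct send_timing {
    unsigned long after;  /* instructions/cycles before the first character */
    unsigned long delay;  /* instructions/cycles between characters */
};

/* Instruction count at which character number `index` (0-based) is
 * delivered, for a SEND issued at instruction count `start`. */
unsigned long delivery_point(const struct send_timing *t,
                             unsigned long start, size_t index)
{
    assert(t != NULL);
    return start + t->after + (unsigned long)index * t->delay;
}
```

With both values at the default of 1000 and a SEND issued at instruction 0, the first three characters arrive at instruction counts 1000, 2000 and 3000. Host wall-clock time does not appear anywhere in the calculation, which is the sense in which the delivery is precise.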
> I don't think the problem has anything to do with console I/O. I increased the typing delays in the TCL quite a lot, but still no difference.
As long as this change still works, I'll commit it since it adds cut-and-paste support to this simulator. Right?
Yes, your updated fei_svc works as far as I could see.
The TCL script is careful to limit the rate of console input. I tried to increase the delay between characters to a rather high value, but the failure still happened. This is why I don't think the problem is related to console I/O.
When I experimented with a SIMH script I noticed the same SEND DELAY parameter resulted in different speeds in the "old" working commit compared to the latest commit. That makes me wonder if there has been a change globally in timing? (Since the old version is from May 2019 I don't expect this to be on the top of anyone's mind.)
If timing is slightly different, it could possibly affect disk I/O. And that could certainly affect the ITS build.
Or magtape I/O for that matter.
The current state is that the build goes through the bootstrapping cycle. It boots off a tape, creates a file system on disk, transfers some files from tape to disk, boots ITS, assembles a few programs including ITS itself. It then tries to reboot and start the new ITS. This is where the build stops, because the new ITS doesn't come up fully.
I'm trying to narrow it down to a smaller reproducible test case without TCL involved, but I haven't succeeded so far.
I do use long delays and long after times to get past these things. However, there is a timing issue in SCP; I pointed this out to you months ago. It started with your fix to handle negative time values. This even affects simulators that do not use any timer interrupt. Currently i7090 will not run IBSYS, because events do not occur when they should.
> What I would like to see if you plan to enhance SCP Expect/send is something closer to Expect/Send syntax, where you either send a chunk and the allow several expects, operations.
The syntax is never going to look like the TCL Expect/Send syntax for various reasons related to legacy requirements for parsing SCP commands. Meanwhile, functionality can be changed or extended and/or command variants can be added to address generally useful functionality. Please describe functionality that you would specifically find useful.
The scripted ITS build (https://github.com/PDP-10/its) no longer works with the PDP10 simulator (KS10). I'm testing commits from the version history since the last known working revision. I will provide more details below as I come across them.