Boot hangs as result of NE2K driver probe fail with no NIC

toncho11 commented 1 year ago

IMG_20221230_185501

This is on an IBM 5160 with the 64 KB motherboard and 640 KB of RAM installed, booting the MINIX version of ELKS.

toncho11 commented 1 year ago

The same image works on a Pravetz 16 and on the 256 KB version of the IBM 5160.

ghaerr commented 1 year ago

@toncho11, does this fail in the same place every time, or slightly different?

Also, do you normally compile in support for the network drivers? Just trying to see what might be different that is causing this.

I did make one (what I thought was small) change to the kernel in https://github.com/jbruchon/elks/commit/61fb3b3dda7dd9aa71631145311d6fea7a28ca29 which was tested on QEMU but not real hardware. Perhaps we should roll back prior to that to see if things work.

toncho11 commented 1 year ago

It always fails at the same place on the 64 KB version of the 5160. It failed 3 times in a row. It failed at least once on the 256 KB version of the 5160, but in general it works there.

toncho11 commented 1 year ago

No network card is installed.

ghaerr commented 1 year ago

Can you try removing the network drivers by unsetting all CONFIG_ETH_*? That will determine if the problem is in the network driver identification code, or possibly a timer interrupt issue that I was working on with @Vutshi.

ghaerr commented 1 year ago

I also disabled disable_timer_tick in elks/arch/i86/kernel/timer-8254.c, which could be the issue. Remove the #if NOTNEEDED around this code to test. Thank you!

void disable_timer_tick(void)
 {
 #if NOTNEEDED
     outb (TIMER_MODE0, TIMER_CMDS_PORT);
     outb (0, TIMER_DATA_PORT);
     outb (0, TIMER_DATA_PORT);
 #endif
 }

ghaerr commented 1 year ago

If you have not tested the kernel boot since Dec 19, that's when @Mellvik last updated the network drivers, and if the kernel is hanging in the same place, it could be in the NIC hardware identification code. (Or it could just be that a timer interrupt occurs at very nearly the same point, which has also been worked on. That's why it is important to know whether or not it dies at exactly the same point displayed on the screen every time.)

ghaerr commented 1 year ago

Finally, looking at the NE2K driver, it is incorrectly identifying an NE2K as present, even though there is no NIC installed. After the displayed line, the driver attempts to talk with the non-existent card, which could be causing some issues. Not sure yet. I would suggest recompiling without any NIC driver support to see what happens.

toncho11 commented 1 year ago

It was 3 times the same point on the screen.

ghaerr commented 1 year ago

Set the following in .config and make kclean; make

# CONFIG_ETH is not set
# CONFIG_ETH_NE2K is not set
# CONFIG_ETH_WD is not set
# CONFIG_ETH_EL3 is not set

toncho11 commented 1 year ago

Can you please provide me a 360 kb image without any NIC driver support?

ghaerr commented 1 year ago

Here you go:

fd360-minix.img.zip

Mellvik commented 1 year ago

Hmmm, what does '64kb motherboard' mean? Is the rest of the memory on ISA?

-M


toncho11 commented 1 year ago

Hmmm, what does '64kb motherboard' mean? Is the rest of the memory on ISA?

It is explained here: https://www.minuszerodegrees.net/5160/motherboard/5160_motherboard_revisions.htm I have both.

Mellvik commented 1 year ago

Thank you, @toncho11 - I should have remembered. I bought a 5160 in 1984 and did the mods a couple of years later. Ancient history.

I suspect removing the network drivers from the kernel improved the situation, right? ISA device probes are not reliable. There may be something else at IO address 300, or, on such an old system, reads from non-existing ports may deliver random results because of insufficient termination. So, as @ghaerr points out, getting rid of everything unneeded from the kernel is not only a good thing, it may be required.

-M

toncho11 commented 1 year ago

So @Mellvik, @ghaerr

I tested the latest ELKS f1254f6:

it does work with XT-IDE attached (that is why it works on my other IBM 5160 I suppose)
fails as explained earlier without XT-IDE attached

I tested without network probing as provided by @ghaerr :

works in both cases

There should be some decision here:

Should network probing be improved?
Should network probing be left out of the default build?
Could we add another automatic image that has network probing enabled?
Could there be a kernel boot option such as "Press S to skip network probing", or a "Safe mode" where other features are disabled as well?

toncho11 commented 1 year ago

Yes, my XT-IDE is on 300 I think. I will be putting the PC back to the shelf, so I won't be able to test more for a while.

Mellvik commented 1 year ago

Hi @toncho11,

there are in fact several things to consider here:

This problem has nothing to do with networking, but with ISA probing in general.
IO address 'collisions' must be resolved manually; there is no way to resolve them programmatically.
The ELKS default configuration assigns port 0x300 to the ne2k card, which is the most common 'user' of that address. That makes life easy for development, since it is also the address QEMU uses.
Incidentally, 0x300 is also the default address for the old XT-IDE interface, which means you either have to reconfigure the IO settings, remove unneeded devices from the configuration or (much easier) just edit the bootopts file and change the address setting for the ne2k interface to, say, 320 or 380 or something that isn't already used on the system.

That said, the probing in any ISA driver can always be improved. Now that we know that the XT-IDE interface commonly uses the same (default) I/O address, it's possible to change the test pattern to take that into account. In order to do that we would need someone to test it. If you volunteer, I'll create a test image for you.

Thank you.

-M


toncho11 commented 1 year ago

@Mellvik Did you notice that it actually loads OK with XT-IDE attached! It does not in the case where there is nothing attached. I mean it is not the other way around. So ELKS loading on a PC with no network card (and nothing else attached) should not fail, but it does on this particular system.

toncho11 commented 1 year ago

Hi @Mellvik,

Finally, looking at the NE2K driver, it is incorrectly identifying an NE2K as present, even though there is no NIC installed. After the displayed line, the driver attempts to talk with the non-existent card, which could be causing some issues. Not sure yet. I would suggest recompiling without any NIC driver support to see what happens.

@ghaerr identified a problem with the NE2K driver. Can you please have a look at it when you have the time?

ghaerr commented 1 year ago

Hello @toncho11 and @Mellvik,

I'm glad @toncho11 has identified an ELKS boot issue and suggested some paths forward, and that we can safely say all the previous changes removing the timer disable code are working. Out of @toncho11's options, I would say that we can probably improve the NE2K probing in a way such that a network-enabled kernel doesn't hang on boot when no NIC is installed. This would prevent a major boot issue since our default is to include NIC support in the kernel now.

@Mellvik's observation last week on @tyama501's PC-98 serial driver kernel hang issue brought up the point that if the kernel is written with the possibility of hanging, it probably will, eventually. In this case, a very real purpose of the NIC probe should be to prevent the kernel from getting confused and writing to the wrong device, hanging the kernel. (We have another case of this in the IDE Query code hanging the kernel; more on that below.)

@Mellvik, can you describe the way the NE2K implements its probing? We can then discuss possible solutions. I'm not sure exactly why the hang occurs, although I do see a new #if DELETEME which removes a second-check for a valid MAC address that could help.

Another kernel hang possibility is in idequery.c, in the following code (I found this while attempting to get ELKS to boot in the blink emulator and had to work around it by emulating an IDE port):

    while (1) {
        out_hd(drive, IDE_DRIVE_ID);
        while (WAITING(port)); // <--- this should be coded like the PC-98 serial driver and return on timeout

Thank you!
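The unbounded `while (WAITING(port));` loop above can be replaced by a poll with an upper limit. The following is a minimal sketch, not the ELKS code: `wait_not_busy`, `status_fn` and the two demo models are hypothetical names, and it assumes that `WAITING` tests the BSY bit (0x80) of the ATA status register.

```c
#include <stdint.h>

#define STATUS_BUSY 0x80    /* BSY bit of the ATA status register */

/* Hypothetical hook that stands in for inb() on the IDE status port. */
typedef uint8_t (*status_fn)(void *ctx);

/* Poll until BSY clears or max_polls reads have been made.
 * Returns 0 when the device reports ready, -1 on timeout. */
static int wait_not_busy(status_fn read_status, void *ctx, unsigned long max_polls)
{
    while (max_polls--) {
        if (!(read_status(ctx) & STATUS_BUSY))
            return 0;
    }
    return -1;
}

/* Demo model: a drive that stays busy for a fixed number of reads. */
struct fake_drive { unsigned busy_reads_left; };

static uint8_t fake_status(void *ctx)
{
    struct fake_drive *d = ctx;

    if (d->busy_reads_left) {
        d->busy_reads_left--;
        return STATUS_BUSY;
    }
    return 0x50;            /* DRDY | DSC: drive ready */
}

/* Demo model: no controller, every read returns 0xFF (BSY appears set). */
static uint8_t empty_status(void *ctx)
{
    (void)ctx;
    return 0xFF;
}
```

With the empty-port model the function returns -1 after `max_polls` reads instead of spinning forever, so the caller can report "no drive" and continue booting. A suitable poll limit for real hardware would have to be measured.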

Mellvik commented 1 year ago

Seriously guys, which part of my previous message was hard to understand?

The ne2k probe works fine. If you put in a different interface using the given IO address most ISA probes may mistakenly think they found 'their' card. So the probe may be improved but is it worth it? I don't think so, but as I suggested - with some joint effort it can be done.

In this case, the fix is to edit one byte in bootopts.

Of course I may be mistaken, but I don't see that there is a problem here, just a misunderstanding of how things work / are supposed to work.

Happy new year.

-M


tyama501 commented 1 year ago

It might be hard to modify bootopts without booting.

Happy New Year!

ghaerr commented 1 year ago

Hello @Mellvik,

The ne2k probe works fine. If you put in a different interface using the given IO address most ISA probes may mistakenly think they found 'their' card.

I guess the part I believe should probably change is that, when no NIC is present, and no IDE controller either (that is, nothing at port 0x300), the kernel, at least on @toncho11's actual IBM machine, incorrectly says there's a NE2K NIC present, then hangs. One shouldn't have to edit a (/bootopts) file in order to boot our standard kernel with no NIC in order to avoid a system hang, I would think. Perhaps this problem is occurring because this is the first time the NE2K driver has been tested without an IDE controller present?

I notice that the probe code is very simple: it outputs 0x20 to (default) I/O address 0x300, then inputs and compares to 0x00 or 0xFF, which indicate no NIC... Do you suppose the IBM 5160 is returning some random value from that address? Or does port 0x20 output need time to settle, as there is no pause between successive output and input instructions, which IIRC were required on early slow machines.

What does an IDE controller do when 0x20 is sent to 0x300? Are all machines supposed to return either 0 or 0xff with no device present?

Happy New Year!

Mellvik commented 1 year ago

@ghaerr,

The ne2k probe works fine. If you put in a different interface using the given IO address most ISA probes may mistakenly think they found 'their' card.

I guess the part I believe should probably change is that, when no NIC is present, and no IDE controller either (that is, nothing at port 0x300), the kernel, at least on @toncho11's actual IBM machine, incorrectly says there's a NE2K NIC present, then hangs. One shouldn't have to edit a (/bootopts) file in order to boot our standard kernel with no NIC in order to avoid a system hang, I would think. Perhaps this problem is occurring because this is the first time the NE2K driver has been tested without an IDE controller present?

I believe this was covered in my message with the bullet points. Given the nature of the ISA bus the only thing a probe can do reliably is to verify the absence or presence of an interface. It's possible that the earliest PCs lacked proper bus termination and the response when writing to, then reading from a port with no connection is unpredictable (should be 0xff). If that's the case, probing will be unreliable guesswork regardless.

That said, I was reading @toncho11's messages differently, that the hang happens when the IDE card is present at 0x300, not when it's absent. In that case the behaviour is as expected, and IMHO the user must fix it. We cannot possibly adapt the default ELKS setup to a very rare case and let the more regular cases suffer. Fixing/adjusting bootopts before generating a floppy is the norm when testing out new hardware, and should be the norm in this case too. As should running a menuconfig and build to maximize chances for success.

I notice that the probe code is very simple: it outputs 0x20 to (default) I/O address 0x300, then inputs and compares to 0x00 or 0xFF, which indicate no NIC... Do you suppose the IBM 5160 is returning some random value from that address? Or does port 0x20 output need time to settle, as there is no pause between successive output and input instructions, which IIRC were required on early slow machines.

It's quite possible that a wait loop would be useful in the probe routine if the assumption above about unterminated bus is correct. @toncho11, you did not respond to my proposal to track down this jointly. A few commands in MSDOS DEBUG is all that's needed.

What does an IDE controller do when 0x20 is sent to 0x300? Are all machines supposed to return either 0 or 0xff with no device present?

Most newer (AT and up) will return FF on read from a nonexisting port. A simple probe is: read port - if FF then write something, say 0x20, then read back. If it's still FF we're safe, nothing there. The 20 in this case increases the chance of differentiation between a ne2k and something else (we should read back 0x81 or 0x1). What a different controller may respond with is anyone's guess. I'm not sure two IDE controllers would respond the same even, they usually depend on on-board ROM code to run.

Happy New Year!

Indeed, 2023 is here. Happy New Year.

Thank you.

-M
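The "simple probe" described in the comment above (read the port, and if it reads FF, write 0x20 and read back) can be sketched in C as follows. This illustrates the idea only and is not the ELKS ne2k probe. The `port_ops` hooks and the `bus_model` structure are hypothetical, and the "bus holds the last written value" behaviour is an assumption used to model an unterminated bus, not a measured property of the 5160.

```c
#include <stdint.h>

/* Hypothetical port-access hooks so the probe can run against a model. */
struct port_ops {
    uint8_t (*in)(void *bus, uint16_t port);
    void (*out)(void *bus, uint16_t port, uint8_t val);
    void *bus;
};

/* Returns 0 when the port looks empty, 1 when something may be there. */
static int simple_probe(const struct port_ops *io, uint16_t port)
{
    if (io->in(io->bus, port) != 0xFF)
        return 1;               /* something answered before any write */
    io->out(io->bus, port, 0x20);
    if (io->in(io->bus, port) == 0xFF)
        return 0;               /* still FF after the write: nothing there */
    return 1;
}

/* Bus model: either pulled up (always reads `value`) or floating,
 * in which case a read returns the last value that was written. */
struct bus_model {
    uint8_t value;
    int holds_last_write;
};

static uint8_t model_in(void *bus, uint16_t port)
{
    struct bus_model *b = bus;

    (void)port;
    return b->value;
}

static void model_out(void *bus, uint16_t port, uint8_t val)
{
    struct bus_model *b = bus;

    (void)port;
    if (b->holds_last_write)
        b->value = val;
}
```

On the pulled-up model the probe reports an empty port. On the floating model it reports a possible device even though nothing is installed, which is the kind of false positive discussed in this thread.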

toncho11 commented 1 year ago

Happy new year! :)

Unfortunately I have to admit that I do not agree. I am glad, @Mellvik, that after several additional clarification messages you understood my original message correctly. Also, @Mellvik, how can you say that my case, a standard original IBM 5160 with no cards attached, is a rare case? I personally think it is not a rare setup at all. I can test on 3 more machines. If it fails there too, would that be enough to consider it a non-rare case? In my opinion network probing should be disabled in the default builds if it is this unpredictable. Or we can add two more builds, fd360-noeth and fd1440-noeth for example, and the problem will be more or less solved.
I also do not agree that users should be required to recompile ELKS each time. That is why so many builds were made: so that it is easy, meaning less effort, also for people with less experience. Some people have never compiled or configured a kernel. Do not get me wrong, I actually like the network probing, and it probably took a lot of effort to develop. I also have a network card.

Mellvik commented 1 year ago

Hi @toncho11,

you're of course welcome to have your opinions. And I may be guilty of not completely understanding your scenario all the time.

That is reciprocal, however. If you'd read my messages you'd have understood by now that this has nothing to do with networking, nothing to do with network probing. This is how the ISA bus works, like it or not.

Let me know if you'd like to contribute to sort out your issues.

-M


ghaerr commented 1 year ago

Hello @toncho11 and @Mellvik -

I'm a bit confused here - are we saying that nothing can be done about this? To reiterate my understanding: the PC in question has no cards installed - no NIC, no IDE controller. Yet ELKS hangs on boot because it thinks there's a NIC present. IMO, we can do better. No, we're not going to create even more ELKS images, our default image should just work in this case.

Given that we're not just arbitrarily searching for any I/O port address, and the address in question is known to possibly be used as a NIC or IDE controller, I think we could likely differentiate between the two (or nothing). That is, instead of just checking 0x00 or 0xFF return, possibly confirm / reject with other results, to avoid an ELKS kernel hang - something that is a big turnoff for most users.

Thank you!

ghaerr commented 1 year ago

Hello @Mellvik,

I know very little about the ISA bus, so forgive me as I ask a couple possibly dumb questions (and thanks):

Most newer (AT and up) will return FF on read from a nonexisting port.

What does this mean for older systems? I am a bit concerned about the NE2K issue, because we don't really know what the reason is for the kernel hang - although I suspect that the init routine (after thinking the NIC is present) is busy looping trying to get a result that is not forthcoming. Could older systems just hang on the IN or OUT instruction forever, or will they always succeed and possibly return any value? It would appear that @toncho11's system is not returning 00 or FF, thus passing the initial probe test.

A simple probe is: read port - if FF then write something, say 0x20, then read back.

Why test for FF, if FF means nonexistent port? I don't quite understand that. Why not also write a command value and see whether the controller responds with an appropriate value (like 81 or 01 below)?

If it's still FF we're safe, nothing there.

In the NE2K probe, we don't do this. We only check once for FF, then return not present. In this case, we then assume present, and there isn't a code path that tests for a second FF return value after sending 20, and rejecting the NIC. The test below is done after the probe:

The 20 in this case increases the chance of differentiation between a ne2k and something else (we should read back 0x81 or 0x1). What a different controller may respond with is anyone's guess. I'm not sure two IDE controllers would respond the same even, they usually depend on on-board ROM code to run.

Agreed. It would seem that the above 2nd read check should be in the probe routine.

Given the nature of the ISA bus the only thing a probe can do reliably is to verify the absence or presence of an interface.

What specifically does this mean? IN and OUT instructions work, correct? So all we're saying is that IN and OUT are the only instructions that can be used for the ISA bus, vs other instructions for another type of bus?

Thank you!

Mellvik commented 1 year ago

@ghaerr,

I'll come back to the specifics in your message, but for now it seems reasonable to treat the pre-AT systems as a new platform variant with different bus behaviour (which is a fact), not pretending there is something wrong with what we already have. This is development. When we've determined the specifics of the differences and the new requirements, we'll just bake them into ELKS with code and/or ifdefs, as usual.

Uncomplicated: a few tests on actual hardware and we know what we're dealing with. Probably minor adjustments. I do not have such hardware available, but like I've communicated several times before, I'll be happy to provide test instructions to get us started.

I'm not worried about the hangs. When OS and hardware don't match, hangs are inevitable. Even the pdp11 running a production OS used to hang when the configs didn't match.

-M

ghaerr commented 1 year ago

@Mellvik:

I checked the Linux 2.0 source to see what it does for probing, and I found that it performs a check to determine if an 8390 is present as the 2nd item in its otherwise-lengthy probe, after checking for FF like we do. The 8390 test uses the read-only COUNTER register (at offset 0x0D, same as the TXC transmit configuration write-only register) to force a register increment by setting it to 0xFF then reading it, checking for increment to 0:

    int reg0 = inb_p(ioaddr);
    if (reg0 == 0xFF)
        return ENODEV;

    /* Do a preliminary verification that we have a 8390. */
    {   int regd;
        outb_p(E8390_NODMA+E8390_PAGE1+E8390_STOP, ioaddr + E8390_CMD);
        regd = inb_p(ioaddr + 0x0d);
        outb_p(0xff, ioaddr + 0x0d);
        outb_p(E8390_NODMA+E8390_PAGE0, ioaddr + E8390_CMD);
        inb_p(ioaddr + EN0_COUNTER0); /* Clear the counter by reading. */ // <--- counter increments from FF to 0 if 8390
        if (inb_p(ioaddr + EN0_COUNTER0) != 0) {
            outb_p(reg0, ioaddr);
            outb_p(regd, ioaddr + 0x0d);        /* Restore the old values. */
            return ENODEV;
        }
    }

With just a few more instructions than our current probe, this might be a mechanism that would work across XT and AT systems, by verifying a chip rather than relying on a specific bus behavior. It goes to the trouble of restoring the old value, which we may not need.
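To see how the quoted Linux check behaves, it can be run against two small software models. This is a sketch under stated assumptions, not a tested ELKS patch: the `port_ops` hooks, `looks_like_8390`, and both models are hypothetical. The 8390 model covers only the command register, the page 1 register at offset 0x0d, and a page 0 counter that clears when read, as the comment in the Linux snippet indicates. The floating-bus model assumes a read returns the last value written.

```c
#include <stdint.h>

/* Register names follow the quoted Linux snippet. */
#define E8390_CMD    0x00
#define E8390_STOP   0x01
#define E8390_NODMA  0x20
#define E8390_PAGE0  0x00
#define E8390_PAGE1  0x40
#define EN0_COUNTER0 0x0d

/* Hypothetical port-access hooks so the check can run against a model. */
struct port_ops {
    uint8_t (*in)(void *bus, uint16_t port);
    void (*out)(void *bus, uint16_t port, uint8_t val);
    void *bus;
};

/* Returns 1 if the device at base passes the preliminary 8390 check. */
static int looks_like_8390(const struct port_ops *io, uint16_t base)
{
    uint8_t reg0, regd;

    reg0 = io->in(io->bus, base);
    if (reg0 == 0xFF)
        return 0;
    io->out(io->bus, base + E8390_CMD, E8390_NODMA | E8390_PAGE1 | E8390_STOP);
    regd = io->in(io->bus, base + 0x0d);
    io->out(io->bus, base + 0x0d, 0xFF);
    io->out(io->bus, base + E8390_CMD, E8390_NODMA | E8390_PAGE0);
    (void)io->in(io->bus, base + EN0_COUNTER0);    /* first read clears it */
    if (io->in(io->bus, base + EN0_COUNTER0) != 0) {
        io->out(io->bus, base, reg0);              /* restore old values */
        io->out(io->bus, base + 0x0d, regd);
        return 0;
    }
    return 1;
}

/* Model A: the few 8390 registers the check touches. */
struct mock_8390 { uint8_t cmd, mar5, cntr0; };

static uint8_t nic_in(void *bus, uint16_t port)
{
    struct mock_8390 *n = bus;
    uint8_t v;

    switch (port & 0x1f) {
    case E8390_CMD:
        return n->cmd;
    case 0x0d:
        if ((n->cmd & 0xC0) == E8390_PAGE1)
            return n->mar5;
        v = n->cntr0;           /* page 0: counter clears on read */
        n->cntr0 = 0;
        return v;
    default:
        return 0;
    }
}

static void nic_out(void *bus, uint16_t port, uint8_t val)
{
    struct mock_8390 *n = bus;

    switch (port & 0x1f) {
    case E8390_CMD:
        n->cmd = val;
        break;
    case 0x0d:
        if ((n->cmd & 0xC0) == E8390_PAGE1)
            n->mar5 = val;
        break;                  /* page 0 write is not modelled */
    }
}

/* Model B: floating bus, a read returns the last value written. */
struct floating_bus { uint8_t last; };

static uint8_t float_in(void *bus, uint16_t port)
{
    (void)port;
    return ((struct floating_bus *)bus)->last;
}

static void float_out(void *bus, uint16_t port, uint8_t val)
{
    (void)port;
    ((struct floating_bus *)bus)->last = val;
}
```

The 8390 model passes and the floating-bus model is rejected, because the second counter read returns the last written command byte rather than 0. A bus that reads 0x00 on every port would still pass this step alone, so the existing rejection of 0x00 would need to stay in place.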

toncho11 commented 1 year ago

Hi @toncho11, you're of course welcome to have your opinions. And I may be guilty at not completely understanding your scenario all the time. That is reciprocal however. If you'd read my messages you'd understood by now that this has nothing to do with networking, nothing to do with network probing. This is how the ISA bus works, like it or not. Let me know if you'd like to contribute to sort out your issues.

Hmm. Saying "nothing to do" confuses me. For me, the fact is that disabling the network probing fixes the crashing of ELKS. The network probing requires some communication over the ISA bus to test for the presence of a specific card, and that leads to a problem. This is how I see it. I cannot fully go into the technical details though, I admit that. Anyway, it is pointless to argue any more.

Mellvik commented 1 year ago

@ghaerr,

A simple probe is: read port - if FF then write something, say 0x20, then read back.

Why test for FF, if FF means nonexistent port? I don't quite understand that.

Pardon me, but I'm losing you here. We want to test for presence and reading FF is a reliable indicator. Why wouldn't we use that? And remember - the 'simple probe' example quoted above is an example, attempting to convey the idea. I never said this is what we're doing in the ne2k probe - which is simpler and more efficient while following the same idea.

Why not also write a command value and see whether the controller responds with an appropriate value (like 81 or 01 below)?

That's what the ne2k probe is doing, so I still don't understand what the question is. We cannot test for 81 or 1 because different interface types respond slightly differently at this point in initialization, and the same interface may respond differently depending on whether we're at power up or just reboot. There are no rules, so experience rules.

If it's still FF we're safe, nothing there.

In the NE2K probe, we don't do this. We only check once for FF, then return not present. In this case, we then assume present, and there isn't a code path that tests for a second FF return value after sending 20, and rejecting the NIC. The test below is done after the probe:

I don't know what code you're reading, it cannot be the ne2k probe because it's different from what you describe.

The 20 in this case increases the chance of differentiation between a ne2k and something else (we should read back 0x81 or 0x1). What a different controller may respond with is anyone's guess. I'm not sure two IDE controllers would respond the same even, they usually depend on on-board ROM code to run.

Agreed. It would seem that the above 2nd read check should be in the probe routine.

Given the nature of the ISA bus the only thing a probe can do reliably is to verify the absence or presence of an interface.

What specifically does this mean? IN and OUT instructions work, correct? So all we're saying is that IN and OUT are the only instructions that can be used for the ISA bus, vs other instructions for another type of bus?

I'm sorry @ghaerr - at this point I'm giving up. I just don't see how a (to me) clear statement about what an ISA probe can do reliably can create that kind of inference. I rest my case so to speak.

Thank you.

—M

Mellvik commented 1 year ago

@ghaerr,

to me this is barking up the wrong tree - like adjusting the carburettor when the brakes are failing.

Keep in mind: The current ELKS ISA probes work fine when the bus acts predictably. In this case it does not. Instead of wasting time trying to cure symptoms we need to find out why - and how the bus is different.

SO let's figure out how the early ISA bus behaves when reading or writing nonexistent ports. Gathering such knowledge should take a couple of minutes at most using DEBUG on MSDOS.

Quoting from https://gist.github.com/PhirePhly/2209518:

SD0-SD15: System Data lines, or Standard Data Lines. They are bidirectional and tri-state. On most systems, the data lines float high when not driven.

Obviously, the data lines do not float high on @toncho11's systems, and for all I know (need to check the schematics for this) a read to a non-existing port on such a bus may hang. It does not sound likely though - it would be bad design.

What would be interesting to know, in addition to probing the ports in DEBUG (@toncho11, did I read you correctly that you have two 5160 systems?): Do the two systems, without the XT-IDE interface, hang the same way when booting the standard ELKS kernel? Also, does the XT-IDE interface work? We need to rule out real hardware problems. I'm assuming the power-on memory test is OK, or is the system too old for that?

@toncho11, let us know when you're ready to participate in investigating the issue. I can send you some DEBUG commands for testing if required.

@ghaerr, the probe code from Linux increases the chance of IDing the actual interface with some precision, but that's not our problem. I'm not sure the extra code is useful enough to qualify the space it takes. It depends on where we want to take ELKS. We're coming from a place where it was expected that the user configures a system to match the hardware. The further we move away from that, the larger the codebase and ram footprint.

-M

Jan. 2023 at 22:18, Gregory Haerr @.*** wrote:

@Mellvik:

I checked the Linux 2.0 source to see what it does for probing, and I found that it performs a check to determine if an 8390 is present as the 2nd item in its otherwise-lengthy probe, after checking for FF like we do. The 8390 test uses the read-only COUNTER register (at offset 0x0D, same as the TXC transmit configuration write-only register) to force a register increment by setting it to 0xFF then reading it, checking for increment to 0:

        int reg0 = inb_p(ioaddr);
        if (reg0 == 0xFF)
            return ENODEV;

        /* Do a preliminary verification that we have a 8390. */
        {   int regd;
            outb_p(E8390_NODMA+E8390_PAGE1+E8390_STOP, ioaddr + E8390_CMD);
            regd = inb_p(ioaddr + 0x0d);
            outb_p(0xff, ioaddr + 0x0d);
            outb_p(E8390_NODMA+E8390_PAGE0, ioaddr + E8390_CMD);
            inb_p(ioaddr + EN0_COUNTER0); /* Clear the counter by reading. */ // <--- counter increments from FF to 0 if 8390
            if (inb_p(ioaddr + EN0_COUNTER0) != 0) {
                outb_p(reg0, ioaddr);
                outb_p(regd, ioaddr + 0x0d);        /* Restore the old values. */
                return ENODEV;
            }
        }

With just a few more instructions than our current probe, this might be a mechanism that would work across XT and AT systems, by verifying a chip rather than relying on a specific bus behavior. It goes to the trouble of restoring the old value, which we may not need.


toncho11 commented 1 year ago

ISA probing is the opposite of "expected that the user configures a system to match the hardware", is it not? Both of my IBM 5160s work fine. The memory self-tests of the 640 KB of memory are OK. DOS loads OK. I cannot test much, and only for a limited time: each system requires that I take it off the shelf and assemble it until it is in working condition. Beyond that, the two systems have different versions of the BIOS. One useful thing could be to bracket the ISA probing between two messages, "Starting ISA probing ..." and "Done ISA probing". This way it will be clear where it failed when other people use it and report problems. I think ELKS is an OS for both XT and AT. We should not privilege one or the other.

Mellvik commented 1 year ago

@toncho11,

I think ELKS is an OS for both XT and AT. We should not privilege one or the other.

ELKS is the OS for many platforms. XT and before apparently not fully supported yet and will remain so until someone is sufficiently interested to participate.

The required adjustments in ELKS will likely be minimal.

-M

ghaerr commented 1 year ago

Hello @Mellvik and @toncho11,

I hope to provide a path forward for the issue(s) brought up in this thread, as well as a few thoughts on "what ELKS is for". I'll grab various statements and comment...

A couple definitions for my comments here: A "network probe" and "ISA probe" are two different things - an ISA probe is supposed to determine whether any device is at an address on the ISA bus, while a "network probe" is supposed to determine whether a specific NIC is present.

This problem has nothing to do with networking, but with ISA probing in general.

We want to test for presence and reading FF is a reliable indicator.

The current ELKS ISA probes work fine when the bus acts predictably. In this case it does not.

Obviously, the data lines do not float high on @toncho11's systems,

On the strict matter of "ISA probing", I believe the above statements to be accurate, though only for IBM PC AT and later systems. That is, to determine if ANY TYPE of card is present on a port on the ISA bus, one must execute an IN instruction for that I/O port address, and, if FF is returned, assume not present. For IBM PC 5160 (or possibly XT and earlier), it seems this is not true. That is, a value not equal to FF may be returned when there is NOT ANY card present at that address.

The NE2K driver currently executes the following code for its ISA probe:

        err = ne2k_probe();
        printk("eth: %s at 0x%x, irq %d", dev_name, net_port, net_irq);
        if (err) {
            printk(" not found\n");
            break;
        }
        found = 1;

The actual probe routine is in ASM:

ne2k_probe:

        // Poke then peek at the base address of the interface.
        // If something is there, return 0.
        // No attempt is made to get details about the i/f.

        mov     net_port,%dx    // command register
        mov     $0x20,%al       // set page 0
        out     %al,%dx
        in      %dx,%al
        cmp     $0xff,%al       // cannot be FF
        jz      np_err
        cmp     $0,%al          // cannot be 0
        jz      np_err
        xor     %ax,%ax
        jmp     np_exit
np_err:
        mov     $1,%ax
np_exit:
        ret

The first thing to recognize here is that the NE2K driver ne2k_probe performs a proper ISA probe (while also rejecting 00 return values, which does not concern us here). However, the NE2K driver only uses the ISA probe to determine whether an NE2K NIC is present. This means that any card plugged in at the netport address that responds not equal to FF will be seen as an NE2K NIC. Therein lies the problem for early XT systems.
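To make the false positive concrete, here is a minimal C model of the probe logic above. It is only a sketch: `struct port_io`, `ne2k_probe_model` and the `fake_*` functions are hypothetical stand-ins so the logic can run without hardware, and 0x88 is just an example of a value that is neither FF nor 00.

```c
#include <stdint.h>

/* Hypothetical port-access hooks, standing in for real inb/outb. */
struct port_io {
    void    (*out8)(uint16_t port, uint8_t val);
    uint8_t (*in8)(uint16_t port);
};

/* Mirrors the asm probe: write 0x20 to the command register, read it
 * back, and return 1 (error) only if the value is FF or 00. */
static int ne2k_probe_model(const struct port_io *io, uint16_t net_port)
{
    uint8_t v;

    io->out8(net_port, 0x20);       /* set page 0 */
    v = io->in8(net_port);
    if (v == 0xFF || v == 0x00)
        return 1;                   /* nothing there */
    return 0;                       /* "something" is there */
}

/* Simulated empty slot on a bus whose data lines float high. */
static void    fake_out8(uint16_t port, uint8_t val) { (void)port; (void)val; }
static uint8_t fake_in8_float_high(uint16_t port)    { (void)port; return 0xFF; }

/* Simulated empty slot on a bus that returns some other value. */
static uint8_t fake_in8_other(uint16_t port)         { (void)port; return 0x88; }
```

With the float-high bus the model reports an error, as intended. With the second bus it reports success even though no card exists, so the driver would carry on as if an NE2K were present.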

and for all I know (need to check the schematics for this) a read to a non existing port on such a bus may hang. It does not sound likely though - it would be bad design.

Since the kernel prints the eth: ne0 at 0x300, irq 12, and does not print not found, we know the code path is such that the "ISA probe" succeeded, and the kernel hang is later in the NE2K driver. Thus, we also know the IBM 5160 is not hanging on an IN or OUT instruction.

SO let's figure out how the early ISA bus behaves when reading or writing nonexistent ports. Gathering such knowledge should take a couple of minutes at most using DEBUG on MSDOS.

Yes - that would be nice to know. But it doesn't change anything, since the IBM 5160 is obviously returning some value between 1 and 254 inclusive, thus essentially failing the ISA-probe-only mechanism currently used to determine whether a NIC card is present. There should be an additional check that the card set at the netport address is indeed a NIC, rather than say, an IDE controller.

the probe code from Linux increases the chance of IDing the actual interface with some precision

Yes.

Both the wd and 3c NIC drivers perform more than a strict ISA probe to determine their chip is present - only the ne driver performs just an ISA probe, with no attempt at chip identification until past the probe routine returning -ENODEV.

It depends on where we want to take ELKS. We're coming from a place where it was expected that the user configures a system to match the hardware.

I feel lucky to have the team of contributors we have on ELKS. @Mellvik, you've been associated with ELKS longer than anyone, and I really appreciate the huge amount of time you've put in testing and enhancing the NIC device drivers. @toncho11 regularly contributes testing and (rare) common-sense comments on the way users will perceive ELKS.

I myself like to write software, not just for myself, but for others. Writing software itself is not enough, I want people to run it, without having to be a developer. I want more users to come to our project, download it and give it a try. If the software does not work, I feel many will quietly just leave without saying anything, and never come back - just like a bad restaurant.

ELKS itself is advertised as an OS for any 8086 based system, PC, XT, AT and other "nearly compatibles". Many contributors have put in a lot of work to ensure the kernel boots and performs floppy identification and I/O with older BIOSes, including the very first PC BIOS. I feel that our standard distribution needs to work out-of-the-box on all PC, XT and AT hardware, which it does, except now we find for this issue it does not.

I'm not sure the extra code is useful enough to qualify the space it takes.

That could be. We might measure how much code we're talking about. I suspect less than 50 bytes.

We're coming from a place where it was expected that the user configures a system to match the hardware.

If desired, we can remove the network drivers from the default distribution, that will allow booting ELKS for all IBM PC users. However, it would be a shame to not allow others to see ELKS networking in operation, should they have hardware installed. We've put a lot of work into that.

XT and before apparently not fully supported yet and will remain so until someone is sufficiently interested to participate.

XT is supposed to be fully supported. I am willing to write and post a PR with the suggested 8390 chip identification code after the ISA probe, but only tested on QEMU, if desired (or @Mellvik you can, if you would prefer). This will very likely fix this problem, although it could introduce another issue - that of interference with an IDE controller at the same address, being reprogrammed during the 8390 chip identification code. That will have to be tested on @toncho11's system. I am going to read up further on the 8390 counter register and look at the IDE controller register (if any) at address port+0x0d, and will report further information.

Thank you!

toncho11 commented 1 year ago

Thank you @ghaerr! Much appreciated on both the technical and non technical level of your post! :)

Mellvik commented 1 year ago

Great summary @ghaerr, and a number of interesting discussion points that I may follow up on. Appreciated.

What I have a hard time with is that this is a proverbial tempest in a teapot. We're spending hours speculating and arguing while the tiny problem can be solved in a few minutes with hardware - as I have pointed out a number of times. Without hardware everything is speculation. This just doesn't make sense.

Let's stop wasting time and concentrate on things that can be fixed - and tested.

I'll start a new thread on ISA probing separately - we need a common understanding of what's useful and what's possible.

-M

ghaerr commented 1 year ago

Hello @toncho11,

Regarding the original issue of the kernel hanging during boot, by reading the NE2K driver source I was able to see very likely where this is occurring: after the initial probe, where your PC appears to return a random value other than FF or 00 on port reads from non-existent devices, the NIC address is not rejected and the driver goes ahead and attempts to read the MAC address. The MAC address reading routine, dma_read, uses a kernel busy loop with interrupts disabled, reading from a port address to determine DMA completion. In this case no NIC is present, and the DMA completion bit is never seen, thus hanging forever. The DMA read routine is dual-used for both packet transfers as well as reading the MAC address. Perhaps the routine could be rewritten so as not to hang for the case of reading the MAC address into a local buffer. I still believe a better answer is to enhance the initial probe so that this routine never executes, leaving the point moot. This is the same routine that ended up reading a MAC address of ABABAB... when talking to the XT-IDE controller using my first commit, which failed to reject a probe return of 00.
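One way the busy loop could be made safe is to bound the number of polls. The following is a sketch under assumptions, not the actual ELKS `dma_read` code: `wait_dma_complete`, the callback type and `DMA_POLL_LIMIT` are hypothetical names, and the completion bit is assumed to be the 8390 ISR remote-DMA-complete bit (0x40).

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_DONE_BIT   0x40     /* assumed: remote DMA complete bit in ISR */
#define DMA_POLL_LIMIT 10000u   /* arbitrary limit, for illustration only */

/* Hypothetical callback that reads the status port. */
typedef uint8_t (*read_status_fn)(void *ctx);

/* Poll for DMA completion, but give up after DMA_POLL_LIMIT reads.
 * Returns 0 when the bit is seen, -1 if the limit expires. */
static int wait_dma_complete(read_status_fn read_status, void *ctx)
{
    unsigned int tries;

    for (tries = 0; tries < DMA_POLL_LIMIT; tries++) {
        if (read_status(ctx) & DMA_DONE_BIT)
            return 0;
    }
    return -1;                  /* no NIC answered: report, do not hang */
}

/* Test double: empty slot whose reads never show the completion bit. */
static uint8_t status_never(void *ctx) { (void)ctx; return 0x88; }

/* Test double: completion bit appears after *ctx more reads. */
static uint8_t status_after(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining == 0)
        return DMA_DONE_BIT;
    (*remaining)--;
    return 0;
}
```

A caller reading the MAC address could treat the -1 return as "no NIC" and fail the probe, instead of spinning forever with interrupts disabled.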

On a separate note, regarding whether chip identification at boot is a good idea or not, the issue is a bit clouded: none of the network drivers are actually opened at boot, but are specified and opened later by the /etc/rc.sys, /etc/net.cfg or /bin/net scripts, so having a driver falsely identify an 8390 chip as an NE2K versus a 3C NIC doesn't matter. Both drivers could identify "their" NIC at the same address without much consequence, as long as the kernel doesn't hang during the further information-gathering process. However, it is quite convenient for users and developers to see more information at boot about what the drivers think the NIC interfaces are, by displaying the MAC addresses, the specific device model probed and the flags configured. Thus, the probe routines do more than just determine NIC presence, and that's what ultimately caused the kernel hang.

So, for the time being at least, should your tests for #1508 pass, we will use the original probe, followed by an 8390 chip detect. This works around the apparent problem of early XT machines returning random bus data for non-existent devices, and so prevents a kernel hang at boot. I will be adding a #define to easily remove the 8390 chip detect, should that become a problem, in which case we'll likely need to fix the dma_read routine in the NE2K driver for reading the MAC address.

Thank you!

toncho11 commented 1 year ago

Thank you @ghaerr and well done! I will test when I get some free time.

ghaerr commented 1 year ago

Hello @toncho11,

Thanks for your testing on three of your older machines. I have finalized PR #1508 which by default uses a slightly different probe result check which will solve this issue, and present very little change to the NE2K NIC identification process, allowing for maximum compatibility with a variety of NE2K NICs, and not hanging on an IBM 5160 XT with no NIC present.

Conclusions:

- The NE2K driver hangs in the dma_read routine when no NIC is present on IBM 5160 XT, while attempting to read the MAC address. This is a result of the initial probe allowing any value other than FF or 00 to indicate NIC present.
- The NE2K driver should be enhanced to not check for the DMA complete bit when reading into local buffer (MAC address). This change has not been made due to lack of real hardware to test on, but the kernel should not busy loop indefinitely, unless absolutely required.
- The IBM 5160 XT appears to respond non-randomly, but incompatibly with the current probe by returning 88 then 30 from ISA port reads at address 0x300. Other systems properly return FF, as their ISA bus seems to operate differently.
- A "raw ISA probe" of not writing a port, but only reading, won't work, as QEMU returns 00 when reading from port 0x300 without a command previously being sent. Apparently QEMU waits for a port command to be written before properly setting up its NIC emulation. The 00 return causes the probe to reject a NIC present.
- The original NE2K probe checked for FF and 00 returns of an 8390 controller write command probe, accepting NIC present otherwise.
- The new NE2K probe allows only for 0x21 and 0x23 returns to accept NIC present, rejecting otherwise. This probe was taken from Donald Becker's 8390 NIC probe, and also works rejecting the original FF and 00 values. It is just a few bytes larger than the original probe.
- An additional "more robust" probe is included, but not turned on by default, which writes the TXCR and CNTR0 registers. This is not turned on due to possible compatibility issues and lack of testing on real hardware.
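
The difference between the original and the new acceptance rule can be shown with a small C sketch. The function names are hypothetical and this is not the code from the PR. It only encodes the two rules as described in the list above.

```c
#include <stdint.h>

/* Original rule: anything except FF and 00 counts as "NIC present". */
static int old_probe_accepts(uint8_t v)
{
    return v != 0xFF && v != 0x00;
}

/* New rule: only 0x21 and 0x23 count as "NIC present". */
static int new_probe_accepts(uint8_t v)
{
    return v == 0x21 || v == 0x23;
}
```

Applied to the values reported above, the old rule accepts 88 and 30 (the IBM 5160 XT with no card), while the new rule rejects them along with FF and 00.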

Thank you!

toncho11 commented 1 year ago

@ghaerr Did I test the "tiny" or the "full" probe? The "tiny", I suppose.

ghaerr commented 1 year ago

Did I test the "tiny" or the "full" probe? The "tiny", I suppose.

Actually, both. This is because the tiny probe looks at just the first byte returned, and rejects a NIC present if not equal to 21 or 23 (your systems returned either FF (correct), 80 (XT no IDE), or 00 (XT w/IDE)). The full probe was also present, but is not needed, since with no NIC all of these values are rejected before it runs. I left in the full probe, off by default, since we tested that at the very first, before seeing that a single byte return would suffice. After testing more on real hardware, the full probe can probably be removed as unneeded.

toncho11 commented 1 year ago

I see. Thank you!

ghaerr / elks

Boot hangs as result of NE2K driver probe fail with no NIC #1500