Race condition in Hercules networking (LCS, etc)

hercules-390 / hyperion

Hercules 390

Other

252 stars 70 forks source link

Race condition in Hercules networking (LCS, etc) #43

Closed ghost closed 9 years ago

ghost commented 9 years ago

Florian reported:

I thought it was fixed because I didn't face this problem for quite some time.

Host is openSuse 13.1 and guest OS is z/VM 5.4. LCS is normally quite stable, however starting the TCP/IP stack can take several tries to establish correct working.

Opening of TAP is OK and also there is no explicit error under VM but NO communication is possible. Only shutting down z/VM, stopping Hercules, exiting the shell from which Hercules started and then restart everything solve the problem. Sometimes two, three or more restarts are necessary, which is annoying.

When it works, there is message HHC00942I shown on the console. If this message is not showing up the communication won't work. That is repeatable.

This bug is there since quite a long time and to me gives the impression of a race condition.

I understand. And yes, that certainly does sound like a race condition. Possibly this one, which I documented a long time ago but have not had time to address (fix):

https://groups.yahoo.com/neo/groups/zHercules/conversations/messages/15132

The above explains the problem from a z/OS perspective, but I believe it could very likely (indeed, very probably) impact/affect other guest operating systems too (such as z/VM).

Basically the LCS device initialization involves several "command frames" (device i/o request/response packets) being sent from the guest (z/VM, z/OS, etc) to the LCS device (hardware) which the LCS device then "replies" to back to the guest.

The command frames -- of which there are several and which together constitute an "initialization handshake sequence" -- are of course handled by Hercules's LCS device handler logic which is of course emulating the hardware.

But I believe these command frames are being sent out (from the guest o/s to Hercules) in a completely arbitrary sequence that only happens by pure coincidence to be USUALLY in the same sequence MOST of the time that Hercules's LCS logic expects. That is to say, Hercules LCS handler expects command frame 'A' to be sent first, followed by command frame 'B' and then 'C', etc, so that is how Hercules is coded to expect them.

Sometimes however, the command frames are sent (and thus arrive) in a different sequence. A sequence that Hercules's LCS device handler does not expect. When that happens, Hercules ends up waiting for what it believes should be the last/final command frame to arrive before it can enable/activate the device so normal packet flow can begin. But since that packet (that Hercules is waiting on) has already been sent, Hercules ends up waiting forever for a command frame which will never arrive! (The guest already sent that command frame, but it sent it when Hercules was not expecting it so Hercules ignored it or missed it.)

This of course is just a guess based on reported behavior. It was one of the things we noticed occurring during the development/testing of the QETH/QDIO device driver: the command frame wouldn't always arrive in the expected order/sequence.

It is a known flaw in the design of not only the LCS device handler but also the QETH device handler as well. We talked about trying to separate the lower level command-frame handling logic and host device handling logic from the higher level device emulation logic in Hercules so that all Hercules networking device handlers could share the same code (so that we would only have to fix such problems in one place instead of two or more), but nothing ever came from it. No one (myself included) has found the time to work on it yet.

ghost commented 9 years ago

Florian:

I need you to try your tests again with the latest Hyperion.

I committed some changes to the LCS debugging logic today and need you to try your tests again and then send me your resulting Hercules logs.

Please be sure to enable the debug flag on your LCS device statements. ("-d" device statement option)

I need to see one or more successful runs, and one or more unsuccessful runs. I would like to see the different debug messages for each. Just attach them to an email and send them to me. One (or more!) good runs, and one (or more!) bad runs. Thanks.

I also have z/VM too (5.3 Evaluation Edition), but I cannot remember how to configure it for LCS devices. If you (or someone else reading this!) could also help me configure my z/VM to use LCS devices, then I would also be able to debug the problem myself too, without your help.

Since you are able to cause the problem yourself however, it would be faster if you could simply try the latest Hyperion and then send me your results.

Thanks.

Fish

ghost commented 9 years ago

Florian:

I received your success and failure Hercules logs. Thank you.

I see something in the LCS code that might explain what I'm seeing in your log files, but I am not sure whether it's the problem or not. I need more data to see if there is a pattern.

Please send me 3 or 4 more success cases, and 3 or 4 more failure cases.

Thanks!

Fish-Git commented 9 years ago

Florian Bilek wrote:

I did my regular update of the openSUSE to activate some patches provided by SuSE and then restarted Hercules, there was absolutely NO initialisation of LCS. So at this run, I did not saw any trace entries. Of course the LCS did not work.

So I shut down z/VM and Hercules. When you look at the log at 20:12:12, after I had initiated the shutdown of z/VM Hercules reported that it had started LCS thread.

The following IPL initiated the LCS properly. and it worked immediately.

I have the impression that under some conditions there is no initialisation of the LCS at all which would explain that the TCP/IP stack does not work properly. Maybe that is exactly what you mentioned that the command sequences came in different orders.

I attach you the trace file to this message. Maybe that help you.

Yes, I see that. Thank you. The LCS read thread restarted itself -- during SHUTDOWN. (It should only be doing that during STARTUP, not during shutdown!)

As I believe I mentioned in a previous message, I do see a bug which might explain such behavior, but I am also seeing something else very similar that may also be related to the problem too. Thus I would like to continue gathering evidence to try and prove the problem is where I think it is.

I do not like "shooting in the dark" (i.e. guessing). I like to be certain. I want to be able to clearly see the bug so I can be confident I am fixing the actual problem and not just crossing my fingers and hoping I might MAYBE be fixing the problem. I want to be sure.

Since the bug seems to be a race condition (i.e. timing related), I am going to try to develop a faster PTT trace-table debugging technique for the LCS driver rather than the much slower message-based debugging technique we are currently trying to use. Hopefully that will help us to more easily "catch" the bug in action. I've already started making some changes but I am not done yet. When I am done I will send you a new patch to try. Hopefully it will make it easier for the bug to occur (hopefully it will make it easier for us to catch (see) the bug.)

Unfortunately however I am in the middle of working on a much more pressing problem with my own software that a customer is having problems with. Once I finish I will return to working on your problem again.

While you are waiting, please keep sending me any failures you experience. Every bit of evidence helps.

Thank you for your patience. I have not forgotten about you. I am simply temporarily being delayed due to other circumstances. We WILL figure this out! I promise.

Thanks.

Fish-Git commented 9 years ago

(BCC: direct to Florian)

Fish wrote:

Since the bug seems to be a race condition (i.e. timing related), I am going to try to develop a faster PTT trace- table debugging technique for the LCS driver rather than the much slower message-based debugging technique we are currently trying to use.

Florian:

Commit 31bbcb22c3c874085fe3705351b516d10ead13e0 contains the mentioned enhancement to LCS debugging that I'm hoping will make it easier for you to recreate the problem you are experiencing. Please give it a try.

To use it, REMOVE your existing "-d" option from your LCS device statements, and instead ADD the following 'ptt' statement to the System Control Statements section of your configuration file (i.e. anywhere BEFORE your device statements):

ptt lcs2 10000

That will activate PTT tracing for your LCS devices as soon as your configuration file is opened. (You could enter the command manually any time after Hercules has been powered on and before you IPL, but it is best to have it started BEFORE your LCS devices are added. Thus the reason for inserting it directly into your configuration file.)

Then simply IPL z/VM as normal and try to recreate your problem. HOPEFULLY it will be easier to do since the "-d" option will not be enabled. (The -d debug LCS device statement option causes log messages to be issued which slows things down too much and changes the overall timing of things. That's not a good idea when chasing a race condition! Thus the reason for switching to much faster "PTT" table based event logging tracing instead.)

When your problem occurs, enter the 'ptt' by itself (without any arguments) to dump the PTT trace table entries. Try to capture both a successful run as well as a failing run so I can compare the two. Thanks.

Contact me if you have any questions or problems.

Fish-Git commented 9 years ago

Hello Fish,

Thank you for your efforts. I will keep you informed.

Kind regards, Florian

Fish-Git commented 9 years ago

(CC: Harold)

Florian Bilek wrote:

Subject: LCS - race condition - PPT trace

Fish,

Please find hereby a zip with the ptt traces of a good and a bad situation. The bad situation was the first. After compiling and configuring Hercules I started it. The first IPL didn't start the LCS. After the "normal" procedure, shutting down everything and restart, the LCS got active.

Thank you. Your "bad" trace is both interesting and disappointing at the same time.

It is interesting because there is NO evidence whatsoever that z/VM ever issued a read OR write to the LCS device AT ALL! (which is, to say the least, HIGHLY UNUSUAL!)

At the same time, it is also disappointing too, precisely because of this lack of evidence. I was very much hoping to see a write request (for the "Start LAN" LCS command) followed by a read request that was failing due to the race condition. Unfortunately because of the "missing evidence", the results of the test are inconclusive. That is to say, based on the unfortunate LACK of evidence, we cannot logically draw any conclusion from it. (Logical conclusions regarding a given phenomenon cannot validly be reached based on LACK of evidence.)

There is a chance (possibility) that the evidence DID, at one time, exist within the PTT trace table, but was overlaid by subsequent trace events. That is to say, there is a chance our specified PTT trace table size is too small. You may need to try again, but specifying 50000 as the table size instead (or possibly even 100000 or more (250000?)): e.g. ptt lcs2 250000

Some important notes:

Please do NOT use the "log" command to create separate Hercules log files! I would rather see EVERYTHING together as one continuous log file, from Hercules start to Hercules exit.
Hercules should be built WITH Rexx support (--enable-object-rexx or --enable-regina-rexx) (I prefer object rexx myself), and the "ptt" command should then be added to the beginning of your Hercules configuration file.

While you are getting yet more traces() for me, I am going to try writing a stand-alone program (using the ASMA tool of Harold's fantastic "SATK"! (Stand-alone Tool Kit(*)) to perform I/O to an LCS device in an attempt to try and recreate your problem for myself on my system here at home. (I have not yet been able to get TCP/IP working on my z/VM 5.3 system yet. I was working last year or the year before, but I did something to break it and I'm unable to figure out what it is or how to fix it. I've forgotten too much.)

I'm rather excited about attempting it since this will be the first time I've actually tried using Harold's SATK product. Until now I haven't had time to play with it nor a valid reason. Now I do! I'm hoping it'll be as much fun as it looks. If I'm successful, then I won't need to bother you any longer, Florian, as I will then be able to get my own traces and resolve the problem myself without your help (which is better for both you and me and also much faster too!).

Harold? I'm counting on your help in case I have any questions or problems, buddy!

So... Please TRY to get me some more traces, Florian (Pretty Please?), and I promise to TRY and get a stand-alone program written to recreate your problem on my own system here at home. Fair enough?

As always, I thank you for the incredible amount of patience you have shown throughout this ordeal and thank you in advance for continuing it.

I'll let you know when I have anything to report and I trust you will do the same.

Thanks.

"Fish" (David B. Trout) Software Development Laboratories http://www.softdevlabs.com mail: fish@softdevlabs.com

(*) I'm sorry to keep asking you for more traces and more traces and try again and try again, but I am trying very hard to PROVE the bug I believe I am seeing in the code is "THE bug" (i.e. is the bug that is causing your problem). I hope you understand. I don't like guessing. I like KNOWING.

(**) https://github.com/s390guy/SATK

Fish-Git commented 9 years ago

(ATTACHMENT)

Fish wrote:

[...]

While you are getting yet more traces() for me, I am going to try writing a stand-alone program (using the ASMA tool of Harold's fantastic "SATK"! (Stand-alone Tool Kit(*)) to perform I/O to an LCS device in an attempt to try and recreate your problem for myself on my system here at home. [...] I'll let you know when I have anything to report and I trust you will do the same.

Thanks.

Please try the attached patch.

It's a shot in the dark, but I'm hoping it might fix your problem.

I am going to continue with my effort to try and reproduce your problem using Harold's SATK, but since it might take me a long time, we might as well try SOMETHING. (A shot in the dark is better than no shot at all!)

Please apply it to current git, and let me know if it fixes your problem or not.

Thanks!

Fish-Git commented 9 years ago

Great, it helped. Thank you.

I just started z/VM and LCS worked immediately.

I am happy to hear that.

I have attached the log so you could check if the patch is working correctly. PTT is on during the run.

Thank you. It looks fine.

It is of course impossible to prove the bug has been fixed with 100% certainty. We can only "prove" it has been fixed to a very high degree of confidence (e.g. 99.99%+) by trying your test many, many times without the bug occurring.

So you need to keep the PTT active for the next several (days? weeks?) so we can catch the bug in case it has not been fixed yet. if the bug does not occur after several (days? weeks?), then we can confidently say we are 99.999% confident the bug has been fixed.

If the bug DOES happen again (I hope it does not!), then please send me the PTT trace immediately!

If the bug does NOT happen again after several (days? weeks?), THEN... you need to disable to PTT trace and keep trying again for another several (days? weeks?). Do you understand?

We need to continue trying under TWO conditions:

For several days/weeks with ptt ENABLED.
For another several days/weeks with ptt DISABLED.

Only after step 2 can we be sure the bug has been fixed.

Thanks.

Please keep me informed.

Fish-Git commented 9 years ago

Fish,

OK, I will do so you told. Hope this bug is now solved. It is there since a long time.

Thanks for you support.

Kind regards, Florian

Fish-Git commented 9 years ago

Thank you for your recent logs. They were very helpful.

Based on the evidence they provide I am now convinced to a very high degree of certainty that the problem is not Hercules related. The evidence points strongly to a timing related bug within z/VM 5.4 (or your installation/configuration/customization of it).

I will commit my patch and close this issue.

Thank you again for your cooperation and patience.

fbi-ranger commented 9 years ago

Hi Fish,

Thank you for the info. Is it then not possible to do a serialisation of the processes of what ever the cause of it might be? What do you think should I investigate in the setup? Any hints?

I think it is a good idea to commit the patches since they do no harm and I have the impression that the situation does not happen so much any more as without them.

Could the problem maybe related to compiling options of Hercules? I use some specific optimizing options of GCC. They are from the past as the machine I had previously was not as fast as the modern machines now a day. I need performance for the guest systems. Therefore any optimization helps.

Thank you.

Florian

Fish-Git commented 9 years ago

Florian wrote:

Hi Fish,

Thank you for the info. Is it then not possible to do a serialisation of the processes of what ever the cause of it might be?

If we knew what the cause was, yes, it might be possible.

What do you think should I investigate in the setup? Any hints?

Yes. It has something to do with the Cross System Extensions (CSE) area on volume VMS19E (device 5102).

It each of the last two failures, there was a CSE I/O error logged on that volume by z/VM:

HCPXLE2890E CSE I/O error, code 01, volume VMS19E, device 5102; LINK commands may fail.

The fact that there was no corresponding I/O error logged by Hercules too, provides strong evidence of a problem within z/VM. (In each "good" run, the error message is not there.)

I don't know anything at all about CSE or SSI (Single System Image clusters), but you might want to investigate the following links I found on IBM's web page. They recommend using SSI and not CSE:

HCP2890E CSE I/O error... Setting Up Cross System Extensions (CSE) CSE Area

It is also possible that the CSE area on volume VMS19E (device 5102) might somehow be damaged. Is there a way to rebuild it?

Location of the CSE Area Default Values for the CSE Area Format of the CSE Area on an ECKD Volume

In addition to the HCPXLE2890E z/VM error message, there is also a subtle difference in the sequence of z/VM's network initialization/startup too. It appears certain networking subsystems/components (such as the DNS server) are being started too soon or too late, so you might want to investigate each startup proc (vm userid profile exec?) involved in z/VM's network initialization and introduce some delays to ensure networking always starts in the exact same sequence every time.

I think it is a good idea to commit the patches since they do no harm and I have the impression that the situation does not happen so much any more as without them.

Agreed. I already committed the change several days ago but forgot to push it to the GitHub Hyperion repository due to other distractions. I will try to remember to do that today.

Could the problem maybe related to compiling options of Hercules?

I doubt it, but I suppose anything is possible. I cannot remember for sure right now, but I believe we have seen certain versions of gcc cause problems with Hercules in the past? Yes?

I use some specific optimizing options of GCC.

Which are?

They are from the past as the machine I had previously was not as fast as the modern machines now a day. I need performance for the guest systems. Therefore any optimization helps.

What optimization options are you using?

Thank you.

Florian

You are very welcome.

Good luck!

fbi-ranger commented 9 years ago

Hi Fish,

I use the following GCC options:

-fomit-frame-pointer -O3 -fno-strict-aliasing -ggdb3 -march=native -mfpmath=sse -msse4.1 -fexpensive-optimizations"

But it would be strange that it has only side effects on certain functionalities such as LCS.

Kind regards, Florian