hercules-390 / hyperion


Socket problem when using BSC lines in Hercules V4 #115

Open taylorm360 opened 8 years ago

taylorm360 commented 8 years ago

I have been experimenting with RSCS and have built it on two VM/370 systems. I found an old H390-VM forum entry from 2004 which had some configuration details, both for RSCS LINK entries and Hercules conf entries. I have tried this in Hercules 3.07, 3.12 and the current V4, and all produce the same non-blocking error. Note that the messages here are from 3.07; the results are the same in V4.

RSCS starts up with no problem, but the trouble starts when trying to start a BSC link between the systems. At this point I get the following message:

  HHCCA001I 0040:Connect out to 127.0.0.1:7800 failed during initial status :
                 A non-blocking socket operation could not be completed immediately.

The START commands used are:

START SYSTEMB PARM MAS B256 RSCS
START SYSTEMA PARM HAS B256 RSCS

(those commands came from the 2004 forum entry)

My setup is as follows:

Environment:


SYSTEM A:

AXSLINKS:

    GENLINK ID=SYSTEMA,TYPE=DMTSML                             
    GENLINK ID=SYSTEMB,TYPE=DMTSML,KEEP=5,LINE=0B1,TASK=SML1   

CONF File:

    040    2703 dial=NO lhost=localhost lport=7800 rhost=localhost rport=7801

Console Message:

    HHCCA002I 0040:Line Communication thread 0000045C started
    HHCCA005I 0040:Listening on port 7800 for incoming TCP connections
    ...
    MSG FROM RSCS    :        DMTREX000I RSCS (REL 6, LEV 0, 05/13/16) READY
    HHCCA001I 0040:Connect out to 127.0.0.1:7800 failed during initial status :
           A non-blocking socket operation could not be completed immediately.

SYSTEM B:

AXSLINKS:

    GENLINK ID=SYSTEMB,TYPE=DMTSML                             
    GENLINK ID=SYSTEMA,TYPE=DMTSML,KEEP=5,LINE=0B1,TASK=SML1

CONF File:

    040    2703 dial=NO lhost=localhost lport=7801 rhost=localhost rport=7800

Console Message:

    HHCCA002I 0040:Line Communication thread 000008F8 started
    HHCCA005I 0040:Listening on port 7801 for incoming TCP connections
    ...
    MSG FROM RSCS    :        DMTREX000I RSCS (REL 6, LEV 0, 05/13/16) READY
    HHCCA001I 0040:Connect out to 127.0.0.1:7801 failed during initial status : 
           A non-blocking socket operation could not be completed immediately.

ivan-w commented 8 years ago

A couple of things look awkward here !

First is the EWOULDBLOCK errno. This indicates that there are insufficient resources (in this case probably the inability to allocate a transient port number). Could it be a misinterpretation of an error returned by Winsock? The correct errno for the successful initiation of a connect over a non-blocking socket is EINPROGRESS.

The second head-scratcher is the discrepancy with the remote port: rport is given as 7800 in the line definition (in the second example), but the message indicates that a connect is attempted to port 7801.

I looked at commadpt.c - but this is code which probably hasn't been changed for over 10 years !

I'll see what happens on a linux host. I don't need RSCS... All that is needed is to issue an enable (0x27) CCW on both sides...

Ivan

taylorm360 commented 8 years ago

Ivan, sorry for the confusion over the port numbers shown in my post. The port number in the error message for SYSTEM A should be 7801, with the respective port in the SYSTEM B message being 7800.

Martin.

taylorm360 commented 8 years ago

Not my day: I closed the issue in error, as I'm new to GitHub. I have re-opened it.

PeterCoghlan commented 8 years ago

I believe this is an issue with read_socket() and write_socket() in hsocket.c. As far as I can tell these have been broken for a very long time now in both the release versions and in hyperion. My understanding (which may need updating) is as follows:

socket(), read_socket(), write_socket() and close_socket() were originally (back in the version 3.04.1 timeframe) defined as macros in hmacros.h whose purpose was to provide a platform independent way of dealing with tcp/ip sockets. By version 3.06 the original purpose had been forgotten, a new function had been invented for them and they had been converted into functions in hsocket.c.

The new function invented for read_socket() was to return a given number of bytes, whether or not they were available from the network, and to keep trying to get them from the network even though they don't appear. This is at odds with the way non-blocking sockets should work in a multithreaded application like Hercules.

I believe the EWOULDBLOCK error is the tcp/ip stack saying there is no more data available and if this was a blocking socket, it would now block and wait for more data to arrive. I think this is an expected condition which is incorrectly being treated as an error.
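To illustrate the point above, here is a minimal POSIX C sketch of a single read attempt that treats "would block" as an expected outcome and not as a failure. It is not the hsocket.c code; the names `nb_status`, `classify_read` and `read_once` are hypothetical, and a Winsock build would need the equivalent Winsock calls and error codes instead.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes, not taken from Hercules. */
enum nb_status { NB_DATA, NB_NO_DATA_YET, NB_CLOSED, NB_ERROR };

/* Classify the outcome of one read() on a non-blocking socket.
   rc is the value read() returned, err is errno captured right after. */
enum nb_status classify_read(ssize_t rc, int err)
{
    if (rc > 0)
        return NB_DATA;            /* some bytes arrived */
    if (rc == 0)
        return NB_CLOSED;          /* peer closed the connection */
    if (err == EWOULDBLOCK || err == EAGAIN)
        return NB_NO_DATA_YET;     /* expected: nothing to read right now */
    return NB_ERROR;               /* anything else is a real error */
}

/* One read attempt. It never loops waiting for the full length;
   the caller is expected to come back after select() reports readable. */
enum nb_status read_once(int fd, void *buf, size_t len, size_t *got)
{
    ssize_t rc = read(fd, buf, len);
    int err = errno;               /* save errno before anything else runs */

    *got = (rc > 0) ? (size_t)rc : 0;
    return classify_read(rc, err);
}
```

With something like this, a caller that gets `NB_NO_DATA_YET` simply goes back to its select() loop and does not report a failure.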

As far as I recall, the way write_socket() was implemented, it ends up losing information for the caller about exactly how many bytes have been written to the network at the point when buffers become full, making it impossible for the caller to know how much to retry later once the buffer full condition is reached.
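As a rough sketch of what the caller would need instead (POSIX C, not the existing write_socket(); the name `write_some` and its return convention are assumptions made up for this example), a write helper can report how many bytes were accepted before the buffer-full condition was hit:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper.
   Returns  0 when all len bytes were written,
            1 when the socket would block (retry later from buf + *written),
           -1 on a real error (errno holds the cause when write() failed).
   *written always holds the number of bytes accepted so far. */
int write_some(int fd, const void *buf, size_t len, size_t *written)
{
    const char *p = buf;

    *written = 0;
    while (*written < len) {
        ssize_t rc = write(fd, p + *written, len - *written);

        if (rc > 0) {
            *written += (size_t)rc;    /* partial writes are normal */
            continue;
        }
        if (rc < 0 && errno == EINTR)
            continue;                  /* interrupted, just try again */
        if (rc < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
            return 1;                  /* buffers full, progress is kept */
        return -1;                     /* real error, or unexpected 0 */
    }
    return 0;
}
```

Because `*written` survives the would-block return, the caller knows exactly where to resume once select() reports the socket writable again.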

I made some attempts to fix these functions but always ended up needing more changes than I was able to cope with. I think the use of them throughout Hercules needs to be carefully examined and a proper solution thought out.

In the meantime, I have made some workarounds in commadpt.c on my system which avoid the use of read_socket() and write_socket() in order to get RSCS to work correctly for me.

I haven't looked at this lately but I think it is the case that some of the sockets used in Hercules are blocking sockets (but used for applications which typically don't result in them blocking) and some are non-blocking. I suspect we really should have all sockets set to non-blocking mode and recode in such a way that we never try to wait on them receiving data from the network.

PeterCoghlan commented 8 years ago

After some investigating and reading the original posting more carefully, it appears I am mixing this problem up with a different issue which does not arise until after the connection is established. Sorry about that.

Regarding the reported issue, the online help on my system which is neither Windows nor linux but whose networking contains features common to both, lists the following two among the errors returned by connect():

EINPROGRESS O_NONBLOCK is set for the file descriptor for the socket, and the connection cannot be immediately established; the connection will be established asynchronously.

EWOULDBLOCK The socket is nonblocking, and the connection cannot be completed immediately. It is possible to use the select() function to select the socket for writing.

I am not sure if there is significance in the difference between the connection being established and the connection being completed but it sounds like they amount to much the same thing from a point of view of what to do when they occur.

Perhaps a check needs to be made for either of these returns after calling connect()? I don't think there would be any adverse effects of doing this even on a system which only ever returns one of them.

n00dle042 commented 5 years ago

Similar issue, perhaps, using 4.0 on debian linux -- the TCP layer connection comes up between 2703s (whether localhost different ports or different hosts altogether) but when I START the lines, nothing actually goes across at all. More information to come...

PeterCoghlan commented 5 years ago

n00dle042, did you resolve your issue?

If you've got past that one, there are also some bugs in DMTSML in VM/370 RSCS which will cause problems further on. I have fixes for some of them if you want them.

n00dle042 commented 5 years ago

I have worked around the issue by passing the endpoints through socat, as helpfully pointed out by another interested party. Are the fixes you mentioned in the SixPack 1.3 beta 3?

PeterCoghlan commented 5 years ago

My fixes are not in SixPack 1.3 beta 3 (there are no updates to RSCS in the SixPack betas). They will be available on my website when I get them organised properly. They are part of larger updates to implement NJE in VM/370 RSCS, which complicates things a bit. I can email you a rough and ready version if you like.

n00dle042 commented 5 years ago

I would. I was about to dive in and see if I could implement NJE myself based on HUJI code and my reading of DMTSML.ASSEMBLE. :D This would help me and maybe I can help the project in general.

Cheers, Chris.

PeterCoghlan commented 5 years ago

I can't find an email address for you. Can you send an email to vm370 at beyondthepale.ie please?

PeterCoghlan commented 5 years ago

Hi Chris,

I've received two emails from you. I've replied to both and your mail provider has accepted my replies. It doesn't look like you are seeing them though. Can you check any spam filtering etc at your end?