irods / python-irodsclient

A Python API for iRODS

PRC connects to wrong host when using loadbalancer #617

Closed luijs closed 2 months ago

luijs commented 2 months ago

I am testing a setup where I have a load balancer in front of an iRODS instance. iRODS runs on host hostname.localdomain.com, version 4.3.1. I run PRC version 2.0.1 on Python 3.8.10. On the load balancer we have irods.publicdomain.com, which forwards to hostname.localdomain.com. The iRODS server has a certificate that is valid only for irods.publicdomain.com.

If I now connect via the PRC to irods.publicdomain.com, I get the following error:

KeyError: 'pop from an empty set'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "putfile.py", line 42, in <module>
    create_file_in_irods(session,'/zone/home/user',filename,'one')
  File "putfile.py", line 33, in create_file_in_irods
    irodssession.data_objects.put(filename, "{}/{}".format(collname,filename),
  File "/home/user/.local/lib/python3.8/site-packages/irods/manager/data_object_manager.py", line 200, in put
    with self.open(obj, 'w', **options) as o:
  File "/home/user/.local/lib/python3.8/site-packages/irods/manager/data_object_manager.py", line 430, in open
    conn = directed_sess.pool.get_connection()
  File "/home/user/.local/lib/python3.8/site-packages/irods/pool.py", line 17, in method_
    ret = method(self,*s,**kw)
  File "/home/user/.local/lib/python3.8/site-packages/irods/pool.py", line 78, in get_connection
    conn = Connection(self, self.account)
  File "/home/user/.local/lib/python3.8/site-packages/irods/connection.py", line 62, in __init__
    self._server_version = self._connect()
  File "/home/user/.local/lib/python3.8/site-packages/irods/connection.py", line 308, in _connect
    self.ssl_startup()
  File "/home/user/.local/lib/python3.8/site-packages/irods/connection.py", line 210, in ssl_startup
    wrapped_socket = context.wrap_socket(self.socket,
  File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.8/ssl.py", line 1069, in _create
    self.do_handshake()
  File "/usr/lib/python3.8/ssl.py", line 1338, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'hostname.localdomain.com'. (_ssl.c:1145)

Trying it on a different machine I get:

irods.exception.NetworkException: Could not connect to specified host and port: hostname.localdomain.com:1247

The Python code is:

import os
import ssl

from irods.session import iRODSSession
from irods.models import Collection
from irods.models import DataObject
from irods.column import Criterion
import irods.keywords as kw
import sys

print(sys.argv)
filename = sys.argv[1]

try:
    env_file = os.environ['IRODS_ENVIRONMENT_FILE']
except KeyError:
    env_file = os.path.expanduser('~/.irods/irods_environment.json')

ssl_context = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None,cadata=None)
ssl_settings = {"ssl_context": ssl_context,
                'client_server_negotiation': 'request_server_negotiation',
                'client_server_policy': 'CS_NEG_REQUIRE',
                'encryption_algorithm': 'AES-256-CBC',
                'encryption_key_size': 32,
                'encryption_num_hash_rounds': 16,
                'encryption_salt_size': 8}

def create_file_in_irods(irodssession, collname, filename, resource):
    irodssession.data_objects.put(filename, "{}/{}".format(collname,filename),
                                              **{kw.DEST_RESC_NAME_KW: resource, kw.VERIFY_CHKSUM_KW: ''})

with iRODSSession(irods_env_file=env_file,**ssl_settings) as session:
    create_file_in_irods(session,'/zone/home/user',filename,'one')

Somehow the PRC seems to be getting hostname.localdomain.com from iRODS and using it to connect, instead of the irods.publicdomain.com that I put in my irods_environment.json. I think it should connect only to irods.publicdomain.com.

There might be some setting in iRODS itself that I need to change, but with the iCommands my setup works as designed; with the PRC it does not.

trel commented 2 months ago

Try setting "irods_ssl_verify_server": "cert" in the environment on the client side.

This is a possible workaround and may not address an actual bug.
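For readers unfamiliar with that setting: with Python's standard `ssl` module, verifying the certificate chain while skipping the hostname match looks roughly like the sketch below. The helper name `make_cert_only_context` is hypothetical, and whether the PRC applies `irods_ssl_verify_server` this way when an explicit `ssl_context` is also passed to `iRODSSession` is an assumption, not something confirmed in this thread.

```python
import ssl


def make_cert_only_context(cafile=None):
    """Build a client-side context that validates the certificate chain
    but does not compare the certificate against the connecting hostname.

    Hypothetical helper for illustration; not part of python-irodsclient.
    """
    ctx = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH, cafile=cafile)
    # Skip hostname matching, but still require a certificate that chains to a trusted CA.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


ctx = make_cert_only_context()
print(ctx.check_hostname, ctx.verify_mode)
```

Such a context could be passed as the `ssl_context` entry of the `ssl_settings` dictionary used in the script above. It relaxes a security check, so it only suits diagnosis or environments where that trade-off is acceptable.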

luijs commented 2 months ago

In this demo that should indeed work. However, a user who cannot reach hostname.localdomain.com will not be able to use the PRC. Also, I am using host access control, with which I want to block all access that does not come via the load balancer or from hostname.localdomain.com itself.

d-w-moore commented 2 months ago

Sounds like the iCommands do some translation of the connecting hostname for load-balancing purposes. Is this something that the SSL interface or other network interfaces (in absence of an SSL connection) provide at lower level or is it a feature of iRODS server connections that this translation is done? @alanking @korydraughn

luijs commented 2 months ago

I tried another thing with the certificate. It looks like the certificate needs to be valid for all possible hostnames. When I had a certificate that was valid only for hostname.localdomain.com, the PRC complained that the certificate was not valid for irods.publicdomain.com. When I had a certificate only for irods.publicdomain.com, it complained that the certificate was invalid for hostname.localdomain.com. When I had a certificate that was valid for both, it started to work. However, this is exactly what I am hoping not to do, as I would rather not expose the name hostname.localdomain.com.
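The behaviour described here can be illustrated with a small check of which hostnames a certificate covers. This is a simplified sketch: the function names are hypothetical, the input is a dictionary shaped like the result of `ssl.SSLSocket.getpeercert()`, and the wildcard handling (a single leading `*.` label) is deliberately cruder than what the TLS library actually does.

```python
def dns_names(cert):
    """Return the DNS subjectAltName entries of a getpeercert()-style dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def covers(pattern, hostname):
    """True for an exact match, or for a leading '*.' wildcard matching one label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == pattern[2:]
    return pattern == hostname


def uncovered_hosts(cert, hostnames):
    """List the hostnames that no DNS name in the certificate covers."""
    names = dns_names(cert)
    return [host for host in hostnames if not any(covers(name, host) for name in names)]


# A certificate valid only for the public name, as in the report above.
cert = {"subjectAltName": (("DNS", "irods.publicdomain.com"),)}
missing = uncovered_hosts(cert, ["irods.publicdomain.com", "hostname.localdomain.com"])
print(missing)  # ['hostname.localdomain.com']
```

Any hostname in the returned list would fail hostname verification if the client connected to it, which matches the two complementary errors reported above.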

korydraughn commented 2 months ago

It seems you may have uncovered an issue with client redirection, but until we can confirm, that's only a theory.

What is the size of the file you're uploading?

luijs commented 2 months ago

The file above was just 2KB, but it also went wrong when creating a collection.

korydraughn commented 2 months ago

You mentioned the icommands work.

Please try to upload a 40MB file using iput and let us know what happens.

> The file above was just 2KB, but it also went wrong when creating a collection.

Hmm, collections are virtual and do not require redirection. That means it may not be client redirection at play here.

alanking commented 2 months ago

> Hmm, collections are virtual and do not require redirection. That means it may not be client redirection at play here.

Perhaps when connected to a catalog service consumer a redirection occurs to a catalog service provider in order to register the collection in the catalog? That's the only situation I can think of where that would happen, though.

trel commented 2 months ago

Easy enough to test? But... wouldn't the consumer do that with a separate server-to-server connection, rather than having the client do that?

korydraughn commented 2 months ago

Correct. The servers redirect to the provider to carry out database operations.

When I speak of client redirection, I'm referring to the PRC's ability to find and connect to the destination resource server (for reads/writes). I don't expect the PRC to perform client redirection to create collections.

iput does a similar thing using the high ports when the size of the transfer exceeds 32MB.

alanking commented 2 months ago

Oh, I see. Carry on!

korydraughn commented 2 months ago

> But... wouldn't the consumer do that with a separate server-to-server connection, rather than having the client do that?

Yes.

trel commented 2 months ago

> it also went wrong when creating a collection

Right - this is the most interesting thing to investigate at the moment.

@luijs Can you attempt to create a collection and share the PRC code, the client logs, and the server logs?

luijs commented 2 months ago

This is something where every bit of config can change everything, which makes it quite confusing to test and to be clear on what the current settings are. So here is another 2 cents.

I have to apologize and come back on an earlier statement: a collection can be created! I am sorry for the confusion. I was basing that on my test output, which has quite some boilerplate; when I tested again just now with a very simple setup, collection creation was fine. Maybe the boilerplate was doing something else that triggered the error. If I find something else there that is not a put, I will post it.

I was just testing iput now. I also had problems there before, but it is clear how to solve them. In the /etc/hosts file of the iRODS server I put this line (1.2.3.4 is the locally known IP of the server):

1.2.3.4 public.hostname.com local.hostname.com

With that, iput connects to public.hostname.com when transferring 40MB (iput shows this if you use -V). If you instead use

1.2.3.4 local.hostname.com public.hostname.com

and restart the server afterwards (NB: I feel iRODS will not pick up this change if you don't restart), iput connects to local.hostname.com and thus fails in my case.

For the Python put, however, this change makes no difference: I get the local.hostname.com CERTIFICATE_VERIFY_FAILED in both cases. There also seems to be no difference between sending a 40MB and a 2KB file with the PRC.
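The ordering matters because, in the hosts file format, the first name after the address is the canonical hostname and the remaining names are aliases. The sketch below only parses one such line to make that distinction explicit; the function name is hypothetical, and the idea that the iRODS server reports the canonical name back to iput is an inference from the behaviour described above, not something verified here.

```python
def parse_hosts_line(line):
    """Split one hosts-file entry into (address, canonical_name, aliases).

    Returns None for blank lines, comment-only lines, or entries without a name.
    """
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        return None
    address, canonical, *aliases = fields
    return address, canonical, aliases


first = parse_hosts_line("1.2.3.4 public.hostname.com local.hostname.com")
second = parse_hosts_line("1.2.3.4 local.hostname.com public.hostname.com")
print(first[1])   # public.hostname.com
print(second[1])  # local.hostname.com
```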

korydraughn commented 2 months ago

To confirm there aren't any DNS caching issues, please use the following /etc/hosts settings:

1.2.3.4  public.hostname.com local.hostname.com
  1. Please show a successful upload via iput.
  2. Then immediately attempt to upload another file of the same size using the PRC.

Please use a 40MB file for both uploads.

luijs commented 2 months ago

Tried that, results below: On the server:

myuser@local:~$ sudo nano /etc/hosts
[sudo] password for myuser:
irods@local:/home/WUR/myuser$ irodsctl start

On desktop WSL:

myuser@desktopWSL:~$ iput -KV file40M
From server: NumThreads=4, addr:public.hostname.com, port:20026, cookie=548488194
   file40M                        40.000 MB | 9.813 sec | 4 thr |  4.076 MB/s
myuser@desktopWSL:~$ python3 ~/scripts/putfilesimple.py file40M
file40M
Traceback (most recent call last):
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/pool.py", line 62, in get_connection
    conn = self.idle.pop()
KeyError: 'pop from an empty set'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/myuser/scripts/putfilesimple.py", line 32, in <module>
    create_file_in_irods(session,'/myzone/home/myuser/random1',filename,'one')
  File "/home/myuser/scripts/putfilesimple.py", line 27, in create_file_in_irods
    irodssession.data_objects.put(filename, "{}/{}".format(collname,filename),
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/manager/data_object_manager.py", line 194, in put
    if not self.parallel_put( local_path, (obj,o), total_bytes = sizelist[0], num_threads = num_threads,
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/manager/data_object_manager.py", line 289, in parallel_put
    return parallel.io_main( self.sess, data_or_path_, parallel.Oper.PUT | (parallel.Oper.NONBLOCKING if async_ else 0), file_,
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/parallel.py", line 438, in io_main
    Io = Io()
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/parallel.py", line 49, in __call__
    return self.function(*self.args, **self.keywords)
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/manager/data_object_manager.py", line 430, in open
    conn = directed_sess.pool.get_connection()
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/pool.py", line 17, in method_
    ret = method(self,*s,**kw)
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/pool.py", line 78, in get_connection
    conn = Connection(self, self.account)
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/connection.py", line 62, in __init__
    self._server_version = self._connect()
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/connection.py", line 308, in _connect
    self.ssl_startup()
  File "/home/myuser/.local/lib/python3.10/site-packages/irods/connection.py", line 210, in ssl_startup
    wrapped_socket = context.wrap_socket(self.socket,
  File "/usr/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.10/ssl.py", line 1100, in _create
    self.do_handshake()
  File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'local.hostname.com'. (_ssl.c:1007)

On desktopWSL:

myuser@desktopWSL:~$ ils
/myzone/home/myuser:
  file40M
  C- /myzone/home/myuser/random1

Server log:

 {"log_category":"agent","log_level":"error","log_message":"[-]\t/irods_source/server/core/src/rodsAgent.cpp:705:int runIrodsAgentFactory(sockaddr_un) :  status [SSL_HANDSHAKE_ERROR]  errno [] -- message [failed to call 'agent start']\n\t[-]\t/irods_source/lib/core/src/sockComm.cpp:160:irods::error sockAgentStart(irods::network_object_ptr) :  status [SSL_HANDSHAKE_ERROR]  errno [] -- message [failed to call 'agent start']\n\t\t[-]\t/irods_source/plugins/network/src/ssl.cpp:764:irods::error ssl_agent_start(irods::plugin_context &) :  status [SSL_HANDSHAKE_ERROR]  errno [] -- message [error calling SSL_accept | error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate]\n\n","server_host":"local","server_pid":2060000,"server_timestamp":"2024-09-12T06:17:37.453Z","server_type":"agent","server_zone":"myzone"}
 {"log_category":"server","log_level":"critical","log_message":"Agent factory returned with error code [-2103000].","server_host":"local","server_pid":2060000,"server_timestamp":"2024-09-12T06:17:37.453Z","server_type":"agent","server_zone":"myzone"}
 {"log_category":"agent_factory","log_level":"error","log_message":"Agent process [2060000] exited with status [1].","server_host":"local","server_pid":2059475,"server_timestamp":"2024-09-12T06:17:37.475Z","server_type":"agent_factory","server_zone":"myzone"}

desktopWSL irods_environment.json:

 {
    "irods_authentication_scheme": "pam_password",
    "irods_client_server_negotiation": "request_server_negotiation",
    "irods_client_server_policy": "CS_NEG_REQUIRE",
    "irods_encryption_algorithm": "AES-256-CBC",
    "irods_encryption_key_size": 32,
    "irods_encryption_num_hash_rounds": 16,
    "irods_encryption_salt_size": 8,
    "irods_host": "public.hostname.com",
    "irods_port": 1247,
    "irods_ssl_verify_server": "hostname",
    "irods_user_name": "myuser",
    "irods_zone_name": "myzone"
}
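As a quick way to confirm which host and verification mode the client starts from, the environment file can be read directly. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the file lookup simply mirrors the `IRODS_ENVIRONMENT_FILE` fallback used in the scripts in this thread.

```python
import json
import os


def load_irods_environment(path=None):
    """Load the client environment JSON, using the same fallback as the scripts here."""
    path = (
        path
        or os.environ.get("IRODS_ENVIRONMENT_FILE")
        or os.path.expanduser("~/.irods/irods_environment.json")
    )
    with open(path) as handle:
        return json.load(handle)


def summarize(env):
    """Pick out the settings that decide where and how the first connection is made."""
    keys = ("irods_host", "irods_port", "irods_ssl_verify_server")
    return {key: env.get(key) for key in keys}


# Inline sample mirroring the values posted above, so the sketch runs without a file.
sample = {
    "irods_host": "public.hostname.com",
    "irods_port": 1247,
    "irods_ssl_verify_server": "hostname",
}
print(summarize(sample))
```

With a real file in place, `summarize(load_irods_environment())` would report the same three keys for the active environment.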

putfilesimple.py:

 #!/usr/bin/python3
import os
import ssl
import irods.keywords as kw

from irods.session import iRODSSession
import sys

filename = sys.argv[1]
print(filename)

try:
    env_file = os.environ['IRODS_ENVIRONMENT_FILE']
except KeyError:
    env_file = os.path.expanduser('~/.irods/irods_environment.json')

ssl_context = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None,cadata=None)
ssl_settings = {"ssl_context": ssl_context,
                'client_server_negotiation': 'request_server_negotiation',
                'client_server_policy': 'CS_NEG_REQUIRE',
                'encryption_algorithm': 'AES-256-CBC',
                'encryption_key_size': 32,
                'encryption_num_hash_rounds': 16,
                'encryption_salt_size': 8}

def create_file_in_irods(irodssession, collname, filename, resource):
    irodssession.data_objects.put(filename, "{}/{}".format(collname,filename),
                                              **{kw.DEST_RESC_NAME_KW: resource, kw.VERIFY_CHKSUM_KW: ''})

with iRODSSession(irods_env_file=env_file,**ssl_settings) as session:
    session.collections.create('/myzone/home/myuser/random1')
    create_file_in_irods(session,'/myzone/home/myuser/random1',filename,'one')

local.hostname.com server_config.json:

 {
    "advanced_settings": {
        "default_log_rotation_in_days": 5,
        "default_number_of_transfer_threads": 4,
        "default_temporary_password_lifetime_in_seconds": 120,
        "delay_rule_executors": [],
        "delay_server_sleep_time_in_seconds": 30,
        "dns_cache": {
            "eviction_age_in_seconds": 3600,
            "shared_memory_size_in_bytes": 5000000
        },
        "hostname_cache": {
            "eviction_age_in_seconds": 3600,
            "shared_memory_size_in_bytes": 2500000
        },
        "maximum_size_for_single_buffer_in_megabytes": 32,
        "maximum_size_of_delay_queue_in_bytes": 0,
        "maximum_temporary_password_lifetime_in_seconds": 1000,
        "number_of_concurrent_delay_rule_executors": 4,
        "stacktrace_file_processor_sleep_time_in_seconds": 10,
        "transfer_buffer_size_for_parallel_transfer_in_megabytes": 4,
        "transfer_chunk_size_for_parallel_transfer_in_megabytes": 40
    },
    "catalog_provider_hosts": [
        "public.hostname.com"
    ],
    "catalog_service_role": "provider",
    "client_api_allowlist_policy": "enforce",
    "controlled_user_connection_list": {
        "control_type": "denylist",
        "users": []
    },
    "default_dir_mode": "0750",
    "default_file_mode": "0600",
    "default_hash_scheme": "SHA256",
    "default_resource_name": "hot_1",
    "environment_variables": {},
    "federation": [],
    "host_access_control": {
        "access_entries": [
            {several entries not mentioned here}
        ]
    },
    "host_resolution": {
        "host_entries": [
            {
                "address_type": "local",
                "addresses": [
                    "public.hostname.com",
                    "1.2.3.4"
                ]
            }
        ]
    },
    "log_level": {
        "agent": "info",
        "agent_factory": "info",
        "api": "info",
        "authentication": "info",
        "database": "info",
        "delay_server": "info",
        "legacy": "info",
        "microservice": "info",
        "network": "info",
        "resource": "info",
        "rule_engine": "info",
        "s3_resource_plugin": "info",
        "server": "info",
        "sql": "info"
    },
    "match_hash_policy": "compatible",
    "negotiation_key": "XXX",
    "plugin_configuration": {
        "authentication": {},
        "database": {
            "postgres": {
                "db_host": "XXX",
                "db_name": "XXX",
                "db_odbc_driver": "PostgreSQL ANSI",
                "db_password": "XXX",
                "db_port": 5432,
                "db_username": "XXX"
            }
        },
        "network": {},
        "resource": {},
        "rule_engines": [
            {
                "instance_name": "irods_rule_engine_plugin-python-instance",
                "plugin_name": "irods_rule_engine_plugin-python",
                "plugin_specific_configuration": {}
            },
            {
                "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
                "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
                "plugin_specific_configuration": {
                    "re_data_variable_mapping_set": [
                        "core"
                    ],
                    "re_function_name_mapping_set": [
                        "core"
                    ],
                    "re_rulebase_set": [
                        "core"
                    ],
                    "regexes_for_supported_peps": [
                        "ac[^ ]*",
                        "msi[^ ]*",
                        "[^ ]*pep_[^ ]*_(pre|post|except|finally)"
                    ]
                },
                "shared_memory_instance": "irods_rule_language_rule_engine"
            },
            {
                "instance_name": "irods_rule_engine_plugin-cpp_default_policy-instance",
                "plugin_name": "irods_rule_engine_plugin-cpp_default_policy",
                "plugin_specific_configuration": {}
            }
        ]
    },
    "rule_engine_namespaces": [
        ""
    ],
    "schema_name": "server_config",
    "schema_validation_base_uri": "file:///var/lib/irods/configuration_schemas",
    "schema_version": "v4",
    "server_control_plane_encryption_algorithm": "AES-256-CBC",
    "server_control_plane_encryption_num_hash_rounds": 16,
    "server_control_plane_key": "XXX",
    "server_control_plane_port": 1248,
    "server_control_plane_timeout_milliseconds": 10000,
    "server_port_range_end": 20199,
    "server_port_range_start": 20000,
    "xmsg_port": 1279,
    "zone_auth_scheme": "native",
    "zone_key": "XXX",
    "zone_name": "myzone",
    "zone_port": 1247,
    "zone_user": "XXX"
}

local.hostname.com /etc/hosts:

127.0.1.1       localhost localhost.localdomain
::1             localhost6.localdomain6 localhost6

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

1.2.3.4 public.hostname.com local.hostname.com local

127.0.0.1       dav.localhost  

local.hostname.com hostname command: local

The certificate installed on the iRODS host is valid only for public.hostname.com, not for local.hostname.com or local.

trel commented 2 months ago

Thank you for all of that.

Just to narrow it a bit more... Please try putfilesimple.py after setting the following in desktopWSL irods_environment.json:

"irods_ssl_verify_server": "cert",

We're suspecting PRC is creating/using the ssl_context slightly differently than the iCommands... but it's not clear yet what that difference is.

luijs commented 2 months ago

Weirdly enough, that does not change anything; it still gives me the certificate error for local.hostname.com.

I am not sure the certificate handling is the issue, though: if I install a different certificate that is valid for both local.hostname.com and public.hostname.com, it works normally.

However, if I then disconnect from the VPN, and thus cannot connect to local.hostname.com anymore, I just get irods.exception.NetworkException: Could not connect to specified host and port: local.hostname.com:1247.
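The NetworkException suggests a plain reachability problem rather than a TLS one. A minimal way to check that independently of the PRC is to attempt a TCP connection with the standard library; `can_connect` is a hypothetical helper name, and it tests only that the port accepts connections, not that an iRODS server answers.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Off the VPN, one would expect `can_connect("public.hostname.com", 1247)` to succeed and `can_connect("local.hostname.com", 1247)` to fail in the setup described here, which would confirm that the client is being pointed at the private name.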

trel commented 2 months ago

Okay... So that suggests that your client machine is trying to make a direct connection to that private machine local.hostname.com.

Does iput also fail without the VPN?

And if so... does it work again if you use iput -N0 to disable the high ports and redirection?

luijs commented 2 months ago

Found it!!

iput succeeds without VPN.

I also did some debugging myself in the PRC code. The initial connect seems to go fine; things go wrong only during the put itself. Below you see a lot of connection output, then the collection manager, and then connection again.

Added by me in /home/myuser/.local/lib/python3.10/site-packages/irods/manager/collection_manager.py:: path: /myzone/home/myuser/random1/file40M
Added by me: /home/myuser/.local/lib/python3.10/site-packages/irods/connection.py: address: ('public.hostname.com', 1247)
Added by me: /home/myuser/.local/lib/python3.10/site-packages/irods/connection.py, host: public.hostname.com
Added by me: /home/myuser/.local/lib/python3.10/site-packages/irods/connection.py: self.account.host public.hostname.com
Added by me: /home/myuser/.local/lib/python3.10/site-packages/irods/connection.py: self.account.host public.hostname.com
Added by me in /home/myuser/.local/lib/python3.10/site-packages/irods/manager/collection_manager.py:: path: /myzone/home/myuser/random1
Added by me: /home/myuser/.local/lib/python3.10/site-packages/irods/connection.py: address: ('local.hostname.com', 1247)
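An alternative to editing the installed package is to raise the log level for the PRC's loggers. The sketch below assumes the PRC modules create their loggers with `logging.getLogger(__name__)`, so that they sit under the `irods` logger; that is an assumption about the library's internals, and the helper name is hypothetical. It surfaces only messages the PRC already emits, such as the `redirect_to_host` debug line in the code quoted further down, and in this particular failure the connection attempt raises before that line is reached.

```python
import logging
import sys


def enable_prc_debug_logging(stream=sys.stderr):
    """Attach a DEBUG-level handler to the 'irods' logger hierarchy.

    Assumes python-irodsclient names its loggers after its modules
    (for example 'irods.manager.data_object_manager').
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger = logging.getLogger("irods")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


enable_prc_debug_logging()
```

Calling this once at the top of a script such as putfilesimple.py, before the session is created, would be enough.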

In the error stack I noticed this:

  File "/home/myuser/.local/lib/python3.10/site-packages/irods/manager/data_object_manager.py", line 430, in open
    conn = directed_sess.pool.get_connection()

Lines 424-431 of data_object_manager.py show that some redirection took place:

        if redirected_host and use_get_rescinfo_apis:
            # Redirect only if the local zone is being targeted, and if the hostname is changed from the original.
            if target_zone == self.sess.zone and (self.sess.host != redirected_host):
                # This is the actual redirect.
                directed_sess = self.sess.clone(host = redirected_host)
                returned_values['session'] = directed_sess
                conn = directed_sess.pool.get_connection()
                logger.debug('redirect_to_host = %s', redirected_host)

So I then looked for where redirected_host was set, and I saw this:

        if allow_redirect and conn.server_version >= (4,3,1):
            key = 'CREATE' if mode[0] in ('w','a') else 'OPEN'
            message = iRODSMessage('RODS_API_REQ',
                                   msg=make_FileOpenRequest(**{kw.GET_RESOURCE_INFO_OP_TYPE_KW:key}),
                                   int_info=api_number['GET_RESOURCE_INFO_FOR_OPERATION_AN'])
            conn.send(message)
            response = conn.recv()
            msg = response.get_main_message( STR_PI )
            use_get_rescinfo_apis = True

            # Get the information needed for the redirect
            _ = json.loads(msg.myStr)
            redirected_host = _["host"]
            requested_hierarchy = _["resource_hierarchy"]
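To make the data flow concrete, the parsing step can be reproduced on its own. Only the two keys, `host` and `resource_hierarchy`, come from the excerpt above; the sample payload and the function name are hypothetical, and a real server response may carry additional fields.

```python
import json


def parse_resource_info(payload):
    """Extract the redirect host and resource hierarchy from the JSON string
    returned for GET_RESOURCE_INFO_FOR_OPERATION (keys as in the excerpt)."""
    info = json.loads(payload)
    return info["host"], info["resource_hierarchy"]


# Hypothetical payload shaped after the two keys the PRC reads.
sample = '{"host": "local.hostname.com", "resource_hierarchy": "one"}'
host, hierarchy = parse_resource_info(sample)
print(host, hierarchy)  # local.hostname.com one
```

The `host` value is whatever the server has on record for the resource, so it does not have to match the name the client originally connected to.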

I then realised that the definition of the resource I was writing to still had local.hostname.com in the host. I changed that to public.hostname.com, and then the put finally worked.
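A check along these lines could catch such leftovers after a DNS change. The sketch separates the comparison, which is plain Python, from the catalog query, which uses the PRC's `Resource` model and is imported lazily so the rest runs without the library. The function names and the sample host values are hypothetical; `Resource.name` and `Resource.location` are the PRC query columns for a resource's name and host as far as I know, so treat that part as an assumption to verify against your installed version.

```python
def stale_resources(resources, expected_host):
    """Return the (name, host) pairs whose host differs from the expected one."""
    return [(name, host) for name, host in resources if host != expected_host]


def fetch_resources(session):
    """Query resource names and hosts from the catalog via an open iRODSSession."""
    from irods.models import Resource  # requires python-irodsclient

    query = session.query(Resource.name, Resource.location)
    return [(row[Resource.name], row[Resource.location]) for row in query]


# Hypothetical sample: one resource still registered under the private name.
sample = [("one", "local.hostname.com"), ("hot_1", "public.hostname.com")]
print(stale_resources(sample, "public.hostname.com"))  # [('one', 'local.hostname.com')]
```

With a live session, `stale_resources(fetch_resources(session), "public.hostname.com")` would list the resources a redirecting client could be sent to under the old name. Coordinating resources, which have no real host, would also show up and can be ignored.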

As my resource definitions were not updated after moving to a different DNS name, I guess this might not be a bug, or maybe it is. Or it might be a bug in the iCommands, since they did not give errors there.

trel commented 2 months ago

Oh - amazing. So the catalog actually still held the local.hostname.com name...

Yes, now I'm wondering how iput was succeeding.

alanking commented 2 months ago

I think we can close this alongside #627. Thoughts?

korydraughn commented 2 months ago

I think I agree given the fact that this issue involves a load balancer.

trel commented 2 months ago

Agreed. Closing.

trel commented 2 months ago

Will mark #627 as duplicate as well.