ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.42k stars 544 forks source link

fix(strings): make `StringValue.capitalize()` consistent across backends #8270

Closed NickCrews closed 4 months ago

NickCrews commented 4 months ago

Fixes #8271.

BREAKING CHANGE: Backends that previouslly used initcap (analogous to str.title) to implement StringValue.capitalize() will produce different results when the input string contains multiple words (a word's definition being backend-specific).

NickCrews commented 4 months ago

@jcrist I wrote this when I thought the intended behavior was like .title(). But I think you are right, let me change these test cases.

NickCrews commented 4 months ago

Wow, after the sqlglot refactor this is SO much cleaner. Love it, thanks so much for your work on that big rewrite!

NickCrews commented 4 months ago

This might deserve a callout fairly prominently in the release notes, because although it might technically be a fix, that is only due to a poor definition of what capitalize was supposed to do. I bet some users' results are going to change as a result of this. There was even a test in the postgres backend testing for the old semantics that I had to deleted.

NickCrews commented 4 months ago

It appears that exasol treats || as equivalent to concat(), which is not ANSI standard SQL when you have NULL || 'abc': You should get NULL, but exasol gives 'abc'. We already have a workaround for this in ops.StringConcat compiler for the exasol backend. Maybe we should redo this implementation to not compile directly with the DPipe, but instead rewrite the entire op.Capitalize in terms of op.StringConcat, op.Upper, etc ?

NickCrews commented 4 months ago

I filed a support ticket with exasol and asked them to chime in here. Curious if they will fix their implementation, or if we will need to have a custom workaround forever.

NickCrews commented 4 months ago

@jcrist @cpcloud OK I'm a little stumped as to the reason for the failing tests, since they should be using StringConcat and Substring, which both look tested on these backends. I can't run any of these backends locally since I don't have the creds. Do you think either of you could finish this up?

NickCrews commented 4 months ago

also, it appears that many backends have that "bug" that exasol does, so maybe it's not a bug.

gforsyth commented 4 months ago

I can't run any of these backends locally since I don't have the creds.

Except for BigQuery and Snowflake, where there are real credentials that are necessarily secret, you can find the credentials for all the backends in their respective conftest.py files. I'll often run the test suite with a pytest -m exasol and then kill it after the first test passes, so that the test startup has a chance to load the test data into the backend container.

NickCrews commented 4 months ago

I try just up and see

$ just up
docker compose up --wait 
[+] Building 0.0s (0/0)                                                                                                                   docker:desktop-linux
[+] Running 17/18
 ✔ Container ibis-flink-jobmanager-1   Created                                                                                                            0.0s 
 ✘ Container ibis-minio-1              Error                                                                                                              0.0s 
 ✔ Container ibis-statestored-1        Started                                                                                                            0.0s 
 ✔ Container impala-hive-metastore     Created                                                                                                            0.0s 
 ✔ Container ibis-mysql-1              Created                                                                                                            0.0s 
 ✔ Container druid-postgres            Created                                                                                                            0.0s 
 ✔ Container ibis-kudu-1               Started                                                                                                            0.0s 
 ✔ Container ibis-postgres-1           Running                                                                                                            0.0s 
 ✔ Container ibis-mssql-1              Created                                                                                                            0.0s 
 ✔ Container ibis-hive-metastore-db-1  Created                                                                                                            0.0s 
 ✔ Container zookeeper                 Created                                                                                                            0.0s 
 ✔ Container ibis-exasol-1             Created                                                                                                            0.0s 
 ✔ Container ibis-clickhouse-1         Running                                                                                                            0.0s 
 ✔ Container ibis-flink-1              Created                                                                                                            0.0s 
 ✔ Container ibis-hive-metastore-1     Created                                                                                                            0.0s 
 ✔ Container coordinator               Created                                                                                                            0.0s 
 ✔ Container ibis-trino-1              Created                                                                                                            0.0s 
 ⠦ Container ibis-oracle-1             Starting                                                                                                           0.5s 
Error response from daemon: mkdir /var/lib/docker/overlay2/9e12c9117dc506231210816af71282562ba269c4bd946ce870c4ad65316c75cc/merged: no space left on device
error: Recipe `up` failed on line 101 with exit code 1

so I am ignoring the minio error (maybe a separate issue?), since exasol looks to be working, and then try pytest -m exasol -k capitalize ibis/backends/tests/test_string.py

and then get a bunch of

__________________________________________________ ERROR at setup of test_capitalize[exasol-aBc1dEf-Abc1def] __________________________________________________

self = <ExaConnection session_id= dsn=localhost:8563 user=sys>

    def _init_ws(self):
        """
        Init websocket connection
        Connection redundancy is supported
        Specific Exasol node is randomly selected for every connection attempt
        """
        dsn_items = self._process_dsn(self.options['dsn'])
        failed_attempts = 0

        ws_prefix = 'wss://' if self.options['encryption'] else 'ws://'
        ws_options = self._get_ws_options()

        for hostname, ipaddr, port, fingerprint in dsn_items:
            self.logger.debug(f"Connection attempt [{ipaddr}:{port}]")

            # Use correct hostname matching IP address for each connection attempt
            if self.options['encryption']:
                ws_options['sslopt']['server_hostname'] = hostname

            try:
>               self._ws = websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)

.venv/lib/python3.11/site-packages/pyexasol/connection.py:667: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.11/site-packages/websocket/_core.py:646: in create_connection
    websock.connect(url, **options)
.venv/lib/python3.11/site-packages/websocket/_core.py:256: in connect
    self.sock, addrs = connect(
.venv/lib/python3.11/site-packages/websocket/_http.py:141: in connect
    sock = _open_socket(addrinfo_list, options.sockopt, options.timeout)
.venv/lib/python3.11/site-packages/websocket/_http.py:228: in _open_socket
    raise err
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

addrinfo_list = [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 8563))], sockopt = [], timeout = 10

    def _open_socket(addrinfo_list, sockopt, timeout):
        err = None
        for addrinfo in addrinfo_list:
            family, socktype, proto = addrinfo[:3]
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            for opts in DEFAULT_SOCKET_OPTION:
                sock.setsockopt(*opts)
            for opts in sockopt:
                sock.setsockopt(*opts)

            address = addrinfo[4]
            err = None
            while not err:
                try:
>                   sock.connect(address)
E                   ConnectionRefusedError: [Errno 61] Connection refused

.venv/lib/python3.11/site-packages/websocket/_http.py:205: ConnectionRefusedError

During handling of the above exception, another exception occurred:

request = <SubRequest 'backend' for <Function test_capitalize[exasol-Abc-Abc]>>, data_dir = PosixPath('/Users/nc/code/ibis/ci/ibis-testing-data')
tmp_path_factory = TempPathFactory(_given_basetemp=None, _trace=<pluggy._tracing.TagTracerSub object at 0x152821510>, _basetemp=PosixPath...var/folders/0b/tpgpsvx54f9g1gg1lkq45syc0000gp/T/pytest-of-nc/pytest-164'), _retention_count=3, _retention_policy='all')
worker_id = 'master'

    @pytest.fixture(params=_get_backends_to_test(), scope="session")
    def backend(request, data_dir, tmp_path_factory, worker_id) -> BackendTest:
        """Return an instance of BackendTest, loaded with data."""

        cls = _get_backend_conf(request.param)
>       return cls.load_data(data_dir, tmp_path_factory, worker_id)

ibis/backends/conftest.py:434: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ibis/backends/tests/base.py:158: in load_data
    inst = cls(data_dir=data_dir, tmpdir=tmpdir, worker_id=worker_id, **kw)
ibis/backends/tests/base.py:100: in __init__
    self.connection = self.connect(tmpdir=tmpdir, worker_id=worker_id, **kw)
ibis/backends/exasol/tests/conftest.py:49: in connect
    return ibis.exasol.connect(
ibis/__init__.py:107: in connect
    return backend.connect(*args, **kwargs)
ibis/backends/base/__init__.py:864: in connect
    new_backend.reconnect()
ibis/backends/base/__init__.py:874: in reconnect
    self.do_connect(*self._con_args, **self._con_kwargs)
ibis/backends/exasol/__init__.py:87: in do_connect
    self.con = pyexasol.connect(
.venv/lib/python3.11/site-packages/pyexasol/__init__.py:68: in connect
    return ExaConnection(**kwargs)
.venv/lib/python3.11/site-packages/pyexasol/connection.py:184: in __init__
    self._init_ws()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ExaConnection session_id= dsn=localhost:8563 user=sys>

    def _init_ws(self):
        """
        Init websocket connection
        Connection redundancy is supported
        Specific Exasol node is randomly selected for every connection attempt
        """
        dsn_items = self._process_dsn(self.options['dsn'])
        failed_attempts = 0

        ws_prefix = 'wss://' if self.options['encryption'] else 'ws://'
        ws_options = self._get_ws_options()

        for hostname, ipaddr, port, fingerprint in dsn_items:
            self.logger.debug(f"Connection attempt [{ipaddr}:{port}]")

            # Use correct hostname matching IP address for each connection attempt
            if self.options['encryption']:
                ws_options['sslopt']['server_hostname'] = hostname

            try:
                self._ws = websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)
            except Exception as e:
                self.logger.debug(f'Failed to connect [{ipaddr}:{port}]: {e}')

                failed_attempts += 1

                if failed_attempts == len(dsn_items):
>                   raise ExaConnectionFailedError(self, 'Could not connect to Exasol: ' + str(e))
E                   pyexasol.exceptions.ExaConnectionFailedError: 
E                   (
E                       message     =>  Could not connect to Exasol: [Errno 61] Connection refused
E                       dsn         =>  localhost:8563
E                       user        =>  sys
E                       schema      =>  
E                       session_id  =>  
E                   )

.venv/lib/python3.11/site-packages/pyexasol/connection.py:674: ExaConnectionFailedError

I am on an M1 mac, is that an issue? Am I doing something else wrong?

cpcloud commented 4 months ago

You won't be able to run every service on macos. Some of these services use Linux-specific APIs.

gforsyth commented 4 months ago

Am I doing something ... wrong?

I am on an M1 mac

So yeah, as Phillip mentioned, not all of these work out of the box on mac. I believe @ncclementi has at least some of them working using colima and is planning to write up some instructions on how to do it (not sure if that will cover exasol or not, but maybe?)

NickCrews commented 4 months ago

OK, thanks. Do you think either of you could finish this PR up then in the meantime?

ncclementi commented 4 months ago

I got the postgress one up and running, and I was able to run tests. But I was trying to see if I could get up the exasol container to run the tests but I didn't make it pass. I couldn't get the container up and healthy.

I tried also using colima's virtualization but I get

container ibis-exasol-1 is unhealthy
error: Recipe `up` failed on line 101 with exit code 1

When I check the logs I get

[2024-02-14 20:54:39] child 757 (membership) returned with state 1.
Wed Feb 14 20:54:42 CET 2024: start script '/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/sbin/cos_sync_files'.
Wed Feb 14 20:54:42 CET 2024: pack files for sync.
exa/etc/dwad/db_DB1_ca.pem
exa/etc/dwad/db_DB1_cert.pem
exa/etc/dwad/db_DB1_key.pem
No nodes to synchronize.
Wed Feb 14 20:54:46 CET 2024: start script '/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/sbin/cos_sync_files'.
Wed Feb 14 20:54:46 CET 2024: pack files for sync.
exa/etc/dwad/db_DB1.conf
No nodes to synchronize.
Traceback (most recent call last):
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 132, in <module>
    main(sys.argv)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 124, in main
    run(args[1])
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 92, in run
    serv.run_result = serv.run()
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 135, in run
    self._db_setup_and_start (db, dbname, dbpars, True)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 117, in _db_setup_and_start
    self.start(db)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 84, in <lambda>
    return lambda db, *a, **kw: self._wrappercall(getattr(db, name), *a, **kw)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 90, in _wrappercall
    raise RuntimeError("Could not call function %s(*%s, **%s): %s,%s" % (repr(func), repr(args), repr(kw), repr(ret), repr(val)))
RuntimeError: Could not call function <built-in method start of libDWAd_py3.exa_db object at 0x7fe863c5a830>(*(), **{}): 1,'Could not start database: system does not have enough active nodes or DWAd was not able to create startup parameters for system'
[2024-02-14 20:54:48] root child 235 (exainit.py) returned with state 1.

any idea what is wrong?

Maybe this is because exasol only supports linux, and their image says https://github.com/EXASOL/docker-db?tab=readme-ov-file#host-os

NickCrews commented 4 months ago

hmm, it seems to me that MacOS is the most likely cause of exasol not working, and doesn't look like there are easy workarounds.

ncclementi commented 4 months ago

Yes, it seems people have been asking for a while for support of the exasol image for M1 but no answers. see https://github.com/exasol/docker-db/issues/63

I bypass the error above providing more memory but the container is always "unhealthy", I hit this same error https://github.com/exasol/docker-db/issues/77 (it's been open for a while) and not sure what's the reason behind.

cpcloud commented 4 months ago

The amount of additional code generated here doesn't seem worth the trouble.

For Oracle we end up generating this monstrosity to capitalize the empty string:

SELECT
  CASE
    WHEN LENGTH('') = 0
    THEN ''
    ELSE CASE
      WHEN UPPER(
        CASE
          WHEN (
            0 + 1
          ) >= 1
          THEN SUBSTR('', 0 + 1, 1)
          ELSE SUBSTR('', 0 + 1 + LENGTH(''), 1)
        END
      ) IS NULL
      OR LOWER(
        CASE
          WHEN (
            1 + 1
          ) >= 1
          THEN SUBSTR('', 1 + 1, CASE WHEN (
            LENGTH('') - 1
          ) < 0 THEN 0 ELSE LENGTH('') - 1 END)
          ELSE SUBSTR(
            '',
            1 + 1 + LENGTH(''),
            CASE WHEN (
              LENGTH('') - 1
            ) < 0 THEN 0 ELSE LENGTH('') - 1 END
          )
        END
      ) IS NULL
      THEN NULL
      ELSE CONCAT(
        UPPER(
          CASE
            WHEN (
              0 + 1
            ) >= 1
            THEN SUBSTR('', 0 + 1, 1)
            ELSE SUBSTR('', 0 + 1 + LENGTH(''), 1)
          END
        ),
        LOWER(
          CASE
            WHEN (
              1 + 1
            ) >= 1
            THEN SUBSTR('', 1 + 1, CASE WHEN (
              LENGTH('') - 1
            ) < 0 THEN 0 ELSE LENGTH('') - 1 END)
            ELSE SUBSTR(
              '',
              1 + 1 + LENGTH(''),
              CASE WHEN (
                LENGTH('') - 1
              ) < 0 THEN 0 ELSE LENGTH('') - 1 END
            )
          END
        )
      )
    END
  END AS "Capitalize('')"

And it's still not correct.

cpcloud commented 4 months ago

Alright, it turns out the mishandling of the empty string is reproducible in the oracledb driver alone. I opened https://github.com/oracle/python-oracledb/issues/298 for that.

cpcloud commented 4 months ago

Gosh, what a hellscape this problem is.

NickCrews commented 4 months ago

You're a hero @cpcloud ! Thank you!