sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License
2.02k stars 163 forks source link

MySQL TCP port argument is ignored #695

Open brian-pond opened 1 month ago

brian-pond commented 1 month ago

What language are you using?

Python 3.11.1

What version are you using?

connectorx 0.3.3

What database are you using?

MariaDB 10.11.6

What dataframe are you using?

Arrow

Can you describe your bug?

Environment Configuration

My host environment has 2 different MariaDB server instances. Each is bound to a different socket:

I have triple-checked and tested all credentials and tables. I did this using the standard mysql CLI tool that ships with the server software.

Code

My goal is to query Instance B and create a dataframe from its table "foo"

I have the following Python code:

sql_query = "SELECT name FROM `foo`"
db_uri = "mysql://user:password@127.0.0.1:9999/mydatabase"
arrow_table = connectorx.read_sql(db_uri, sql_query, return_type="arrow")

Scenario 1: Both servers online and active

Unexpectedly, my Python code fails with an error: :confused:

RuntimeError: MySqlError { ERROR 1146 (42S02): Table 'mydatabase.foo' doesn't exist }

This error should not happen. My database on TCP port 9999 definitely exists, and definitely has a table "foo". After much testing and troubleshooting, I thought to try another scenario.

Scenario 2: What if I shutdown -only- Instance B?

I completely shut down my Instance B. Nothing is bound to TCP 9999 anymore. I verify that with mysql CLI tool.

I try my code again. The result is a different (but expected) error:

    [2024-10-08T20:17:22Z ERROR r2d2] DriverError { Could not connect to address `127.0.0.1:9999': Connection refused (os error 111) }

This error ^ is a Good Thing. It's what I expected. My database instance is offline, so connectorx should definitely fail.

Scenario 3: What if I shutdown -only- Instance A?

I turn Instance B back on. And on a hunch, I shutdown my Instance A. So now nothing is bound to TCP 3306 anymore. This shouldn't make a difference. Because I'm not trying to connect to 3306. My URI clearly states 9999.

db_uri = "mysql://user:password@127.0.0.1:9999/mydatabase"

I try my code again. And it works perfectly. :eyes: :exploding_head:

Conclusion

Based on the above, I have to conclude this is what's happening:

  1. When connectorx is initialized, the software initially tries to establish a connection to the Port specified in your URI connection string. I demonstrated this in Scenario 2.
  2. But then
    • If TCP 3306 is online and available? Then connectorx tries to run your SQL query there. No matter what TCP port you actually specified.
    • If TCP 3306 is not online and available, then connectorx tries to use the port you told it to.
wangxiaoying commented 1 month ago

Hi @brian-pond ,

Thanks for opening the issue and the analyze it. However, I cannot reproduce your bug locally. Here is my test of having two mysql instances on both 3306 and 9999 port concurrently:

image

You can see the two tables are different (one with an additional row), indicating that the results are fetching from different mysql instances.

Can you double check your environment? Or is there anything that I missed?

P.S. We do not parse the url by ourselves in connectorx. Instead, we call from_url provided by the mysql rust client. Another quick check is to try running the rust client directly in your environment and see whether it is still the case. Here is an example of the rust code, you may also need to setup rust beforehand.

brian-pond commented 1 month ago

Weird! The results were so surprising that even after double-checking? I brought my spouse into my office, explained the whole thing, and walked them through each test and scenario, one at a time. Just to make sure I wasn't doing something stupid. I've been able to make it happen consistently. :confused:

Tonight I'll follow your suggestion by installing the mysql rust client, and see what happens.

Here are more details about my environment, though I'm doubtful this makes a big difference.

Host A is running the Python code. It has an SSH tunnel to Host B, so it "sees" Host B MariaDB over local TCP 9999.

Here are the specific Python packages installed in the virtual environment.

cachetools                    5.3.3
certifi                       2024.2.2
charset-normalizer            3.3.2
click                         8.1.6
connectorx                    0.3.3
contextlib2                   21.6.0
cron-converter                1.2.0
cron-descriptor               1.4.3
dw_etl                        0.1.0       /dw/dw_etl
google-api-core               2.19.0
google-auth                   2.29.0
google-cloud-bigquery         3.13.0
google-cloud-bigquery-storage 2.24.0
google-cloud-core             2.4.1
google-crc32c                 1.5.0
google-resumable-media        2.7.0
googleapis-common-protos      1.63.0
greenlet                      3.0.3
grpcio                        1.63.0
grpcio-status                 1.62.2
idna                          3.7
Jinja2                        3.1.3
local-crontab                 0.3.0
mariadb                       1.1.7
MarkupSafe                    2.1.5
numpy                         1.26.4
ordered-set                   4.1.0
packaging                     24.0
pandas                        2.0.3
phpserialize                  1.3
pip                           23.0.1
polars                        0.20.3
proto-plus                    1.23.0
protobuf                      4.25.3
psycopg                       3.1.19
psycopg-binary                3.1.19
psycopg-pool                  3.2.2
psycopg2                      2.9.9
pyarrow                       16.1.0
pyasn1                        0.6.0
pyasn1_modules                0.4.0
python-dateutil               2.9.0.post0
pytz                          2024.1
pytz-deprecation-shim         0.1.0.post0
requests                      2.31.0
rsa                           4.9
schema                        0.7.5
semantic-version              2.10.0
setuptools                    66.1.1
six                           1.16.0
SQLAlchemy                    1.4.50
sqlalchemy-bigquery           1.9.0
temporal-lib                  0.0.7
toml                          0.10.2
typing_extensions             4.11.0
tzdata                        2024.1
tzlocal                       5.2
urllib3                       2.2.1
wheel                         0.43.0
brian-pond commented 1 month ago

I even download the connectorx repo, and performed a global search to see if TCP 3306 was hard-coded anywhere, perhaps as a fallback value.

(it is not. only mentioned in markdown once, and a few times in GitHub Actions workflows)