sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License

Internal _read_sql function hangs for PostgreSQL 13 via bouncer in GCloud #123

Open castarco opened 3 years ago

castarco commented 3 years ago

Problem:

_read_sql (used via polars) ends up hanging in my production system (described in the Context section below), while it works perfectly fine in my local environment (macOS, PostgreSQL 13, Python 3.9.7).

I couldn't step past that function: I guess it's implemented in Rust, so pdb is obviously unable to go inside it.

Regarding the DB state, it's arguably the same in both environments (local dev and production): I run the same set of deterministic migrations, and the DB is pretty new, so there are no differences in the contained data either.

Context:

My pdb session hangs on the last line; you can see the values of some variables I displayed:

(Pdb) b /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:122
Breakpoint 1 at /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:122
(Pdb) b /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:128
Breakpoint 2 at /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:128
(Pdb) c
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(122)read_sql()
-> elif return_type in {"arrow", "polars"}:
(Pdb) display queries
display queries: ['SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id \nFROM jobs \nWHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id']
(Pdb) display protocol
display protocol: 'binary'
(Pdb) display partition_query
display partition_query: None
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(123)read_sql()
-> try:
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(124)read_sql()
-> import pyarrow
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(128)read_sql()
-> result = _read_sql(
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(129)read_sql()
-> conn,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(130)read_sql()
-> "arrow",
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(131)read_sql()
-> queries=queries,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(132)read_sql()
-> protocol=protocol,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(133)read_sql()
-> partition_query=partition_query,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(128)read_sql()
-> result = _read_sql(
(Pdb) s
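
Since pdb cannot step into the native extension, one stdlib option for a hang like this (an editorial sketch, not something ConnectorX provides) is to arm a faulthandler watchdog before the suspect call. If the call does not return within the budget, every Python thread's stack is dumped, showing the last Python frame that entered the extension (Rust frames won't appear):

```python
import faulthandler
import sys

def call_with_stack_watchdog(fn, timeout_s=30, out=sys.stderr):
    """Run fn(); if it is still running after timeout_s seconds,
    dump every Python thread's stack to `out`, then keep waiting.
    The 30-second default is an arbitrary budget."""
    faulthandler.dump_traceback_later(timeout_s, file=out)
    try:
        return fn()
    finally:
        # Disarm the watchdog once the call returns (or raises).
        faulthandler.cancel_dump_traceback_later()
```

Usage would look like `call_with_stack_watchdog(lambda: cx.read_sql(url, query, return_type="polars"))`, where `cx`, `url`, and `query` are the ones from the script above.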

Discussion

I'd be happy to keep helping to pinpoint the root cause of this problem. I have some experience with Rust, but I've never worked on Rust+Python projects. If there's anything I can do to debug the problem in more depth, I'll be thankful for the information.

wangxiaoying commented 3 years ago

@castarco thanks for the report and the details! I'm not sure from the pdb output at which stage the function hangs. Can you update connectorx to the newest alpha version via pip install -U connectorx==0.2.1a2 and set the RUST_LOG environment variable before importing the library, like in the following example:

import os
os.environ["RUST_LOG"]="connectorx=debug,connectorx_python=debug"
import connectorx as cx

df = cx.read_sql(postgres_url, query, return_type="polars")

There should be some output logs from connectorx and connectorx_python. Can you show them here?

castarco commented 3 years ago

@wangxiaoying After updating the library to the current alpha version and setting the environment variable, I get the following logs once it enters the function, and it hangs after the last line is printed:

[2021-10-02T19:56:00Z DEBUG connectorx_python::arrow] Protocol: binary
[2021-10-02T19:56:00Z DEBUG connectorx_python::arrow] Running dispatcher
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Fetching metadata
[2021-10-02T19:56:00Z DEBUG connectorx::sql] Transformed limit 1 query: SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id LIMIT 1
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Prepare partitions
[2021-10-02T19:56:00Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true) AS CXTMPTAB_COUNT
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Allocate destination memory
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Create destination partition
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Partition 0, 298x10
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Start writing
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Finalize partition 0
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Partition 0 finished
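
(Editor's note: the "Transformed limit 1 query" and "Transformed count query" lines above come from ConnectorX rewriting the user query up front, to fetch the result schema from one row and to count the rows for allocation. A rough string-level sketch of that rewriting, not ConnectorX's actual implementation, which works on parsed SQL:)

```python
def limit1_query(query: str) -> str:
    # Fetch a single row so the result schema can be inspected.
    return f"{query} LIMIT 1"

def count_query(query: str) -> str:
    # Count the rows of the original query; ORDER BY is irrelevant
    # for a count, so it is dropped before wrapping.
    order_at = query.upper().rfind("ORDER BY")
    if order_at != -1:
        query = query[:order_at].rstrip()
    return f"SELECT count(*) FROM ({query}) AS CXTMPTAB_COUNT"
```
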
dovahcrow commented 3 years ago

It seems to me that the destructor is stuck at https://github.com/sfu-db/connector-x/blob/main/connectorx/src/dispatcher.rs#L153, because line 155, debug!("Writing finished");, is never printed.

To narrow it down, could you please help us try df = cx.read_sql(postgres_url, query, return_type="pandas") to see whether it is the source (Postgres) or the destination (Arrow) that hangs the whole process?

castarco commented 3 years ago

This is the output for the experiment you proposed (using "pandas" as the return type instead of "arrow" or "polars"); it is the same output as before:

[2021-10-03T06:48:09Z DEBUG connectorx_python::pandas] Protocol: binary
[2021-10-03T06:48:09Z DEBUG connectorx_python::pandas] Running dispatcher
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Fetching metadata
[2021-10-03T06:48:09Z DEBUG connectorx::sql] Transformed limit 1 query: SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id LIMIT 1
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Prepare partitions
[2021-10-03T06:48:09Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true) AS CXTMPTAB_COUNT
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Allocate destination memory
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Create destination partition
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Partition 0, 298x10
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Start writing
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Finalize partition 0
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Partition 0 finished

Here is the script I used to reproduce the problem:

import connectorx as cx

url = "postgresql://blablablah"

query = "SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id \nFROM jobs \nWHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id"

df = cx.read_sql(url, query, return_type="pandas")

print(df)
dovahcrow commented 3 years ago

@castarco Thanks. This helps us narrow the bug down to either rayon or Postgres.
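
(Editor's note: while the bug is being narrowed down, experiments like the ones above can be made fail-fast instead of blocking the whole session. A minimal stdlib sketch; run_with_deadline is a hypothetical helper, not part of ConnectorX:)

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_deadline(fn, seconds):
    """Run fn() in a worker thread and give up waiting after `seconds`.
    Returns (finished, result); (False, None) signals a likely hang.
    Note: the worker thread itself cannot be killed, so this is a
    tool for experimenting, not a production workaround."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        try:
            return True, future.result(timeout=seconds)
        except TimeoutError:
            return False, None
    finally:
        # Don't block on the (possibly hung) worker thread.
        pool.shutdown(wait=False)
```

Used e.g. as run_with_deadline(lambda: cx.read_sql(url, query, return_type="pandas"), 30), with the url and query from the reproduction script.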