sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License

Internal _read_sql function hangs for PostgreSQL 13 via bouncer in GCloud #123

Open castarco opened 3 years ago

castarco commented 3 years ago

Problem:

_read_sql (used via polars) ends up hanging in my production system (described in the Context section below), while it works perfectly fine in my local environment (macOS, PostgreSQL 13, Python 3.9.7).

I couldn't step past that function: I guess it's implemented in Rust, so pdb is obviously unable to go inside it.

Regarding the DB state, it's arguably the same in both environments (local dev and production): I run the same set of deterministic migrations, and the DB is pretty new, so there are no differences in the contained data either.

Context:

My pdb session hangs on the last line; you can see the values of some variables I displayed:

(Pdb) b /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:122
Breakpoint 1 at /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:122
(Pdb) b /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:128
Breakpoint 2 at /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py:128
(Pdb) c
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
2021-10-01 09:16:55.748 | INFO     | __main__:xxx - Bunch of irrelevant logs
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(122)read_sql()
-> elif return_type in {"arrow", "polars"}:
(Pdb) display queries
display queries: ['SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id \nFROM jobs \nWHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id']
(Pdb) display protocol
display protocol: 'binary'
(Pdb) display partition_query
display partition_query: None
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(123)read_sql()
-> try:
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(124)read_sql()
-> import pyarrow
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(128)read_sql()
-> result = _read_sql(
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(129)read_sql()
-> conn,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(130)read_sql()
-> "arrow",
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(131)read_sql()
-> queries=queries,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(132)read_sql()
-> protocol=protocol,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(133)read_sql()
-> partition_query=partition_query,
(Pdb) s
> /usr/local/src/app/.venv/lib/python3.9/site-packages/connectorx/__init__.py(128)read_sql()
-> result = _read_sql(
(Pdb) s
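
Since pdb cannot step into the native extension, one stdlib option for a hang like this (an editorial sketch, not something ConnectorX provides) is to arm a faulthandler watchdog before the suspect call. If the call does not return within the budget, every Python thread's stack is dumped, showing the last Python frame that entered the extension (Rust frames won't appear):

```python
import faulthandler
import sys

def call_with_stack_watchdog(fn, timeout_s=30, out=sys.stderr):
    """Run fn(); if it is still running after timeout_s seconds,
    dump every Python thread's stack to `out`, then keep waiting.
    The 30-second default is an arbitrary budget."""
    faulthandler.dump_traceback_later(timeout_s, file=out)
    try:
        return fn()
    finally:
        # Disarm the watchdog once the call returns (or raises).
        faulthandler.cancel_dump_traceback_later()
```

Usage would look like `call_with_stack_watchdog(lambda: cx.read_sql(url, query, return_type="polars"))`, where `cx`, `url`, and `query` are the ones from the script above.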

Discussion

I'd be happy to keep helping to pinpoint the root cause of this problem. I have some experience with Rust, but I've never worked on Rust+Python projects. If there's anything I can do to debug the problem in more depth, I'll be thankful for the information.

wangxiaoying commented 3 years ago

@castarco thanks for the report and the details! I'm not sure from the pdb output at which stage the function hangs. Can you update connectorx to the newest alpha version via pip install -U connectorx==0.2.1a2 and set the RUST_LOG environment variable before importing the library, like in the following example:

import os
os.environ["RUST_LOG"]="connectorx=debug,connectorx_python=debug"
import connectorx as cx

df = cx.read_sql(postgres_url, query, return_type="polars")

There should be some output logs from connectorx and connectorx_python. Can you show them here?

castarco commented 3 years ago

@wangxiaoying After updating the library to the current alpha version and setting the environment variable, I get the following logs once it enters the function, and it hangs after the last line is printed:

[2021-10-02T19:56:00Z DEBUG connectorx_python::arrow] Protocol: binary
[2021-10-02T19:56:00Z DEBUG connectorx_python::arrow] Running dispatcher
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Fetching metadata
[2021-10-02T19:56:00Z DEBUG connectorx::sql] Transformed limit 1 query: SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id LIMIT 1
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Prepare partitions
[2021-10-02T19:56:00Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true) AS CXTMPTAB_COUNT
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Allocate destination memory
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Create destination partition
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Partition 0, 298x10
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Start writing
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Finalize partition 0
[2021-10-02T19:56:00Z DEBUG connectorx::dispatcher] Partition 0 finished
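
(Editor's note: the "Transformed limit 1 query" and "Transformed count query" lines above come from ConnectorX rewriting the user query up front, to fetch the result schema from one row and to count the rows for allocation. A rough string-level sketch of that rewriting, not ConnectorX's actual implementation, which works on parsed SQL:)

```python
def limit1_query(query: str) -> str:
    # Fetch a single row so the result schema can be inspected.
    return f"{query} LIMIT 1"

def count_query(query: str) -> str:
    # Count the rows of the original query; ORDER BY is irrelevant
    # for a count, so it is dropped before wrapping.
    order_at = query.upper().rfind("ORDER BY")
    if order_at != -1:
        query = query[:order_at].rstrip()
    return f"SELECT count(*) FROM ({query}) AS CXTMPTAB_COUNT"
```
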
dovahcrow commented 3 years ago

It seems to me that the destructor is stuck at https://github.com/sfu-db/connector-x/blob/main/connectorx/src/dispatcher.rs#L153, because line 155, debug!("Writing finished");, is never printed.

To narrow it down, could you please help us try df = cx.read_sql(postgres_url, query, return_type="pandas") to see whether it is the source (Postgres) or the destination (Arrow) that hangs the whole process?

castarco commented 3 years ago

This is the output for the experiment you proposed (using "pandas" as the return type instead of "arrow" or "polars"); it is the same output as before:

[2021-10-03T06:48:09Z DEBUG connectorx_python::pandas] Protocol: binary
[2021-10-03T06:48:09Z DEBUG connectorx_python::pandas] Running dispatcher
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Fetching metadata
[2021-10-03T06:48:09Z DEBUG connectorx::sql] Transformed limit 1 query: SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id LIMIT 1
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Prepare partitions
[2021-10-03T06:48:09Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id FROM jobs WHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true) AS CXTMPTAB_COUNT
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Allocate destination memory
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Create destination partition
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Partition 0, 298x10
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Start writing
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Finalize partition 0
[2021-10-03T06:48:09Z DEBUG connectorx::dispatcher] Partition 0 finished

Here is the script I used to reproduce the problem:

import connectorx as cx

url = "postgresql://blablablah"

query = "SELECT jobs.job_id, jobs.is_remote, jobs.employment_type_id, jobs.language_id, jobs.category_id, jobs.country_id, jobs.industry_id, jobs.contract_type_id, jobs.function_level_id, jobs.number_of_employees_id \nFROM jobs \nWHERE jobs.target_market_id = 1 AND jobs.has_premium_orders = true ORDER BY jobs.job_id"

df = cx.read_sql(url, query, return_type="pandas")

print(df)
dovahcrow commented 3 years ago

@castarco Thanks. This helps us narrow the bug down to either rayon or Postgres.
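
(Editor's note: while the bug is being narrowed down, experiments like the ones above can be made fail-fast instead of blocking the whole session. A minimal stdlib sketch; run_with_deadline is a hypothetical helper, not part of ConnectorX:)

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_deadline(fn, seconds):
    """Run fn() in a worker thread and give up waiting after `seconds`.
    Returns (finished, result); (False, None) signals a likely hang.
    Note: the worker thread itself cannot be killed, so this is a
    tool for experimenting, not a production workaround."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        try:
            return True, future.result(timeout=seconds)
        except TimeoutError:
            return False, None
    finally:
        # Don't block on the (possibly hung) worker thread.
        pool.shutdown(wait=False)
```

Used e.g. as run_with_deadline(lambda: cx.read_sql(url, query, return_type="pandas"), 30), with the url and query from the reproduction script.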