lovasoa / SQLpage

SQL-only webapp builder, empowering data analysts to build websites and applications quickly
https://sql.ophir.dev
MIT License

Server stalls indefinitely #189

Closed: lorefnon closed this issue 5 months ago

lorefnon commented 6 months ago

Introduction

I have unfortunately not been able to arrive at a clear minimal reproducer (I will keep trying over the next couple of days), but I frequently encounter a scenario where the server stops responding and all requests get stuck in a pending state. SQLPage becomes completely unusable for all pages/users until the server is restarted.

I do not see any errors in the logs, and memory/CPU usage stays low (~9.5 MB, ~0%) and constant when/after this happens.

Some of our queries are expensive, and we don't mind those pages taking a long time while we iterate on the final outcome (I am able to run the queries from psql), but it would be ideal if this didn't impact the availability of the server as a whole.

Screenshots

(screenshot: all requests stuck in the pending state)

Expected behavior

Server should stay functional.

Version information

lovasoa commented 6 months ago

This is concerning! Did you try running SQLPage in debug mode (RUST_LOG=debug) to see where exactly it stalls?

You can use RUST_LOG=sqlpage=trace to get even more detailed logging, but only from sqlpage itself, not from its dependencies.

lorefnon commented 6 months ago

I had tried debug before, but it was not informative. After enabling trace I could see a message about the pg pool getting exhausted:

[2024-01-15T11:32:47.926Z TRACE sqlpage::webserver::http] Received item from database: Error(Unable to acquire a database connection to execute the SQL file. All of the 6 Postgres connections are busy.

    Caused by:
        pool timed out while waiting for an open connection)
[2024-01-15T11:32:47.927Z ERROR sqlpage::render] SQL error: Unable to acquire a database connection to execute the SQL file. All of the 6 Postgres connections are busy.

    Caused by:
        pool timed out while waiting for an open connection

This was strange because:

So I increased the pool size to 100 in sqlpage.json and this message disappeared. However, the problem remains: after a few dozen refreshes the server hangs, after which nothing gets logged. I can reproduce this locally too, with a locally running Postgres, so it is likely not related to RDS or my front-proxy config.

lovasoa commented 6 months ago

Where exactly does it stall? You don't even see the initial log about the incoming HTTP connection?

What do you see on the database side? How many connections do you see open, and are they busy?

lorefnon commented 6 months ago

Where exactly does it stall? You don't even see the initial log about the incoming HTTP connection?

It seems to be getting blocked on fs cache population.

E.g., the logs for the last handled request look like this:

[2024-01-15T14:03:28.221Z DEBUG sqlpage::webserver::http] Processing SQL request: "bi/top-sold-products.sql"
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] end timers:
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   head timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   keep-alive timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   shutdown timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] end flags: Flags(STARTED)
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] start flags: Flags(STARTED)
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] start timers:
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   head timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   keep-alive timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   shutdown timer is inactive
[2024-01-15T14:03:28.221Z TRACE sqlpage::file_cache] Cache answer with filesystem metadata read for "bi/top-sold-products.sql"
[2024-01-15T14:03:28.221Z DEBUG sqlpage::webserver::http_request_info] Not parsing POST data from request without known content type
[2024-01-15T14:03:28.221Z DEBUG sqlpage::webserver::http] Received a request with the following parameters: RequestInfo { path: "/bi/top-sold-products.sql", protocol: "http", get_variables: {"romanized_state": Single(""), "limit": Single("20"), "produce_type_id": Single(""), "romanized_district": Single(""), "end_date": Single("2023-12-31"), "user_id": Single(""), "produce_type_category_id": Single(""), "_sqlpage_embed": Single(""), "romanized_city_or_village": Single(""), "pin_code": Single(""), "start_date": Single("2023-12-01")}, post_variables: {}, uploaded_files: {}, headers: {"sec-ch-ua": Single("\"Chromium\";v=\"112\", \"Google Chrome\";v=\"112\", \"Not:A-Brand\";v=\"99\""), "sec-fetch-dest": Single("empty"), "sec-fetch-mode": Single("cors"), "sec-ch-ua-mobile": Single("?0"), "accept-language": Single("en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7"), "host": Single("localhost:9898"), "connection": Single("keep-alive"), "sec-ch-ua-platform": Single("\"Linux\""), "accept-encoding": Single("gzip, deflate, br"), "accept": Single("*/*"), "user-agent": Single("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"), "sec-fetch-site": Single("same-origin"), "referer": Single("http://localhost:9898/bi/?start_date=2023-12-01&end_date=2023-12-31&perf_tab=seller"), "dnt": Single("1")}, client_ip: Some(127.0.0.1), cookies: {}, basic_auth: None, app_state: AppState }
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] end timers:
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   head timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   keep-alive timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher]   shutdown timer is inactive
[2024-01-15T14:03:28.221Z TRACE actix_http::h1::dispatcher] end flags: Flags(STARTED)
[2024-01-15T14:03:28.221Z DEBUG sqlpage::render] Handling header row: {"component":"chart","type":"bar","title":"Top Sold Products","horizontal":true,"xtitle":"Percent","ytitle":"Product","toolbar":true,"height":"auto","xticks":5}
[2024-01-15T14:03:28.222Z DEBUG sqlpage::render] Creating the shell component for the page
[2024-01-15T14:03:28.222Z TRACE sqlpage::file_cache] File "sqlpage/templates/fragment-shell.handlebars" not found, loading it from static files instead.
[2024-01-15T14:03:28.222Z TRACE sqlpage::file_cache] Loaded File "sqlpage/templates/fragment-shell.handlebars" from static files.

After this nothing gets logged for any incoming request.

What do you see on the database side? How many connections do you see open, and are they busy?

On the database side, pg_stat_activity says that all of the connections are idle:

select pid as process_id, 
       usename as username, 
       datname as database_name, 
       client_addr as client_address, 
       application_name,
       backend_start,
       state,
       state_change
from pg_stat_activity
where datname = 'test_dev2';
process_id|username|database_name|client_address|application_name                     |backend_start                |state |state_change                 |
----------+--------+-------------+--------------+-------------------------------------+-----------------------------+------+-----------------------------+
     32806|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:31:14.073 +0530|idle  |2024-01-15 19:33:28.187 +0530|
     32935|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:21.616 +0530|idle  |2024-01-15 19:33:26.878 +0530|
     32939|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.423 +0530|idle  |2024-01-15 19:33:26.775 +0530|
     11378|postgres|test_dev2    |127.0.0.1     |DBeaver 23.3.1 - Main <test_dev2>    |2024-01-15 17:04:44.332 +0530|idle  |2024-01-15 17:04:48.874 +0530|
     32940|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.484 +0530|idle  |2024-01-15 19:33:26.681 +0530|
     32954|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.799 +0530|idle  |2024-01-15 19:33:26.522 +0530|
     32950|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.728 +0530|idle  |2024-01-15 19:33:26.479 +0530|
     11380|postgres|test_dev2    |127.0.0.1     |DBeaver 23.3.1 - Metadata <test_dev2>|2024-01-15 17:04:44.576 +0530|idle  |2024-01-15 19:27:06.698 +0530|
     31894|postgres|test_dev2    |127.0.0.1     |DBeaver 23.3.1 - SQLEditor <Console> |2024-01-15 19:21:02.729 +0530|active|2024-01-15 19:35:59.548 +0530|
     32955|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.838 +0530|idle  |2024-01-15 19:33:26.655 +0530|
     32956|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:22.877 +0530|idle  |2024-01-15 19:33:26.616 +0530|
     32958|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:24.508 +0530|idle  |2024-01-15 19:33:28.188 +0530|
     32960|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:25.014 +0530|idle  |2024-01-15 19:33:26.825 +0530|
     32961|postgres|test_dev2    |127.0.0.1     |                                     |2024-01-15 19:33:25.533 +0530|idle  |2024-01-15 19:33:26.723 +0530|

(The only active connection is the DBeaver SQL editor console.)

The number of connections is not the same as the configured max limit (100), which does seem to be getting picked up:

[2024-01-15T14:01:14.070Z DEBUG sqlpage] Starting with the following configuration: AppConfig {
        database_url: "postgres://postgres:password@localhost:5432/test_dev2",
        max_database_pool_connections: Some(
            100,
        ),
        database_connection_idle_timeout_seconds: None,
        database_connection_max_lifetime_seconds: None,
        sqlite_extensions: [],
        listen_on: Some(
            0.0.0.0:9898,
        ),
        port: None,
        database_connection_retries: 6,
        database_connection_acquire_timeout_seconds: 10.0,
        web_root: "/home/lorefnon/Workspace/.../modules/reports/out",
        allow_exec: false,
        max_uploaded_file_size: 5242880,
        https_domain: None,
        https_certificate_email: None,
        https_certificate_cache_dir: "/home/lorefnon/Workspace/.../sqlpage/https",
        https_acme_directory_url: "https://acme-v02.api.letsencrypt.org/directory",
        environment: Development,
    }

lorefnon commented 6 months ago

If I comment out the cache.insert(..) call here, this no longer happens.

lovasoa commented 6 months ago

Can you try to make a small docker compose reproduction, or add more trace logging to understand where it stalls? It would be strange for the hashmap insertion to be what hangs.

lorefnon commented 6 months ago

Potentially relevant: https://github.com/xacrimon/dashmap/issues/233
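
For context, here is a minimal, self-contained sketch (hypothetical code, not SQLPage's, assuming the dashmap and tokio crates) of the deadlock pattern that issue describes: a Ref returned by get() keeps a shard read lock alive across an .await, and a concurrent insert() on the same shard then blocks the runtime thread it runs on, so neither task can make progress.

use std::sync::Arc;
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache: Arc<dashmap::DashMap<String, String>> = Arc::new(dashmap::DashMap::new());
    cache.insert("a.sql".into(), "select 1;".into());

    let reader = {
        let cache = cache.clone();
        tokio::spawn(async move {
            let entry = cache.get("a.sql").unwrap(); // shard read lock acquired here...
            tokio::time::sleep(Duration::from_millis(50)).await; // ...and still held across this await
            println!("read: {}", entry.value());
        })
    };

    let writer = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        // insert() needs the shard write lock and blocks the (only) runtime
        // thread while waiting for it, so the reader can never resume and
        // release its read lock: a deadlock.
        cache.insert("a.sql".into(), "select 2;".into());
    });

    let _ = tokio::join!(reader, writer);
    println!("never reached when the tasks deadlock");
}

With a single worker thread the hang is immediate; with several workers it would presumably only surface once every worker thread ends up blocked this way, which is consistent with the server surviving a few dozen requests before stalling.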

lovasoa commented 6 months ago

Interesting! Can you try replacing the dashmap with a simple RwLock<HashMap<...>> and report whether this fixes the issue? If so, can you make it a PR?

lorefnon commented 6 months ago

A std::sync::RwLock-based impl was getting stuck too, but using tokio::sync::RwLock seems to resolve the problem. This post was my motivation for trying out the tokio impl.

PR: https://github.com/lovasoa/SQLpage/pull/193
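
A rough sketch of the shape of that fix (simplified, with hypothetical names rather than the PR's actual code): tokio::sync::RwLock suspends a contended task instead of blocking the runtime thread, and the guards below are temporaries that are dropped before any filesystem await.

use std::collections::HashMap;
use tokio::sync::RwLock;

// Hypothetical simplified cache; the real FileCache stores parsed SQL files.
struct FileCache {
    files: RwLock<HashMap<String, String>>,
}

impl FileCache {
    async fn get_or_load(&self, path: &str) -> std::io::Result<String> {
        // Read guard is a temporary: dropped at the end of the if-let, before any await below.
        if let Some(cached) = self.files.read().await.get(path) {
            return Ok(cached.clone());
        }
        // No lock held during the filesystem read.
        let contents = tokio::fs::read_to_string(path).await?;
        // Write guard is acquired and released within a single statement.
        self.files
            .write()
            .await
            .insert(path.to_string(), contents.clone());
        Ok(contents)
    }
}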

lovasoa commented 5 months ago

Great that the issue is fixed. However, looking at the code, it seems that we don't really need to hold the lock across an async boundary, do we?

It happens because of the if let, but we could cheaply clone the Cached object and immediately release the lock, couldn't we?

I'm a bit hesitant to change the code again if I cannot reproduce the original issue locally. Do you think you could add a test in https://github.com/lovasoa/SQLpage/tree/main/tests that breaks with the old implementation and passes with the new one?

lorefnon commented 5 months ago

Do you think you could add a test in https://github.com/lovasoa/SQLpage/tree/main/tests that breaks with the old implementation and passes with the new one?

I am not sure I will be able to do this, but I'll try over the next week. My familiarity with both Rust and concurrent systems is quite nascent.

looking at the code, it seems that we don't really need to hold the lock across an async boundary...

Yes. To validate this, I attempted a workaround that retains dashmap but drops the ref obtained from the cache before the async call (which checks the timestamp on the filesystem) and reloads it from the cache afterwards. This too seems to solve the problem.

Maybe it's just that I am not used to low-level programming, but this kind of limitation in a library appears to be a footgun in async-heavy code.

Nevertheless, if you prefer to retain dashmap, here is an updated PR that keeps it and appears to work fine without stalling.
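
As a rough sketch of that dashmap-retaining approach (hypothetical names and types, not the PR's actual code, assuming the dashmap and tokio crates): clone what you need out of the entry so the Ref and its shard lock are dropped before the async filesystem check, then look the entry up again afterwards.

#[derive(Clone)]
struct CachedFile {
    contents: String,
    loaded_at: std::time::SystemTime,
}

async fn cached_if_fresh(
    cache: &dashmap::DashMap<String, CachedFile>,
    path: &str,
) -> Option<CachedFile> {
    // The temporary Ref returned by get() is dropped at the end of this
    // statement, so no shard lock is held across the await below.
    let snapshot = cache.get(path)?.value().clone();

    // Async filesystem check, performed with no lock held.
    let modified = tokio::fs::metadata(path).await.ok()?.modified().ok()?;
    if modified > snapshot.loaded_at {
        return None; // stale: the caller reloads the file and insert()s the new version
    }

    // Look the entry up again after the await; another request may have replaced it.
    cache.get(path).map(|entry| entry.value().clone())
}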