molybdenumsoftware / pr-tracker

A system for finding out which branches a GitHub pull request landed in

pr-tracker-fetcher `INSERT INTO landings ...` taking a really long time #178

Open jfly opened 2 months ago

jfly commented 2 months ago

Ever since upgrading to v3.0.0, I've been seeing pr-tracker-fetcher take a really long time to insert landings. For example, here's some info about a current run:

[root@clark:~]# ps -efw | grep -i pr-tracker-fetcher
pr-trac+ 3570307       1 70 09:26 ?        00:15:43 /nix/store/lq6bxhmfpd36l488sypd0ci3hzpacb6c-pr-tracker-fetcher-3.0.0/bin/pr-tracker-fetcher
postgres 3570312     778  0 09:26 ?        00:00:01 postgres: pr-tracker-fetcher pr-tracker [local] idle
root     3570970   74562  0 09:26 pts/0    00:00:00 journalctl -u pr-tracker-fetcher -f
root     3630459 3583530  0 09:48 pts/1    00:00:00 grep -i pr-tracker-fetcher

[nix-shell:~]# date -uIs
2024-04-21T16:57:57+00:00

[root@clark:~]# echo "SELECT * from pg_stat_activity where pid=3570312;" | psql --user pr-tracker-fetcher pr-tracker --expanded
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------
datid            | 65938
datname          | pr-tracker
pid              | 3570312
leader_pid       |
usesysid         | 74173
usename          | pr-tracker-fetcher
application_name |
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2024-04-21 16:26:05.897356+00
xact_start       |
query_start      | 2024-04-21 16:26:20.188579+00
state_change     | 2024-04-21 16:26:21.695147+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query_id         |
query            | INSERT INTO landings(github_pr, branch_id) VALUES (257339,1),(305707,1),(305765,1),(256680,1),(236751,1),(305580,1),(256361,1),(305756,1),(296557,1),(301963,1),(254536,1),(256617,1),(295509,1),(305425,1),(305598,1),(255347,1),(253578,1),(256341,1),(305653,1),(304884,1),(305615,1),(305589,1),(256642,1),(254427,1),(253302,1),(256460,1),(305656,1),(303709,1),(305689,1),(305406,1),(305647,1),(305579,1),(256694,1),(254306,1),(254419,1),(256072,1),(253005,1),(256607,1),(305679,1),(301204,1),(304595,1),(305649,1),(303590,1),(305401,1),(305530,1),(303485,1),(249224,1),(254304,1),(249132,1),(254385,1),(251597,1),(255579,1),(305360,1),(281388,1),(305562,1),(303656,1),(304660,1),(303859,1),(303959,1),(299049,1),(305176,1),(305636,1),(305352,1),(254128,1),(255274,1),(225177,1),(254391,1),(254056,1),(254303,1),(254563,1),(256183,1),(253448,1),(305744,1),(305700,1),(303661,1),(303461,1),(305153,1),(303985,1),(304915,1),(297475,1),(305287,1),(305230,1),(305438,1),(305327,1),(256656,1),(256426,1),(255136,1),(253370,1),(2543
backend_type     | client backend

You can see (comparing query_start with the `date -uIs` output above) that this query started over 30 minutes ago!

The Rust process seems to be pegging a core, while the postgres backend process is doing nothing:

[nix-shell:~]# pidstat -h -r -u -v -p 3570307,3570312 5
Linux 6.6.27 (clark)    04/21/2024      _x86_64_        (4 CPU)

# Time        UID       PID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM threads   fd-nr  Command
09:53:44 AM   991   3570307   98.40    0.00    0.00    0.60   98.40     3      0.00      0.00  407336   61636   0.77       5      10  .pr-tracker-fet
09:53:44 AM    71   3570312    0.00    0.00    0.00    0.00    0.00     3      0.00      0.00  339664  264900   3.30       1      67  .postgres-wrapp

I think the postgres backend process is just waiting on a socket:

[nix-shell:~]# strace -p 3570312
strace: Process 3570312 attached
epoll_wait(4,

And here's some gdb information on the Rust process. It seems to be in the middle of doing some string formatting in the guts of sqlx:

[nix-shell:~]# gdb -p 3570307
...
(gdb) info threads
  Id   Target Id                                             Frame
* 1    Thread 0x7f41a85f1dc0 (LWP 3570307) ".pr-tracker-fet" 0x0000560955f4c06e in <core::iter::adapters::take::Take<I> as core::iter::traits::iterator::Iterator>::next ()
  2    Thread 0x7f41a85ed6c0 (LWP 3570308) "tokio-runtime-w" 0x00007f41a86fd1e6 in epoll_wait ()
   from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
  3    Thread 0x7f41a83ec6c0 (LWP 3570309) "tokio-runtime-w" 0x00007f41a86facfd in syscall ()
   from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
  4    Thread 0x7f41a01e86c0 (LWP 3570310) "tokio-runtime-w" 0x00007f41a86facfd in syscall ()
   from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
  5    Thread 0x7f41a3fff6c0 (LWP 3570311) "tokio-runtime-w" 0x00007f41a86facfd in syscall ()
   from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
(gdb) thread apply 1 bt

Thread 1 (Thread 0x7f41a85f1dc0 (LWP 3570307) ".pr-tracker-fet"):
#0  0x0000560955f4c06e in <core::iter::adapters::take::Take<I> as core::iter::traits::iterator::Iterator>::next ()
#1  0x0000560955f4c56c in <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter ()
#2  0x0000560955f524b1 in sqlformat::tokenizer::get_uc_words ()
#3  0x0000560955f4e988 in sqlformat::tokenizer::tokenize ()
#4  0x0000560955f4c709 in sqlformat::format ()
#5  0x0000560955f3f424 in sqlx_core::logger::QueryLogger::finish ()
#6  0x0000560955cb59f5 in sqlx_postgres::connection::executor::<impl sqlx_postgres::connection::PgConnection>::run::{{closure}}::{{closure}}::{{closure}} ()
#7  0x0000560955cd6fb4 in <futures_util::future::future::fuse::Fuse<Fut> as core::future::future::Future>::poll ()
#8  0x0000560955cd707d in <sqlx_core::ext::async_stream::TryAsyncStream<T> as futures_core::stream::Stream>::poll_next ()
#9  0x0000560955cb13e7 in sqlx_postgres::connection::executor::<impl sqlx_core::executor::Executor for &mut sqlx_postgres::connection::PgConnection>::fetch_many::{{closure}}::{{closure}} ()
#10 0x0000560955cd6fb4 in <futures_util::future::future::fuse::Fuse<Fut> as core::future::future::Future>::poll ()
#11 0x0000560955cd707d in <sqlx_core::ext::async_stream::TryAsyncStream<T> as futures_core::stream::Stream>::poll_next ()
#12 0x0000560955cf32b4 in <futures_util::stream::try_stream::try_filter_map::TryFilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next ()
#13 0x0000560955cbb54e in <futures_util::stream::try_stream::try_collect::TryCollect<St,C> as core::future::future::Future>::poll ()
#14 0x0000560955cee7c4 in sqlx_core::query::Query<DB,A>::execute::{{closure}} ()
#15 0x0000560955ce2c08 in pr_tracker_fetcher::run::{{closure}} ()
#16 0x0000560955ced842 in tokio::runtime::park::CachedParkThread::block_on ()
#17 0x0000560955cc2bea in tokio::runtime::context::runtime::enter_runtime ()
#18 0x0000560955ce0003 in pr_tracker_fetcher::main ()
#19 0x0000560955ceed23 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#20 0x0000560955d1800d in std::rt::lang_start::{{closure}} ()
#21 0x00005609561ff021 in std::rt::lang_start_internal ()
#22 0x0000560955ce019c in main ()

I believe we're stuck here: https://github.com/molybdenumsoftware/pr-tracker/blob/v3.0.0/crates/store/src/lib.rs#L468
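
I haven't dug into exactly what that line does, but given the statement visible in pg_stat_activity above, I'd guess it builds one giant multi-row VALUES insert, roughly along these lines (this is only a sketch with made-up names, not the actual store code):

use sqlx::{PgConnection, Postgres, QueryBuilder};

// Hypothetical shape of the landings insert; the real code lives in
// crates/store/src/lib.rs and may differ in detail.
async fn insert_landings(
    conn: &mut PgConnection,
    landings: &[(i32, i32)], // (github_pr, branch_id)
) -> sqlx::Result<()> {
    let mut builder: QueryBuilder<'_, Postgres> =
        QueryBuilder::new("INSERT INTO landings(github_pr, branch_id) ");

    // With tens of thousands of landings this produces one very large
    // statement, which would explain the multi-megabyte query text above.
    // pg_stat_activity shows literal values rather than $n placeholders,
    // so the values are presumably interpolated rather than bound.
    builder.push_values(landings, |mut row, landing| {
        row.push(landing.0).push(landing.1);
    });
    builder.push(" ON CONFLICT (github_pr, branch_id) DO NOTHING");

    builder.build().execute(conn).await?;
    Ok(())
}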

Theory: under the hood, sqlx "streams" the statement out to postgresql, which is why the query already shows up on the postgresql side, but sqlx is still in the middle of streaming it, so postgresql has nothing to do yet.
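
Another detail from that backtrace: the top frames are sqlformat::format being called from sqlx_core::logger::QueryLogger::finish, i.e. sqlx's statement logging/pretty-printing rather than the network send. One way to check whether that's where the time is going would be to open the connection with statement logging disabled. A minimal sketch, assuming the connection is built from PgConnectOptions (not necessarily how pr-tracker-store does it, and the exact method signatures differ a bit between sqlx versions):

use std::time::Duration;

use log::LevelFilter;
use sqlx::postgres::{PgConnectOptions, PgConnection};
use sqlx::{ConnectOptions, Connection};

// Hypothetical diagnostic: open the database connection with sqlx's
// statement logging turned off, to see whether the hour-long stall
// disappears. This uses sqlx's public ConnectOptions API; pr-tracker-store
// may build its connection differently.
async fn connect_without_statement_logging(url: &str) -> sqlx::Result<PgConnection> {
    let options = url
        .parse::<PgConnectOptions>()?
        // Don't log every statement...
        .log_statements(LevelFilter::Off)
        // ...and don't hit the "slow statement" warning path visible at the
        // top of the backtrace (QueryLogger::finish -> sqlformat::format).
        .log_slow_statements(LevelFilter::Off, Duration::from_secs(1));

    PgConnection::connect_with(&options).await
}

If the stall disappears with logging off, that would point at the query formatter rather than the streaming itself.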

jfly commented 2 months ago

That query finally finished, and sqlx logged a slow query warning:

$ journalctl -u pr-tracker-fetcher
...
Apr 21 10:37:16 clark pr-tracker-fetcher-start[3570307]: [2024-04-21T17:37:15Z WARN  sqlx::query] slow statement: execution time exceeded alert threshold summary="INSERT INTO landings(github_pr, branch_id) …" db.statement="\n\nINSERT INTO\n  landings(github_pr, branch_id)\nVALUES\n
[... ~2.8 mb of text ...]
ON CONFLICT (github_pr, branch_id) DO NOTHING\n" rows_affected=0 rows_returned=0 elapsed=1.937380964s slow_threshold=1s

So I guess that means that postgresql was able to chug through the ~8mb query in a bit under 2 seconds, but I also know that from postgresql's perspective, this query started quite some time ago (see above):

query_start      | 2024-04-21 16:26:20.188579+00

So from postgresql's perspective, over an hour elapsed between query_start and the warning being logged:

$ nix-shell -p 'python3.withPackages (ps: with ps; [ dateutil ])' --run python
>>> import datetime as dt; from dateutil.parser import parse
>>> (parse("2024-04-21T17:37:15Z") - parse("2024-04-21 16:26:20.188579+00")).total_seconds()/3600
1.181892061388889