Sessions getting stuck at active while inserting from oracle to postgres

ChristophorusReyhan commented 6 months ago

Background: I'm trying to insert select of multiple tables from oracle to postgres using parallel. The command goes something like: echo $tables | tr ' ' '\n' | parallel --eta --colsep '\t' --ungroup -j$num_cores insert_to_pg where the $tables contains variable list of tables to be inserted. Function insert_to_pg contains the insert command like: psql -U postgres -d dbname -c "INSERT INTO $1 select * from "sourcedb_fdw".$1;" The problem is, sometime, the session is stuck at being active while not processing anything. On the oracle side, that session is already inactive. It always stuck at certain table. But when I execute it manually inside psql, insert to that table work just fine. Here's last few lines of the strace on that stuck session:

20:22:03 pwrite64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 440516608) = 8192
20:22:03 write(8, "\0\25\0\0\6\0\0\0\0\0\3\5V\2\0\0\0\311\0\0\0", 21) = 21
20:22:03 read(8, 0x1030fb6, 8208)       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
20:22:10 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
20:22:10 rt_sigreturn({mask=[URG]})     = 0
20:22:10 read(8, 0x1030fb6, 8208)       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
20:22:51 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=344, si_uid=26} ---
20:22:51 rt_sigreturn({mask=[URG]})     = 0
20:22:51 read(8,

It's just stuck at read(8, Anyone understands why this happens? Any help or pointer to debug this would be appreciated.

laurenz commented 6 months ago

If the Oracle client is waiting for a reply from the Oracle server and vice versa, that sounds like a network problem. I have no idea why that should happen only when you are running the commands from a shell script. One think you can try is to run an Oracle client trace and see what you get in the trace file.

ChristophorusReyhan commented 6 months ago

I found the following from oracle alert log:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Time: 21-MAY-2024 10:37:37
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: <unknown>
opiodr aborting process unknown ospid (24559) as a result of ORA-609
Tue May 21 10:39:55 2024

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Time: 21-MAY-2024 10:39:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=<IP ADDR>)(PORT=<PORT>))
WARNING: inbound connection timed out (ORA-3136)

The trace in oracle side: stuck_trace.txt

How do I trace from client side if I'm using oracle_fdw?

laurenz commented 6 months ago

See here for a short description of how to get an Oracle client trace. I did a web search on TNS-12535 and got the feeling that it is usually a network problem.

ChristophorusReyhan commented 6 months ago

Okay, on the client side it created some trace file with name like cli_.trc. Here's the trace after being converted using trcasst: cli_14095.txt

Here's the detail of my fdw:

select oracle_diag('dbname_server');
                                      oracle_diag
---------------------------------------------------------------------------------------
 oracle_fdw 2.6.0, PostgreSQL 15.5, Oracle client 11.1.0.7.0, Oracle server 11.2.0.3.0
(1 row)

I think it's stuck in a loop?

ChristophorusReyhan commented 6 months ago

Huh, I think because I run multiple insert in parallel, it used a lot of network bandwith, which it caused unstable connection, and results in WARNING: inbound connection timed out (ORA-3136) I configured sqlnet.ora to:

SQLNET.INBOUND_CONNECT_TIMEOUT=60
SQLNET.SEND_TIMEOUT=60
SQLNET.RECV_TIMEOUT=60
SQLNET.EXPIRE_TIME=10

and for now, it works just fine. I'll close this for now. Thank you so much for the pointer!

laurenz commented 6 months ago

Thank you for your diagnosis and solution, that may be valuable information for others.

I always look at the raw trace file — I cannot see an ORA-03136 in the cooked version you provided.

ChristophorusReyhan commented 6 months ago

Hmm, it still happens when I did a lot of parallel select from the oracle fdw. Currently I'm using the workaround in postgresql conf so that if it runs too long, it drops the connection:

idle_in_transaction_session_timeout = '10min'
idle_session_timeout = '180min'

I also think it's related to issue #132 .

I will also mention, sometime my entire server sometimes freeze/unresponsive (including I can't connect a new session from ssh) when running this parallel script.

ChristophorusReyhan commented 6 months ago

Sadly, I can't reproduce this so that it always run into that problem ever since I changed sqlnet.ora in the oracle side. Sometimes it does it, sometimes it doesn't.

However, here's the last few lines of the raw trace file where the cooked version originated (I omitted lines above this because it contains sensitive info)

[22-MAY-2024 23:33:43:290] nsbasic_brc: 00 00 00 00 00 00 00 00  |........|
[22-MAY-2024 23:33:43:290] nsbasic_brc: 00 00 00 00 00 00 00 00  |........|
[22-MAY-2024 23:33:43:290] nsbasic_brc: 00 00 00 00 00 00 00 00  |........|
[22-MAY-2024 23:33:43:290] nsbasic_brc: 00 00 00 00              |....    |
[22-MAY-2024 23:33:43:290] nsbasic_brc: exit: oln=0, dln=3786, tot=3796, rc=0
[22-MAY-2024 23:33:43:290] nioqrc: exit
[22-MAY-2024 23:33:43:291] nioqsn: entry
[22-MAY-2024 23:33:43:291] nioqsn: exit
[22-MAY-2024 23:33:43:291] nioqrc: entry
[22-MAY-2024 23:33:43:291] nsbasic_bsd: entry
[22-MAY-2024 23:33:43:291] nsbasic_bsd: tot=0, plen=21.
[22-MAY-2024 23:33:43:291] nsbasic_bsd: packet dump
[22-MAY-2024 23:33:43:291] nsbasic_bsd: 00 15 00 00 06 00 00 00  |........|
[22-MAY-2024 23:33:43:291] nsbasic_bsd: 00 00 03 05 6B 02 00 00  |....k...|
[22-MAY-2024 23:33:43:291] nsbasic_bsd: 00 32 00 00 00           |.2...   |
[22-MAY-2024 23:33:43:291] nsbasic_bsd: exit (0)
[22-MAY-2024 23:33:43:291] nsbasic_brc: entry: oln/tot=0
[22-MAY-2024 23:33:43:292] nsbasic_brc: type=6, plen=3673
[22-MAY-2024 23:33:43:292] nsbasic_brc: what=1, tot =1420

laurenz commented 6 months ago

There is no error message in the part of the trace you are showing. See if there is one somewhere else in the trace; otherwise, the trace is not from an Oracle database session that experienced an error.

There may be a connection to the ticket you are mentioning. There, the problem turned out to be in the network.

If the whole system freezes up, try to check how the situation of the OS resources is. Is the disk or the CPU overloaded? Are you running OOM, and PostgreSQL is running in a container that makes it freeze in that case?

ChristophorusReyhan commented 6 months ago

There may be a connection to the ticket you are mentioning. There, the problem turned out to be in the network.

I encountered this error again today, and the problem goes away after I'm restarting the oracle db service. I'm kinda lost.

If the whole system freezes up, try to check how the situation of the OS resources is. Is the disk or the CPU overloaded? Are you running OOM, and PostgreSQL is running in a container that makes it freeze in that case?

CPU was not overloaded, it depends on how much worker I assign for the parallel, for example if I only assign 4 workers, I only used 4 cores max from available 8 cores. I didn't check the IO load, will do if it runs into that problem again. It didn't run OOM since I only used 80% of total memory for the postgres. It's running on VM so I think it can also be the problem?

laurenz commented 6 months ago

I don't have a good clue either. Would you agree with me that the problem is somewhere between Oracle client, the network and the Oracle server? At any rate, outside of the scope of oracle_fdw.

ChristophorusReyhan commented 6 months ago

Oracle client trace didn't throw any error, it just stuck there, while the session remains active in postgres side.

On network side, tcp keepalive is the default of 7200.

Oracle server only throws WARNING: inbound connection timed out (ORA-3136).

Running the script in single CPU works fine, but it takes longer.

For now, I will use idle timeout and sqlnet workaround. Feel free to close this, I'll add more comment when I get any lead.

Thank you so much for the help

laurenz commented 6 months ago

I don't have a good clue either. Would you agree with me that the problem is somewhere between Oracle client, the network and the Oracle server? At any rate, outside of the scope of oracle_fdw.

ChristophorusReyhan commented 6 months ago

I'll add more info, I think it also related to issue #336.

But on the host side, sar doesn't seem weird, the highest it went is 09:10:01 AM all 1.44 0.06 4.15 0.00 0.00 94.35 (I set the worker to only 2)

Same as that ticket, it freezes for a few minutes.

laurenz / oracle_fdw

Sessions getting stuck at active while inserting from oracle to postgres #675