Postgresql hangs on some queries on foreign tables (ExclusiveLock / AccessShareLock)

fnicollet commented 8 years ago

Hello,

I am having some trouble with the oracle_fdw connector. I do many requests on the foreign tables, and sometimes, on a simple select, everything hangs (deadlock?). For instance, on that simple query:

SELECT DISTINCT "ds"."lib_imp_tour" FROM (SELECT "t"."id_esi_ctrl_imp_tour" AS "id_esi_ctrl_imp_tour", "t"."lib_imp_tour" AS "lib_imp_tour" FROM "esi"."esi_ctrl_imp_tour" AS "t") AS "ds" ORDER BY "ds"."lib_imp_tour" ASC LIMIT 1000

Here is a pgAdmin screenshot: The database seem to be completly locked up. We don't have this kind of issue when connecting to a Oracle or PG table directly, it just seems to happen when selecting data from FDW tables.

Have you ever encountered this issue or something similar? Tell me if you need more information to debug the issue

Thanks, Fabien

laurenz commented 8 years ago

This is the first time that I hear of that problem.

My first guess is that the Oracle query either takes a long time or is waiting for a lock. Try to get the Oracle query by running EXPLAIN and see if you get the same behaviour if you run that query with sqlplus.

To detect locks on the Oracle side caused by an oracle_fdw query, you could run a query like this:

SELECT l.type, l.id1, l.id2, l.lmode, l.request, l.block
   FROM v$lock l JOIN v$session s ON l.sid = s.sid
   WHERE s.module = 'postgres' AND s.action = '13917';

Here 13917 is the process ID of the PostgreSQL backend.

To find out details about long running parts of an oracle_fdw query, you can use

SELECT l.opname, l.target, l.sofar / l.totalwork AS fraction, l.sql_plan_operation
   FROM v$session_longops l JOIN v$session s ON l.sid = s.sid AND l.serial# = s.serial#
   WHERE s.module = 'postgres' AND s.action = '13917';

fnicollet commented 8 years ago

Thanks @laurenz for the quick answer as usual. So, got another one of those select queries hanging:

When I run the EXPLAIN for this query:

explain SELECT "ds"."label", "ds"."value", "ds"."__orderColumn" FROM 
(
SELECT DISTINCT "t"."nom" AS "label", "t"."nom" AS "value", "t"."nom" AS "__orderColumn" FROM "esi"."esi_esp_carto_espace" AS "t" 
WHERE ((l_esi_esp_ctrl_vocation NOT LIKE'Domaine ski nordique' and l_esi_esp_ctrl_vocation NOT LIKE 'Domaine raquette' and l_esi_esp_ctrl_vocation NOT LIKE 'Station de ski' ) AND 
(officiel = '1' and l_esi_esp_ctrl_vocation NOT IN ('Site de spéléologie') ))
) AS "ds" 
ORDER BY "ds"."__orderColumn" ASC LIMIT 500

I get the following result:

"QUERY PLAN"
"Limit  (cost=20007.53..20007.53 rows=1 width=1548)"
"  ->  Sort  (cost=20007.53..20007.53 rows=1 width=1548)"
"        Sort Key: t.nom"
"        ->  HashAggregate  (cost=20007.50..20007.51 rows=1 width=516)"
"              Group Key: t.nom, t.nom, t.nom"
"              ->  Foreign Scan on esi_esp_carto_espace t  (cost=10000.00..20000.00 rows=1000 width=516)"
"                    Oracle query: SELECT /*bcfb7b7371a4e165b76f8df4a61874e5*/ ""NOM"", ""L_ESI_ESP_CTRL_VOCATION"", ""OFFICIEL"" FROM ""ESI"".""ESI_ESP_CARTO_ESPACE"" WHERE (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Domaine ski nordique' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Domaine raquette' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Station de ski' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" <> 'Site de spéléologie') AND (""OFFICIEL"" = '1')"

I am currently waiting on the Oracle DBAs to do the v$lock queries, I will update this thread once I get the result

Fabien

fnicollet commented 8 years ago

In the meantime, quick question. Our PG connections are pooled on a pool of about 10 connections. Is that a problem if the FDW is accessed by multiple connections simultaneously? Does the FDW have a single connection opened?

Another thing, is the LIMIT done in Oracle via rownum? Or in PG?

Thanks!

laurenz commented 8 years ago

A copy of oracle_fdw is running in each backend, so it is no problem to have several connections as they don't share anything. It is almost as if the Oracle clients were on different machines.

The LIMIT clause cannot be pushed down to Oracle and will be executed in the PostgreSQL server, so there will be a “FULL TABLE SCAN” on the Oracle server.

Normally that wouldn't be a problem, since PostgreSQL will just stop fetching rows when the limit is reached, but the way your query is written, it will have to read the complete Oracle table to finish. So if the Oracle table is large, that would explain the long duration. Without the DISTINCT and the ORDER BY things would be much faster.

I cannot see how many rows the Oracle table has because you didn't ANALYZE it (the option sample_percent will be useful).

fnicollet commented 8 years ago

Thanks!

So, I tried the Oracle queries. Both return 0 record. And I didn't get the PID wrong, as when I tried removing the PID, it still shows nothing for the "postgres" module. Still, I can see at the same time in pgAdmin that I have a few locks:

If I do an EXPLAIN ANALYZE of the previous query, here is the result:

"QUERY PLAN"
"Limit  (cost=20007.53..20007.53 rows=1 width=1548) (actual time=6.565..6.640 rows=281 loops=1)"
"  ->  Sort  (cost=20007.53..20007.53 rows=1 width=1548) (actual time=6.561..6.589 rows=281 loops=1)"
"        Sort Key: t.nom"
"        Sort Method: quicksort  Memory: 60kB"
"        ->  HashAggregate  (cost=20007.50..20007.51 rows=1 width=516) (actual time=4.864..4.935 rows=281 loops=1)"
"              Group Key: t.nom, t.nom, t.nom"
"              ->  Foreign Scan on esi_esp_carto_espace t  (cost=10000.00..20000.00 rows=1000 width=516) (actual time=1.791..4.468 rows=310 loops=1)"
"                    Oracle query: SELECT /*bcfb7b7371a4e165b76f8df4a61874e5*/ ""NOM"", ""L_ESI_ESP_CTRL_VOCATION"", ""OFFICIEL"" FROM ""ESI"".""ESI_ESP_CARTO_ESPACE"" WHERE (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Domaine ski nordique' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Domaine raquette' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" NOT LIKE 'Station de ski' ESCAPE '\') AND (""L_ESI_ESP_CTRL_VOCATION"" <> 'Site de spéléologie') AND (""OFFICIEL"" = '1')"
"Planning time: 96.369 ms"
"Execution time: 6.894 ms"

The table itself is rather small, only 437 records, even a full scan it should be ok?

I don't know how to use the "sample_percent" option. Is it an option on the foreign table or on the ANALYZE query?

laurenz commented 8 years ago

If the table is that small, forget about sample_percent. Also, the problem cannot be the duration of a full table scan.

The reason that the Oracle query didn't return anything is probably that the query has not run yet or is already done.

Are there any locks in the PostgreSQL cluster that are not granted (accordé)?

If not, it seems like some “internal” lock that blocks the queries. I notice that there are two running queries – is that always the case when this problem arises?

Can you take a stack trace from those hanging queries and post it (with PostgreSQL and oracle_fdw debugging symbols, please)?

fnicollet commented 8 years ago

Well, it always shows up like this, with two locks (one AccessShareLock and one ExclusiveLock) on the same PID (because of the subquery?):

Is that what you meant? The locks always show up as "granted". (Accordé = oui in my french case :) )

I am going to try and provide you some stack traces

laurenz commented 8 years ago

The locks look completely normal – one is the transaction log that is always there, and the other is the lock on a table that is taken before reading. So the problem is not with the PostgreSQL locks.

fnicollet commented 8 years ago

Never done anything like this, so here is what I have down so far :) So, I installed Fedora's postgresql-debuginfo-9.5.4-1.fc24.x86_64.rpm Then I ran with my PID (gdb version is latest stable, 7.6.1-80.e17):

sudo gdb -p 18544

Which gave me a gdb prompt. If I type "bt", I get the following output:

#0  0x00007f0b28aa04a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x00007f0af94b2920 in snttread () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#2  0x00007f0af94b1db5 in nttfprd () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#3  0x00007f0af94adb25 in nsbasic_brc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#4  0x00007f0af94ad926 in nsbrecv () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#5  0x00007f0af94a39be in nioqrc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#6  0x00007f0af94b959d in ttcdrv () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#7  0x00007f0af94a84c9 in nioqwa () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#8  0x00007f0af948a5cc in upirtrc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#9  0x00007f0af76a38d1 in upirtr () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#10 0x00007f0af7706354 in kpurcs () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#11 0x00007f0af767795c in kputxst () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#12 0x00007f0af76a820b in OCITransStart () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#13 0x00007f0afa147751 in oracleGetSession (connectstring=0x2407520 "//tournesol/SIGDR", user=0x24076e0 "esi", password=0x24077f0 "esi",
    nls_lang=<optimized out>, tablename=tablename@entry=0x2407158 "esi_esp_carto_espace", curlevel=1) at oracle_utils.c:462
#14 0x00007f0afa13ec87 in getFdwState (foreigntableid=foreigntableid@entry=106300, plan_costs=plan_costs@entry=0x7ffd0d21e5a7 "",
    sample_percent=sample_percent@entry=0x0) at oracle_fdw.c:2263
#15 0x00007f0afa141e88 in oracleGetForeignRelSize (root=0x2256af8, baserel=0x24050b8, foreigntableid=106300) at oracle_fdw.c:786
#16 0x000000000060c5b3 in set_rel_size ()
#17 0x000000000060d265 in make_one_rel ()
#18 0x00000000006263e6 in query_planner ()
#19 0x0000000000627e13 in grouping_planner ()
#20 0x000000000062b205 in subquery_planner ()
#21 0x000000000062b52b in standard_planner ()
#22 0x00000000006aaaae in pg_plan_query ()
#23 0x00000000006aab84 in pg_plan_queries ()
#24 0x0000000000786806 in BuildCachedPlan ()
#25 0x00000000007869c7 in GetCachedPlan ()
#26 0x00000000006ac0d5 in PostgresMain ()
#27 0x0000000000469121 in ServerLoop ()
#28 0x0000000000656018 in PostmasterMain ()
#29 0x0000000000469d4e in main ()

If I keep getting "bt", it is always the same stack trace.

Looks like it is stuck in oracleGetSession / OCITransStart? Does that help?

Let me know if you need me to run further tests Fabien

fnicollet commented 8 years ago

Also got the lock on an UPDATE from a PG TRIGGER, stack trace is similar:

#0  0x00007f0b28aa04a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x00007f0af61e5920 in snttread () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#2  0x00007f0af61e4db5 in nttfprd () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#3  0x00007f0af61e0b25 in nsbasic_brc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#4  0x00007f0af61e0926 in nsbrecv () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#5  0x00007f0af61d69be in nioqrc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#6  0x00007f0af61ec59d in ttcdrv () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#7  0x00007f0af61db4c9 in nioqwa () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#8  0x00007f0af61bd5cc in upirtrc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#9  0x00007f0af61c7de6 in kpurcsc () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#10 0x00007f0af61c0d74 in kpuexec () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#11 0x00007f0af61bdf79 in OCIStmtExecute () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#12 0x00007f0af6e7d692 in oracleExecuteQuery (session=0x251c810, oraTable=0x25032c8, paramList=<optimized out>) at oracle_utils.c:1980
#13 0x00007f0af6e7604c in oracleExecForeignUpdate (estate=<optimized out>, rinfo=<optimized out>, slot=0x25369f0, planSlot=0x24e1be0)
    at oracle_fdw.c:1797
#14 0x00000000005c7a6c in ExecUpdate ()
#15 0x00000000005c83a7 in ExecModifyTable ()
#16 0x00000000005b0f18 in ExecProcNode ()
#17 0x00000000005ae0e0 in standard_ExecutorRun ()
#18 0x00000000005d504e in _SPI_execute_plan ()
#19 0x00000000005d5261 in SPI_execute ()
#20 0x00007f0afa044800 in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#21 0x00007f0afa0459c9 in exec_stmt_block () from /usr/pgsql-9.5/lib/plpgsql.so
#22 0x00007f0afa0438af in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#23 0x00007f0afa043988 in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#24 0x00007f0afa044978 in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#25 0x00007f0afa046ced in exec_for_query () from /usr/pgsql-9.5/lib/plpgsql.so
#26 0x00007f0afa043489 in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#27 0x00007f0afa044978 in exec_stmts () from /usr/pgsql-9.5/lib/plpgsql.so
#28 0x00007f0afa045b28 in exec_stmt_block () from /usr/pgsql-9.5/lib/plpgsql.so
#29 0x00007f0afa046623 in plpgsql_exec_trigger () from /usr/pgsql-9.5/lib/plpgsql.so
#30 0x00007f0afa03adfa in plpgsql_call_handler () from /usr/pgsql-9.5/lib/plpgsql.so
#31 0x0000000000593c87 in ExecCallTriggerFunc ()
#32 0x00000000005991f1 in ExecBRUpdateTriggers ()
#33 0x00000000005c7b31 in ExecUpdate ()
#34 0x00000000005c83a7 in ExecModifyTable ()
#35 0x00000000005b0f18 in ExecProcNode ()
#36 0x00000000005ae0e0 in standard_ExecutorRun ()
#37 0x00000000006adf3e in ProcessQuery ()
#38 0x00000000006ae16d in PortalRunMulti ()
#39 0x00000000006aec5d in PortalRun ()
#40 0x00000000006ac211 in PostgresMain ()
#41 0x0000000000469121 in ServerLoop ()
#42 0x0000000000656018 in PostmasterMain ()
#43 0x0000000000469d4e in main ()

Could it be a bad configuration on the Oracle side (a number of max transaction or something)? are there some Oracle parameters to check out?

Thanks again, Fabien

fnicollet commented 8 years ago

Looks like it could be related to my distribution (CentOS 7.2) / Oracle client, see this SO thread: http://stackoverflow.com/questions/8489049/oracle-oci-fetch-delays-and-fails Would oracle_fdw work with Oracle Client v10 though? Some people say that OCI is not thread-safe, would that be a problem for oracle_fdw, the way it handles requests in parrallel?

fnicollet commented 8 years ago

Switched to OCI 10.2.0.5, also updated oracle_fdw to latest master. Let's see if I still get the locks ...

fnicollet commented 8 years ago

No luck, still getting random locks even with configuration "oracle_fdw 1.5.1devel, PostgreSQL 9.5.4, Oracle client 10.2.0.5.0"

laurenz commented 8 years ago

I can't see the connection with the stackoverflow question…

Thread safety is irrelevant here, since PostgreSQL is single-threaded (each backend is its own process).

Your stack traces show that the query hangs with the Oracle client waiting for a response from the Oracle server.

It would be interesting to know what the Oracle server does.

One simple thing you could try is to identify the Oracle server process and see what it does with strace -p or connect to it with gdb and get a stack trace (assuming it is on Linux; other platforms have other tools).

You can find out the Oracle server process for PostgreSQL backend 13917 like this:

SELECT p.spid
FROM v$process p JOIN v$session s ON p.addr = s.paddr
WHERE s.module = 'postgres' AND s.action = '13917';

Another, more difficult but probably more informative method is to start an Oracle server trace.

You can do this in Oracle with

SQL> SHOW PARAMETER service_names

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
service_names                        string      TESTDB

BEGIN
   DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE(
      service_name => 'TESTDB',
      module_name => 'postgres',
      action_name => '13917',
      waits => TRUE,
      binds => TRUE
   );
END;
/

You can omit action_name to trace all oracle_fdw sessions. You stop tracing with

BEGIN
   DBMS_MONITOR.SERV_MOD_ACT_TRACE_DISABLE(
      service_name => 'TESTDB',
      module_name => 'postgres',
      action_name => '13917'
   );
END;
/

I think this will only trace sessions that are already established when the call is made. The challenge is to start this before the server process hangs so you see something in the trace.

This will generate trace files in the trace location that is different for different server versions (grep for the query you are looking for). Maybe that can help narrowing down the problem.

fnicollet commented 8 years ago

Hello,

I am going to have an access to the Oracle DB later to do the query on v$session (I have full access on the PG server but not on the Oracle server). In the oracle_fdw description, it states:

Oracle connections are cached for the duration of the PostgreSQL session

In our program, we have a pool of PGSQL connections (JDBC), so each will have a set of Oracle connections in the FDW. Are these Oracle connections regularly checked to check if they are still alive?

We have seen with other customers that sometimes (not sure about this specific customer), they have system to kill connections after some time, or maybe their number of max process has been reached. In such a case, Oracle (or other components of the system) would kill the connection. If the connection is closed, does oracle_fdw know about it? Would it stay stuck in "OCITransStart", sending some data on a closed connection?

Thanks again for your time on this! Fabien

fnicollet commented 8 years ago

So, I have this stuck request: Just got the access to Oracle, when I used your request, I don't have anything that comes up with this PID. If I remove the PID where clause, I get 5 records: None of them have an "ACTION" that matches. When checking the PG stack strace, it is stuck on OCITransStart, just like before.

fnicollet commented 8 years ago

Just got another lock, this time on an "ACTION" which exist in the v$sessions (the 5027 one). I can grab the PSID with your request, but I cannot install any debug tools on that machine, so won't be able to trace like you told me to.

Is there something that can be done in the oracle_fdw.c file that could help troubleshoot the issue? Because I can add stuff and recompile it, no problem

At the moment, I am 100% lost :/

laurenz commented 8 years ago

If the Oracle connection gets terminated by the server or a firewall, oracle_fdw should recover nicely. For these expected Oracle errors

ORA-00028: your session has been killed
ORA-01012: not logged on
ORA-03113: end-of-file on communication channel
ORA-03135: connection lost contact

the connection will be re-established silently (see the code here). For other error messages you will get an error once in PostgreSQL, but the next statement in the same session should work ok (see the code here).

I am pretty certain that the problem is on the Oracle server – in both stack traces the query is hanging with the Oracle client reading on the network and waiting for a response from the Oracle server.

I am not really an Oracle guru, maybe your Oracle administrators can come up with some additional ideas what to check if you ask them what the Oracle server process is doing when the client waits for a response. Perhaps my queries from above can shed light on this.

Two more things:

Did you ever wait for two hours with a hanging query like that? There are some rather long TCP defaut timeouts that could cause a broken network connection to hang for a long time before an error is reported.
You could trace the Oracle client. For that you'd write something like this in your client's sqlnet.ora file:
```
DIAG_ADR_ENABLED = OFF
TRACE_DIRECTORY_CLIENT = /directory/with/room
TRACE_LEVEL_CLIENT = SUPPORT
SQLNET.ENCRYPTION_CLIENT = REJECTED
```
(the last entry only makes the packets more readable). Then you have to start a new PostgreSQL session, and the client will write tons of trace files. If it hangs, these could be examined to see what the client did immediately before the problem occurred. Maybe that gives us a clue.

fnicollet commented 8 years ago

Thanks a lot, I didn't think about the TCP timeout. I am not sure it stays for 2 hours but it stays for a long time then gets released. I'll check how much exactly and compare with the TCP timeout (on the PG machine or the Oracle machine?)

laurenz commented 8 years ago

On the PostgreSQL machine, I guess.

fnicollet commented 7 years ago

I am also going to check out this "Dead Connection Detection" Oracle thingie: http://www.oracle.com/technetwork/database/enterprise-edition/oraclenetdcd-2179641.pdf Looks like Oracle has a 15 minutes timeout by default. Might be a good place to look as well Also relevant: https://zhefeng.wordpress.com/2009/06/15/oracle-connection-idle-timeout-with-firewall/

laurenz commented 7 years ago

I just noticed that you wrote:

I am not sure it stays for 2 hours but it stays for a long time then gets released.

What do you mean by “gets released”? Does the query finish eventually? Do you get an error message?

Sure, reducing the detection time for broken connections is going to help – you'd get the error message sooner.

fnicollet commented 7 years ago

It's just that I don't see the lock anymore in pgAdmin. It never actually finished I think. Thing is, we also have a timeout on the web UI, so I wouldn't see the result anyway.

I have an access to the server, so I'll try again soon

On Tue, Oct 25, 2016 at 5:03 PM, Laurenz Albe notifications@github.com wrote:

I just noticed that you wrote:

I am not sure it stays for 2 hours but it stays for a long time then gets released.

What do you mean by “gets released”? Does the query finish eventually? Do you get an error message?

Sure, reducing the detection time for broken connections is going to help – you'd get the error message sooner.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/laurenz/oracle_fdw/issues/132#issuecomment-256061403, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLtoV3XkzPkgF2ToNmyebXSwJIQYgotks5q3hoygaJpZM4KYzVg .

Fabien Nicollet Tel : +33 (0)6 60 54 72 20

fnicollet commented 7 years ago

So, IT told me they disabled the firewall, but I can't really be sure if something else could interfere with the system. But I have a good hint! The lock gets released (not sure if complete or crashed) always after 15-16 minutes and in the Oracle "Dead Connection Detection" document, it says that the default TCP timeout is about 15 minutes. So I think I am getting closer to the problem ...

laurenz commented 7 years ago

You have to distinguish between timeouts when you send and when you receive.

If you want to send on a TCP connections and the other side is dead, the default TCP retransmission setting on Linux (15) should cause a hang of about 11 minutes before you get an error.

When reading from a connection that is dead, TCP will wait much longer before it sends keepalive packets. In that case, the default settings on Linux would cause a timeout of slightly more than two hours before reporting a failure if the other side is dead.

The funny thing is that to get where the Oracle client seems to be, it first has to send something ("start a new transaction" or "execute this statement"), which seems to work, but then, when the client waits for the answer, nothing comes back. Just as if the server had died after receiving the request.

So what do you observe? Do you get an error message after 15 minutes? If yes, what is the error message?

Depending on the message (if it is TCP related) it might be interesting to run a network trace with tcpdump or Wireshark when a query on a foreign table hangs.

fnicollet commented 7 years ago

It's hard to say, I haven't checked the log file of our program right after the 15 minutes, but also, our program might silently recover from the loss of connection in the connection pool and show nothing in the log file.

Could it also be caught earlier by the FDW, in one of the case your mentionned (the list of Oracle errors)? Does oracle_fdw log stuff anywhere when recovering from a connection loss?

On Fri, Oct 28, 2016 at 12:20 PM, Laurenz Albe notifications@github.com wrote:

You have to distinguish between timeouts when you send and when you receive.

If you want to send on a TCP connections and the other side is dead, the default TCP retransmission setting on Linux (15) should cause a hang of about 11 minutes before you get an error.

When reading from a connection that is dead, TCP will wait much longer before it sends keepalive packets. In that case, the default settings on Linux would cause a timeout of slightly more than two hours before reporting a failure if the other side is dead.

The funny thing is that to get where the Oracle client seems to be, it first has to send something ("start a new transaction" or "execute this statement"), which seems to work, but then, when the client waits for the answer, nothing comes back. Just as if the server had died after receiving the request.

So what do you observe? Do you get an error message after 15 minutes? If yes, what is the error message?

Depending on the message (if it is TCP related) it might be interesting to run a network trace with tcpdump or Wireshark when a query on a foreign table hangs.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/laurenz/oracle_fdw/issues/132#issuecomment-256886917, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLtoS0puZDgYzXebflLESr7xJ6Mmvz_ks5q4cxhgaJpZM4KYzVg .

Fabien Nicollet Tel : +33 (0)6 60 54 72 20

laurenz commented 7 years ago

oracle_fdw will silently recover from a connection loss experienced during OCITransStart, because that is the first thing that is sent after a longer intermission, so it is kind of expected there.

A connection loss in OCIStmtExecute will cause an error message, however.

fnicollet commented 7 years ago

In the postgresql logs? Will the default pg config logs those errors?

On Fri, Oct 28, 2016 at 12:57 PM, Laurenz Albe notifications@github.com wrote:

oracle_fdw will silently recover from a connection loss experienced during OCITransStart, because that is the first thing that is sent after a longer intermission, so it is kind of expected there.

A connection loss in OCIStmtExecute will cause an error message, however.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/laurenz/oracle_fdw/issues/132#issuecomment-256893015, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLtoQBwzZ3gv3_-_1rR84h76b_KD-O3ks5q4dUlgaJpZM4KYzVg .

Fabien Nicollet Tel : +33 (0)6 60 54 72 20

laurenz commented 7 years ago

If log_min_messages and log_min_error_statement are error or better, it will be in the PostgreSQL server log.

Also, your application will receive an error message from the query. Are these not logged?

fnicollet commented 7 years ago

Thanks. They are probably logged, will have to check !

On Fri, Oct 28, 2016 at 1:39 PM, Laurenz Albe notifications@github.com wrote:

If log_min_messages and log_min_error_statement are error or better, it will be in the PostgreSQL server log.

Also, your application will receive an error message from the query. Are these not logged?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/laurenz/oracle_fdw/issues/132#issuecomment-256899448, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLtoShpCywml2CBPqJCmvVdZRWbGgNnks5q4d7wgaJpZM4KYzVg .

Fabien Nicollet Tel : +33 (0)6 60 54 72 20

laurenz commented 7 years ago

I'll close this due to inactivity; don't hesitate to reopen it if new information is available.

fnicollet commented 7 years ago

Hello,

I had a little more info lately, some network/oracle expert has been looking at the issue and it actually comes from a firewall, which sends a RESET but the connection stays "half-dead" it seems. Very unsure if that's actually an issue with oracle_fdw

Anyway, better close this issue

Fabien

On Mon, Jan 23, 2017 at 12:08 PM, Laurenz Albe notifications@github.com wrote:

I'll close this due to inactivity; don't hesitate to reopen it if new information is available.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/laurenz/oracle_fdw/issues/132#issuecomment-274459744, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLtoYNtE6y762_jq_jG9ItHrfTRnsCbks5rVIpHgaJpZM4KYzVg .

-- Fabien Nicollet Tel : +33 (0)6 60 54 72 20

laurenz commented 7 years ago

Re-reading the thred, I realize I still owe you an answer to

Does oracle_fdw log stuff anywhere when recovering from a connection loss?

The answer is that it should write a message to the PostgreSQL log if you set log_min_messages to debug2 (which I wouldn't enable for a long time, because it produces a lot of log). The message would be:

oracle_fdw: session has been terminated, try to reconnect

All of what you reported in this thread (suddenly lost connections, long hangs followed by the PostgreSQL backend dying) is consistent with a network that drops connections.

I'd investigate in this direction. Maybe you can get more information out of your network experts as to why these “RESET” packets are sent. Are they talking about TCP or the Oracle protocol?

laurenz / oracle_fdw

Postgresql hangs on some queries on foreign tables (ExclusiveLock / AccessShareLock) #132