Crash with 2.6devel (array_prefetch)

philflorent commented 1 year ago

Hi, I have an issue I can reproduce with the 2.6devel (array_prefetch). I query the foreign table via count(1) and I obtain this error when I validate the transaction :


dhphil=> select oracle_diag();
                                    oracle_diag
-----------------------------------------------------------------------------------
 oracle_fdw 2.6.0devel, PostgreSQL 15.2 (Debian 15.2-1), Oracle client 19.18.0.0.0
(1 ligne)

dhphil=> select count(1) from orcl_usr_dhphil.ACT_DIM_HRSS;
  count
----------
 20020692
(1 ligne)

dhphil=*> select count(1) from ACT_DIM_HRSS;
 count
-------
     0
(1 ligne)

dhphil=*> commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
erreur SYSCALL SSL : EOF détecté

Whole cluster crash:

2023-04-07 19:59:21.321 CEST [4064446] usr_dhphil@dhphil ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-07 19:59:21.321 CEST [4064446] usr_dhphil@dhphil INSTRUCTION : commit; 2023-04-07 19:59:21.321 CEST [4064446] usr_dhphil@dhphil ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-07 19:59:21.321 CEST [4064446] usr_dhphil@dhphil ATTENTION: AbortTransaction while in ABORT state 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil ATTENTION: AbortTransaction while in ABORT state 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil ATTENTION: AbortTransaction while in ABORT state 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-07 19:59:21.322 CEST [4064446] usr_dhphil@dhphil PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-07 19:59:21.452 CEST [485162] LOG: processus serveur (PID 4064446) a été arrêté par le signal 6 : Abandon 2023-04-07 19:59:21.452 CEST [485162] DÉTAIL: Le processus qui a échoué exécutait : commit;

Automatic recover OK. I can reproduce at will. I reinstalled oracle-fdw from master and I can't reproduce the issue.

Best regards, Phil

philflorent commented 1 year ago

Hi,

I ran today my stability "select *" test and it failed. t_clob is a 100k rows version of the t_clob table of the test case (it's a bit long to obtain it, Oracle 2 recursive syntax is slow and no "generate series").
Oracle server has 8Gb of RAM, Postgres server 4Gb of RAM.

I obtain an ORA-4030. It means the lack of work_mem is in fact on Oracle side (not really an Oracle error in fact, an ORA-4030 just means "system did not give the necessary RAM") . It does not make Oracle instance to crash but rogue process abruptly quits (expected) What was not expected is that I have a complete crash of the cluster on PostgreSQL side with a segmentation fault even if I do have available memory. Its not a problem of RAM exhaustion. OOM did not activate and it is configured to be conservative anyway (=> https://pgphil.ovh/oom_killer_96_01.php , I ran the described test and it is OK my rogue process is killed but the cluster and other client processes survives)

To reproduce "select * from t_clob;" (test has to be adjusted if RAM is bigger to obtain an ORA-4030)

Output in postgres log file : OCI-21503: programme arrêté à la suite d'une erreur fatale OCI-04030: Mémoire de traitement insuffisante lors de la tentative d'affectation de 1048616 octets (Alloc environm,ttcl2i: prefetch buf alloc)

----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value)

Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 skgudmp()+154 call kgdsdst() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kgefecl()+784 call skgudmp() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kghnospc()+8845 call kgefecl() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kghalo()+3294 call kghnospc() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kpuhhalocV1()+457 call kghalo() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttclobPrefetchInit( call kpuhhalocV1() 000000000 ? 000000000 ? )+164 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttcl2i()+13500 call ttclobPrefetchInit( 000000000 ? 000000000 ? ) 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttcaGetCol()+1174 call ttcl2i() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttccfpg()+547 call ttcaGetCol() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttcfour()+814 call ttccfpg() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? ttcdrv()+1814 call ttcfour() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? nioqwa()+61 call ttcdrv() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? upirtrc()+1646 call nioqwa() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kpurcsc()+100 call upirtrc() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kpuexec()+10122 call kpurcsc() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? OCIStmtExecute()+41 call kpuexec() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? oracleExecuteQuery( call OCIStmtExecute() 000000000 ? 000000000 ? )+475 000000000 ? 000000000 ? 000000000 ? 000000000 ? oracleIterateForeig call oracleExecuteQuery( 000000000 ? 000000000 ? nScan()+512 ) 000000000 ? 000000000 ? 000000000 ? 000000000 ? 00005569C494E913 call oracleIterateForeig 000000000 ? 000000000 ? nScan() 000000000 ? 000000000 ? 000000000 ? 000000000 ? standard_ExecutorRu call 00005569C494E913 000000000 ? 000000000 ? n()+275 000000000 ? 000000000 ? 000000000 ? 000000000 ? 00005569C4AAB26E call standard_ExecutorRu 000000000 ? 000000000 ? n() 000000000 ? 000000000 ? 000000000 ? 000000000 ? PortalRun()+416 call 00005569C4AAB26E 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 00005569C4AA8ADD call PortalRun() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 00005569C4A27A91 call PostgresMain() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(41558) SELECT for reading: errno=2 PostmasterMain()+33 call 00005569C4A27A91 000000000 ? 000000000 ? 49 000000000 ? 000000000 ? 000000000 ? 000000000 ? main()+523 call PostmasterMain() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? __libc_init_first() call main() 000000000 ? 000000000 ? +138 000000000 ? 000000000 ? 000000000 ? 000000000 ?

Call stack signature: 0xcbcd3281a2609157

call stack performance statistics: total : 0.020000 sec setup : 0.000000 sec stack unwind : 0.000000 sec symbol translation : 0.010000 sec printing the call stack: 0.010000 sec printing frame data : 0.000000 sec printing argument data : 0.000000 sec

2023-04-08 15:01:56.387 CEST [151848] LOG: processus serveur (PID 153942) a été arrêté par le signal 11 : Erreur de segmentation 2023-04-08 15:01:56.387 CEST [151848] DÉTAIL: Le processus qui a échoué exécutait : select * from orcl_usr_dh00.t_clob ; 2023-04-08 15:01:56.387 CEST [151848] LOG: arrêt des autres processus serveur actifs 2023-04-08 15:01:56.408 CEST [153950] usr_dh00@dh00 FATAL: le système de bases de données est en cours de restauration 2023-04-08 15:01:56.416 CEST [151848] LOG: tous les processus serveur sont arrêtés ; réinitialisation 2023-04-08 15:01:56.890 CEST [153955] LOG: le système de bases de données a été interrompu ; dernier lancement connu à 2023-04-08 14:59:31 CEST 2023-04-08 15:01:56.892 CEST [153958] postgres@traqueur FATAL: le système de bases de données est en cours de restauration 2023-04-08 15:01:56.949 CEST [153955] LOG: le système de bases de données n'a pas été arrêté proprement ; restauration automatique en cours 2023-04-08 15:01:56.953 CEST [153955] LOG: la ré-exécution commence à 0/270A40C0 2023-04-08 15:01:56.960 CEST [153955] LOG: longueur invalide de l'enregistrement à 0/27124700 : voulait 24, a eu 0 2023-04-08 15:01:56.960 CEST [153955] LOG: rejeu exécuté à 0/271246C0 utilisation système : CPU : utilisateur : 0.00 s, système : 0.00 s, temps passé : 0.00 s 2023-04-08 15:01:56.966 CEST [153956] LOG: début du checkpoint : end-of-recovery immediate wait 2023-04-08 15:01:56.972 CEST [153961] postgres@traqueur FATAL: le système de bases de données n'accepte pas encore de connexions 2023-04-08 15:01:56.972 CEST [153961] postgres@traqueur DÉTAIL: L'état de restauration cohérent n'a pas encore été atteint. 2023-04-08 15:01:56.981 CEST [153956] LOG: checkpoint terminé : a écrit 79 tampons (0.2%); 0 fichiers WAL ajoutés, 0 supprimés, 0 recyclés ; écriture=0.010 s, synchronisation=0.003 s, total=0.017 s; fichiers synchronisés=43, plus long=0.003 s, moyenne=0.001 s; distance=513 kB, estimation=513 kB 2023-04-08 15:01:56.992 CEST [151848] LOG: le système de bases de données est prêt pour accepter les connexions

Best regards, Phil

philflorent commented 1 year ago

Hi Laurenz, Sorry I did not give an essential information : of course you have to set a "stupid" prefetch value (e.g 3000 or 5000) to obtain the error. Best regards, Phil

laurenz commented 1 year ago

I tried to reproduce this, but failed. I created a table with five rather large LOBs, but that probably wasn't enough.

Do you have a script to create that large Oracle table?

It seems like there are several problems mixed together, and I find it difficult to reproduce them.

philflorent commented 1 year ago

Here is my test:

-- oracle side

Version Oracle Linux 7.2 / Oracle 11.2.0.4.
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Oracle Linux Server release 7.2

[oracle@sv-t-vtl-bas03 ~]$ cat /proc/meminfo
MemTotal:        7914252 kB

[oracle@sv-t-vtl-bas03 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61662
max locked memory       (kbytes, -l) 134217728
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Connected to:
Oracle Database 11g Release 11.2.0.4.0 - 64bit Production

SQL> show parameter pga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 300M

create table t_clob(c1 integer primary key, val1 clob, c2_date_creation date, c3_date_maj date, val2 clob);

truncate table t_clob;
insert /*+ append */ into t_clob with serie(c1, val1, c2_date_creation, c3_date_maj, val2) as (
        select
                rownum c1, 
                dbms_random.string('L', 2000),
                sysdate-((100000-rownum)/86400),
                sysdate-((100000-rownum)/86400) + 1/1440,
                dbms_random.string('L', 6000)
        from dual
        connect by
                level <= 100000
) select * from serie;
;

commit;

SELECT
Nvl(S.segment_name, 'TABLE TOTAL SIZE') Segment_name,
Round(SUM(S.bytes) / 1024 / 1024, 1) Segment_size_MB
FROM user_segments S
WHERE S.segment_name IN ( 'T_CLOB' )
OR S.segment_name IN ((SELECT L.segment_name
FROM user_lobs L
WHERE L.table_name = 'T_CLOB'))
GROUP BY rollup( S.segment_name );

SEGMENT_NAME                             SEGMENT_SIZE_MB
---------------------------------------- ---------------
SYS_LOB0000253884C00002$$                             .1
SYS_LOB0000253884C00005$$                            792
T_CLOB                                               264
TABLE TOTAL SIZE                                  1056.1

-- postgres side

postgres@sv-t-vtl-bas09:~$ cat /etc/*se
PRETTY_NAME="Debian GNU/Linux bookworm/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=bookworm

postgres@sv-t-vtl-bas09:~$ cat /proc/meminfo
MemTotal:        4019176 kB

\d+  orcl_usr_dh00.t_clob
                                                           Table distante « orcl_usr_dh00.t_clob »
     Colonne      |              Type              | Collationnement | NULL-able | Par défaut | Options FDW  | Stockage | Cible de statistiques | Description
------------------+--------------------------------+-----------------+-----------+------------+--------------+----------+-----------------------+-------------
 c1               | numeric                        |                 | not null  |            | (key 'true') | main     |                       |
 val1             | text                           |                 |           |            |              | extended |                       |
 c2_date_creation | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 c3_date_maj      | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 val2             | text                           |                 |           |            |              | extended |                       |
Serveur : orcl_usr_dh00
Options FDW : (schema 'USR_PRODIGE31', "table" 'T_CLOB', prefetch '10000')

select * from orcl_usr_dh00.t_clob;
erreur SYSCALL SSL : EOF détecté
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

I obtain the ORA-4030 at Oracle side and the cluster crash with a segmentation fault at postgres side. If your system has much RAM and if you want want to obtain an ORA-4030 more easily it's possible to lower the user limits at Oracle side.

About the other issue (maybe a different one, I can open another topic if you want):

-- psql with autocommit off

dh00=*> select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

dh00=*> commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
erreur SYSCALL SSL : EOF détecté
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

philflorent commented 1 year ago

I reran the same test after a make/make install from the master branch:

dh00=> select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

dh00=*> commit;
COMMIT

dh00=*> explain analyze select * from orcl_usr_dh00.t_clob;

                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=10.699..305697.545 rows=100000 loops=1)
   Oracle query: SELECT /*4892aaead4d426f*/ r1."C1", r1."VAL1", r1."C2_DATE_CREATION", r1."C3_DATE_MAJ", r1."VAL2" FROM "USR_PRODIGE31"."T_CLOB" r1
 Planning Time: 7.229 ms
 Execution Time: 305740.388 ms
(4 lignes)

dh00=*> commit;
COMMIT

laurenz commented 1 year ago

\d+  orcl_usr_dh00.t_clob
[...]
Options FDW : (schema 'USR_PRODIGE31', "table" 'T_CLOB', prefetch '10000')

Well, if you set prefetch to 10000, I am not surprised that you go OOM. I tried your example, and the OOM killer killed PostgreSQL. But I agree that there should be no segmentation fault, even if Oracle goes OOM. Did you get a core dump? Can you generate a stack trace from it?

philflorent commented 1 year ago

I will try to obtain that but I can already confirm that it's a "clean" ORA-4030 at Oracle side. Process does not obtain its RAM, it obtains an error instead but oom does not wake up.
No oom at PostgreSQL side too. As a client program PostgreSQL just obtains the OCI errors produced at Oracle side (21503 "FATAL" , the real one being the 4030) and the whole cluster crashes. I didn't expect that case to work, it's just the crash that is not expected. Notice the other error is different. It is not at all expected. I just have to open a transaction, to query a foreign table via count(1) (no matter the lob_prefetch/prefetch) and to validate the transaction to reproduce. If the prefetch is not "stupid", the select * is OK (that's why I could complete the "real" test to confirm the performance is now great) but a simple select count(1) is KO (or to be more precise it is OK and the commit is KO)

dh00=> alter foreign table orcl_usr_dh00.t_clob options(set prefetch '10');
ALTER FOREIGN TABLE
dh00=*> commit;
COMMIT
dh00=> select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

dh00=*> commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
erreur SYSCALL SSL : EOF détecté
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

...
dh00=> explain analyze select * from orcl_usr_dh00.t_clob ;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=16.052..49603.517 rows=100000 loops=1)
   Oracle query: SELECT /*4892aaead4d426f*/ r1."C1", r1."VAL1", r1."C2_DATE_CREATION", r1."C3_DATE_MAJ", r1."VAL2" FROM "USR_PRODIGE31"."T_CLOB" r1
 Planning Time: 99.046 ms
 Execution Time: 49618.971 ms
(4 lignes)

Will have access to the test environment tomorrow to obtain a core dump.

philflorent commented 1 year ago

Quick exec but I don't have the necessary sym packages installed.

1) Program received signal SIGSEGV, Segmentation fault. 0x00007fdbf9f0f8c6 in slaac_int () from /opt/oracle/instantclient/libclntsh.so.19.1 Detaching from program: /usr/lib/postgresql/15/bin/postgres, process 9746 [Inferior 1 (process 9746) detached]

2) Program received signal SIGABRT, Aborted. __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: Aucun fichier ou dossier de ce type. Detaching from program: /usr/lib/postgresql/15/bin/postgres, process 9649 [Inferior 1 (process 9649) detached]

philflorent commented 1 year ago

Here is the client trace file for the ORA-4030

Trace file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_9989_140583396035136.trc kgefec: fatal error 0

at 0x7ffec7e967d8 placed kge.c@2791 OCI-21503: programme arrêté à la suite d'une erreur fatale OCI-04030: Mémoire de traitement insuffisante lors de la tentative d'affectation de 1048616 octets (Alloc environm,ttcl2i: prefetch buf alloc) Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 DDE: Flood control is not active 2023-04-13T00:14:50.818007+02:00 Incident 145 created, dump file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_145/ora_9989_140583396035136_i145.trc oci-24550 [11] [[si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=(nil)]] [] [] [] [] [] [] [] [] [] [] Dump file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_145/ora_9989_140583396035136_i145.trc [TOC00000] Jump to table of contents Dump continued from file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_9989_140583396035136.trc [TOC00001] oci-24550 [11] [[si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=(nil)]] [] [] [] [] [] [] [] [] [] [] [TOC00001-END] [TOC00002] ========= Dump for incident 145 (oci 24550 [11]) ======== Tracing is in restricted mode! at 0x7ffec7e950e8 placed dbge.c@1317 [TOC00003] ----- Short Call Stack Trace ----- dbgexPhaseII()+1869<-dbgexProcessError()+1871<-dbgePostErrorDirectVaList_int()+2100<-dbgePostErrorDirect()+798<-kpeDbgSignalHandler()+299<-skgesig_sigactionHandler()+258<-__sighandler()<-kpuhhalocV1()+705<-ttclobPrefetchInit()+164<-ttcl2i()+13500<-ttcaGetCol()+1174<-ttccfpg()+547<-ttcfour()+814<-ttcdrv()+1814<-nioqwa()+61<-upirtrc()+1646<-kpurcsc()+100<-kpuexec()+10122<-OCIStmtExecute()+41<-oracleExecuteQuery()+475<-oracleIterateForeignScan()+512Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 <-0x00005566E6878913Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(34148) EXPLAIN for reading: errno=2 <-0x00005566E6866209<-standard_ExecutorRun()+275<-ExplainOnePlan()+603[TOC00003-END] [TOC00004] ----- START Event Driven Actions Dump ---- ---- END Event Driven Actions Dump ---- [TOC00004-END] [TOC00005] ----- START DDE Actions Dump ----- Executing SYNC actions Executing ASYNC actions ----- END DDE Actions Dump (total 0 csec) ----- [TOC00005-END] End of Incident Dump [TOC00002-END] TOC00000 - Table of contents TOC00001 - Error Stack TOC00002 - Dump for incident 145 (oci 24550 [11]) | TOC00003 - Short Call Stack Trace | TOC00004 - START Event Driven Actions Dump | TOC00005 - START DDE Actions Dump End of TOC

philflorent commented 1 year ago

Without explain analyze :

Dump file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_161/ora_10159_140583396035136_i161.trc [TOC00000] Jump to table of contents Dump continued from file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_10159_140583396035136.trc [TOC00001] oci-24550 [11] [[si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=(nil)]] [] [] [] [] [] [] [] [] [] [] [TOC00001-END] [TOC00002] ========= Dump for incident 161 (oci 24550 [11]) ======== Tracing is in restricted mode!

at 0x7ffec7e954e8 placed dbge.c@1317 [TOC00003] ----- Short Call Stack Trace ----- dbgexPhaseII()+1869<-dbgexProcessError()+1871<-dbgePostErrorDirectVaList_int()+2100<-dbgePostErrorDirect()+798<-kpeDbgSignalHandler()+299<-skgesig_sigactionHandler()+258<-__sighandler()<-kpuhhalocV1()+705<-ttclobPrefetchInit()+164<-ttcl2i()+13500<-ttcaGetCol()+1174<-ttccfpg()+547<-ttcfour()+814<-ttcdrv()+1814<-nioqwa()+61<-upirtrc()+1646<-kpurcsc()+100<-kpuexec()+10122<-OCIStmtExecute()+41<-oracleExecuteQuery()+475<-oracleIterateForeignScan()+512Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(37380) SELECT for reading: errno=2 <-0x00005566E6878913<-standard_ExecutorRun()+275Cannot open postgres: 15/dh_p: usr_dh00 dh00 10.194.14.49(37380) SELECT for reading: errno=2 <-0x00005566E69D526E<-PortalRun()+416[TOC00003-END] [TOC00004] ----- START Event Driven Actions Dump ---- ---- END Event Driven Actions Dump ---- [TOC00004-END] [TOC00005] ----- START DDE Actions Dump ----- Executing SYNC actions Executing ASYNC actions ----- END DDE Actions Dump (total 0 csec) ----- [TOC00005-END] End of Incident Dump [TOC00002-END] TOC00000 - Table of contents TOC00001 - Error Stack TOC00002 - Dump for incident 161 (oci 24550 [11]) | TOC00003 - Short Call Stack Trace | TOC00004 - START Event Driven Actions Dump | TOC00005 - START DDE Actions Dump End of TOC

philflorent commented 1 year ago

About the count + commit error :

Trace file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_10081_140583396035136.trc DDE: Flood control is not active 2023-04-13T00:22:43.677407+02:00 Incident 153 created, dump file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_153/ora_10081_140583396035136_i153.trc oci-24550 [6] [[si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x6900002761]] [] [] [] [] [] [] [] [] [] []

Dump file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_153/ora_10081_140583396035136_i153.trc [TOC00000] Jump to table of contents Dump continued from file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_10081_140583396035136.trc [TOC00001] oci-24550 [6] [[si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x6900002761]] [] [] [] [] [] [] [] [] [] [] [TOC00001-END] [TOC00002] ========= Dump for incident 153 (oci 24550 [6]) ======== Tracing is in restricted mode!

at 0x7ffec7e9ffa8 placed dbge.c@1317 [TOC00003] ----- Short Call Stack Trace ----- dbgexPhaseII()+1869<-dbgexProcessError()+1871<-dbgePostErrorDirectVaList_int()+2100<-dbgePostErrorDirect()+798<-kpeDbgSignalHandler()+299<-skgesig_sigactionHandler()+258<-__sighandler()<-pthread_key_delete()+332<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0<-0x00007FDC1F61D3E0[TOC00003-END] [TOC00004] ----- START Event Driven Actions Dump ---- ---- END Event Driven Actions Dump ---- [TOC00004-END] [TOC00005] ----- START DDE Actions Dump ----- Executing SYNC actions Executing ASYNC actions ----- END DDE Actions Dump (total 0 csec) ----- [TOC00005-END] End of Incident Dump [TOC00002-END] TOC00000 - Table of contents TOC00001 - Error Stack TOC00002 - Dump for incident 153 (oci 24550 [6]) | TOC00003 - Short Call Stack Trace | TOC00004 - START Event Driven Actions Dump | TOC00005 - START DDE Actions Dump End of TOC

laurenz commented 1 year ago

Thanks for your research! I have reproduced the problem:

kgefec: fatal error 0
<error barrier> at 0x7ffc46904818 placed kge.c@2524
OCI-21503: program terminated by fatal error
OCI-04030: out of process memory when trying to allocate 2097192 bytes (Alloc environm,ttcl2i: prefetch buf alloc)
2023-04-13 12:03:17.749 CEST [11384] LOG:  server process (PID 11406) was terminated by signal 11: Segmentation fault
2023-04-13 12:03:17.749 CEST [11384] DETAIL:  Failed process was running: SELECT * FROM t_clob

The stack trace is:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fba59572599 in kgdsdst () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
(gdb) where
#0  0x00007fba59572599 in kgdsdst () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#1  0x00007fba5a221dab in skgudmp () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#2  0x00007fba5b440e25 in kgefecl () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#3  0x00007fba5954a244 in kghnospc () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#4  0x00007fba5bfca060 in kghalo () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#5  0x00007fba5bec9c59 in kpuhhalo () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#6  0x00007fba5a763bb1 in ttcl2i_initpfch () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#7  0x00007fba5bf1a4b1 in ttcl2i () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#8  0x00007fba5bf362ff in ttcaGetCol () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#9  0x00007fba5bf354a0 in ttccfpg () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#10 0x00007fba5bf3316a in ttcfour () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#11 0x00007fba5bf24e48 in ttcdrv () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#12 0x00007fba5bef7169 in nioqwa () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#13 0x00007fba5bee0962 in upirtrc () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#14 0x00007fba5beed816 in kpurcsc () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#15 0x00007fba5bee5c6b in kpuexec () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#16 0x00007fba5bee00a9 in OCIStmtExecute () from /opt/oracle/product/18c/dbhomeXE/lib/libclntsh.so.18.1
#17 0x00007fba7340fa23 in oracleExecuteQuery (session=0x1fd08f8, oraTable=0x1fd0440, paramList=0x0, prefetch=10000) at oracle_utils.c:2149
#18 0x00007fba733f83ea in oracleIterateForeignScan (node=0x1fcf3a0) at oracle_fdw.c:1373
#19 0x00000000007a6aed in ForeignNext (node=0x1fcf3a0) at nodeForeignscan.c:62
#20 0x000000000078c3c9 in ExecScanFetch (node=0x1fcf3a0, accessMtd=0x7a6a37 <ForeignNext>, recheckMtd=0x7a6b40 <ForeignRecheck>) at execScan.c:133
#21 0x000000000078c442 in ExecScan (node=0x1fcf3a0, accessMtd=0x7a6a37 <ForeignNext>, recheckMtd=0x7a6b40 <ForeignRecheck>) at execScan.c:182
#22 0x00000000007a6c60 in ExecForeignScan (pstate=0x1fcf3a0) at nodeForeignscan.c:132
#23 0x0000000000787d51 in ExecProcNodeInstr (node=0x1fcf3a0) at execProcnode.c:480
#24 0x0000000000787d1e in ExecProcNodeFirst (node=0x1fcf3a0) at execProcnode.c:464
#25 0x000000000077ae66 in ExecProcNode (node=0x1fcf3a0) at ../../../src/include/executor/executor.h:268
#26 0x000000000077dcce in ExecutePlan (estate=0x1fcf128, planstate=0x1fcf3a0, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, 
    direction=ForwardScanDirection, dest=0x1ef5fe8, execute_once=true) at execMain.c:1636
[...]

That looks like a bug in Oracle: somebody (Oracle client or server) runs out of memory, and the Oracle client code does not handle that well, tries to access memory that isn't there and causes a segmentation fault, which crashes PostgreSQL.

There is nothing I can do about that on the PostgreSQL side. I wonder whether it would be a good idea to limit prefetch at something sensible like 1000. What do you think?

I'll try to reproduce your other problem now.

philflorent commented 1 year ago

Hi Laurenz, The 1st problem is not a real-world one and my conclusion matches yours. It is an Oracle bug but it's 99% useless to open a ticket at Oracle support for a 24550. My client 19c version is supported but my target 11gR2 database version is under "sustaining" support (combination 19c<=>11.2.0.4 WAS supported but it's not enough). Anyway even if my client-server combo was fully supported typical answer would be : you linked a program with our libraries and 24550 means YOU do have a bug , good luck. In this case, Oracle's answer would not be completely wrong : I really did something stupid to obtain that after all. Perhaps the oracle-fdw documentation can just mention that a wrong prefetch value can crash the PostgreSQL cluster ? We will have a huge RAM at Oracle side and ORA-4030 won't occur, it's our responsibility to test and to put an adequate value for prefetch, not the developer's.
The second problem is more problematic since I don't think something is really wrong in this case.
Best regards, Phil

laurenz commented 1 year ago

Ok, then I'll set the limit for prefetch to 1000.

I cannot reproduce the problem with count(1). It works as it should for me. Perhaps you can set client_min_messages = debug3 and see what debug messages you get when you run the test.

philflorent commented 1 year ago

It's noticed. 1000 it's huge, even 50 is enough with a decent latency. 1st problem solved. My test system is under maintenance, should be available tomorrow, I will post the result of an execution of the 2nd case with client_min_messages=debug3 asap .

philflorent commented 1 year ago

I reran the test with client_min_messages=debug3 but nothing new.

dh00=> set client_min_messages=debug3;
SET
dh00=*> select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
DEBUG:  rehashing catalog cache id 35 for pg_namespace; 9 tups, 4 buckets
LIGNE 1 : select count(1) from (select * from orcl_usr_dh00.t_clob lim...
                 ^
DEBUG:  oracle_fdw: plan foreign table scan
DEBUG:  oracle_fdw: set NLS_LANG=FRENCH_FRANCE.AL32UTF8
DEBUG:  oracle_fdw: set ORA_SDTZ=Europe/Paris
DEBUG:  oracle_fdw: begin remote transaction
DEBUG:  oracle_fdw: set savepoint s2
DEBUG:  oracle_fdw: remote query is: SELECT /*42ca762f9a79ea44*/ '1' FROM "USR_P                                                                                                             RODIGE31"."T_CLOB" r1
DEBUG:  oracle_fdw: begin foreign table scan on 33313
DEBUG:  oracle_fdw: execute query in foreign table scan
DEBUG:  oracle_fdw: get next row in foreign table scan
DEBUG:  oracle_fdw: end foreign table scan
 count
-------
     1
(1 ligne)

dh00=*> commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
erreur SYSCALL SSL : EOF détecté
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

Example of stack trace:

Trace file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_1378_140212869130816.trc DDE: Flood control is not active 2023-04-13T16:21:47.285367+02:00 Incident 169 created, dump file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_169/ora_1378_140212869130816_i169.trc oci-24550 [6] [[si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x6900000562]] [] [] [] [] [] [] [] [] [] []

Dump file /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_169/ora_1378_140212869130816_i169.trc [TOC00000] Jump to table of contents Dump continued from file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/trace/ora_1378_140212869130816.trc [TOC00001] oci-24550 [6] [[si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x6900000562]] [] [] [] [] [] [] [] [] [] [] [TOC00001-END] [TOC00002] ========= Dump for incident 169 (oci 24550 [6]) ======== Tracing is in restricted mode!

at 0x7ffd3b0441e8 placed dbge.c@1317 [TOC00003] ----- Short Call Stack Trace ----- dbgexPhaseII()+1869<-dbgexProcessError()+1871<-dbgePostErrorDirectVaList_int()+2100<-dbgePostErrorDirect()+798<-kpeDbgSignalHandler()+299<-skgesig_sigactionHandler()+258<-__sighandler()<-pthread_key_delete()+332<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0<-0x00007F85DA41F3E0[TOC00003-END] [TOC00004] ----- START Event Driven Actions Dump ---- ---- END Event Driven Actions Dump ---- [TOC00004-END] [TOC00005] ----- START DDE Actions Dump ----- Executing SYNC actions Executing ASYNC actions ----- END DDE Actions Dump (total 0 csec) ----- [TOC00005-END] End of Incident Dump [TOC00002-END] TOC00000 - Table of contents TOC00001 - Error Stack TOC00002 - Dump for incident 169 (oci 24550 [6]) | TOC00003 - Short Call Stack Trace | TOC00004 - START Event Driven Actions Dump | TOC00005 - START DDE Actions Dump End of TOC Example of postgres log in my last test: 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 INSTRUCTION : commit; 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 17:54:39.170 CEST [2513] postgres@pil00 PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-13 17:54:39.374 CEST [1927] LOG: processus serveur (PID 2513) a été arrêté par le signal 6 : Abandon 2023-04-13 17:54:39.374 CEST [1927] DÉTAIL: Le processus qui a échoué exécutait : commit; I changed the Oracle client (19c => 11.2.0.4) and I reproduced. I changed the OS & RDBMS version (Debian 12/PostgreSQL 15 => Debian 11/PostgreSQL 13) and I reproduced.

laurenz commented 1 year ago

I wrote a commit to debug this; it is on the branch https://github.com/laurenz/oracle_fdw/tree/debug_582

It should write lines to stderr (the log file) that look like this:

--> oracleGetSession, add new connEntry = 0x1ef68c0
--> oracleEndTransaction, connEntry = 0x1ef68c0
   --> oracleEndTransaction, cache contains connEntry = 0x1ef68c0
--> oracleEndTransaction, connEntry = 0x1ef68c0
   --> oracleEndTransaction, cache contains connEntry = 0x1ef68c0
--> closeSession, remove connEntry = 0x1ef68c0

Can you run your test with that and see what is in the log file?

philflorent commented 1 year ago

Yes.

--> oracleGetSession, add new connEntry = 0x56007c008950 --> oracleEndTransaction, connEntry = 0x56007c008950 2023-04-13 20:37:19.906 CEST [4273] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 20:37:19.906 CEST [4273] postgres@pil00 INSTRUCTION : commit; --> oracleEndTransaction, connEntry = 0x56007c008950 2023-04-13 20:37:19.906 CEST [4273] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x56007c008950 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x56007c008950 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x56007c008950 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-13 20:37:19.907 CEST [4273] postgres@pil00 PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-13 20:37:19.942 CEST [1927] LOG: processus serveur (PID 4273) a été arrêté par le signal 6 : Abandon 2023-04-13 20:37:19.942 CEST [1927] DÉTAIL: Le processus qui a échoué exécutait : commit; 2023-04-13 20:37:19.942 CEST [1927] LOG: arrêt des autres processus serveur actifs 2023-04-13 20:37:19.943 CEST [2740] postgres@traqueur ATTENTION: arrêt de la connexion à cause de l'arrêt brutal d'un autre processus serveur 2023-04-13 20:37:19.943 CEST [2740] postgres@traqueur DÉTAIL: Le postmaster a commandé à ce processus serveur d'annuler la transaction courante et de quitter car un autre processus serveur a quitté anormalement et qu'il existe probablement de la mémoire partagée corrompue. 2023-04-13 20:37:19.943 CEST [2740] postgres@traqueur ASTUCE : Dans un moment, vous devriez être capable de vous reconnecter à la base de données et de relancer votre commande. 2023-04-13 20:37:19.943 CEST [2740] postgres@traqueur CONTEXTE : instruction SQL « SELECT CASE WHEN clock_timestamp() - embedded_transaction_timestamp < INTERVAL '10000 millisecond' THEN (SELECT true FROM pg_sleep_for(INTERVAL ' 10000 millisecond'-(clock_timestamp() - embedded_transaction_timestamp))) ELSE false END » fonction PL/pgSQL pg_temp_3.traqueur_collection(), ligne 100 à PERFORM 2023-04-13 20:37:19.947 CEST [2582] ATTENTION: arrêt de la connexion à cause de l'arrêt brutal d'un autre processus serveur 2023-04-13 20:37:19.947 CEST [2582] DÉTAIL: Le postmaster a commandé à ce processus serveur d'annuler la transaction courante et de quitter car un autre processus serveur a quitté anormalement et qu'il existe probablement de la mémoire partagée corrompue. 2023-04-13 20:37:19.947 CEST [2582] ASTUCE : Dans un moment, vous devriez être capable de vous reconnecter à la base de données et de relancer votre commande. 2023-04-13 20:37:19.959 CEST [1927] LOG: tous les processus serveur sont arrêtés ; réinitialisation 2023-04-13 20:37:20.783 CEST [4291] LOG: le système de bases de données a été interrompu ; dernier lancement connu à 2023-04-13 20:34:43 CEST 2023-04-13 20:37:20.784 CEST [4292] postgres@traqueur FATAL: le système de bases de données est en cours de restauration 2023-04-13 20:37:20.844 CEST [4291] LOG: le système de bases de données n'a pas été arrêté proprement ; restauration automatique en cours 2023-04-13 20:37:20.849 CEST [4291] LOG: la ré-exécution commence à 0/215F800 2023-04-13 20:37:20.850 CEST [4291] LOG: longueur invalide de l'enregistrement à 0/2174158 : voulait 24, a eu 0 2023-04-13 20:37:20.850 CEST [4291] LOG: ré-exécution faite à 0/2174130 2023-04-13 20:37:20.880 CEST [4304] postgres@traqueur FATAL: le système de bases de données est en cours de restauration 2023-04-13 20:37:20.885 CEST [1927] LOG: le système de bases de données est prêt pour accepter les connexions

laurenz commented 1 year ago

Strange, there is no connection entry in the cache.

I have added a commit that writes some more output. Could you pull the latest version from debug_582 and try again? Sorry if that is cumbersome. It is difficult to debug a problem that you cannot reproduce.

philflorent commented 1 year ago

No problem here is the new trace:

--> oracleGetSession, add new envEntry = 0x55ba730111f0 --> oracleGetSession, add new srvEntry = 0x55ba73079770 --> oracleGetSession, add new connEntry = 0x55ba730799e0 --> oracleEndTransaction, connEntry = 0x55ba730799e0 2023-04-14 10:58:47.270 CEST [6605] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 10:58:47.270 CEST [6605] postgres@dh00 INSTRUCTION : commit; --> oracleEndTransaction, connEntry = 0x55ba730799e0 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba730799e0 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba730799e0 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba730799e0 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 10:58:47.271 CEST [6605] postgres@dh00 PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-14 10:58:47.307 CEST [874] LOG: processus serveur (PID 6605) a été arrêté par le signal 6 : Abandon 2023-04-14 10:58:47.307 CEST [874] DÉTAIL: Le processus qui a échoué exécutait : commit; 2023-04-14 10:58:47.307 CEST [874] LOG: arrêt des autres processus serveur actifs 2023-04-14 10:58:47.338 CEST [874] LOG: tous les processus serveur sont arrêtés ; réinitialisation 2023-04-14 10:58:48.253 CEST [6612] LOG: le système de bases de données a été interrompu ; dernier lancement connu à 2023-04-14 10:55:12 CEST 2023-04-14 10:58:48.255 CEST [6615] postgres@traqueur FATAL: le système de bases de données est en cours de restauration

philflorent commented 1 year ago

What is your test system ? Postgres & Oracle on the same server with a 18c XE accessed from oracle_fdw via beq protocol ? I changed the Oracle client and the OS at postgres side but if you want I could try to access a 19c database to see if problem could be 11.2.0.4 at orcl side.

laurenz commented 1 year ago

My setup

Fedora Linux 37
PostgreSQL 16devel and PostgreSQL 15.2
Oracle Database 18c Express Edition Release 18.0.0.0.0 - Production

I have tried it with bequeath protocol and via the listener, although that should not make a difference at all.

Your messages show that the static variable

static struct envEntry *envlist = NULL;

from oracle_utils.c is set to 0x55ba730111f0, never modified (the only place in the code where that happens is removeEnvironment, and there is no such log entry), but later in oracleEndTransaction it is NULL again.

The only explanation would be a bug in my code that writes over stack allocated memory where it shouldn't. I ran my code with valgrind, but saw nothing suspicious besides the thousands of reports that Oracle always produces.

I am getting quite desperate with this. I have pushed another commit that make the code scream as soon as envlist becomes NULL unexpectedly. Could you give it another try?

philflorent commented 1 year ago

Yes.

--> checkerr: envlist is NULL --> checkerr: envlist is NULL --> oracleGetSession, add new envEntry = 0x55ba73011910 --> oracleGetSession, add new srvEntry = 0x55ba73079e90 --> oracleGetSession, add new connEntry = 0x55ba7307a100 --> checkerr: envlist is NULL --> checkerr: envlist is NULL --> oracleFetchNext: envlist is NULL --> oracleCloseStatement: envlist is NULL --> freeHandle: envlist is NULL, handlep = 0x55ba730647f8 --> oracleEndSubtransaction: envlist is NULL --> oracleEndSubtransaction: envlist is NULL --> oracleEndTransaction, connEntry = 0x55ba7307a100 2023-04-14 17:00:53.722 CEST [8941] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:00:53.722 CEST [8941] postgres@dh00 INSTRUCTION : commit; --> oracleEndTransaction, connEntry = 0x55ba7307a100 2023-04-14 17:00:53.722 CEST [8941] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba7307a100 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba7307a100 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55ba7307a100 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:00:53.723 CEST [8941] postgres@dh00 PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-14 17:00:53.756 CEST [874] LOG: processus serveur (PID 8941) a été arrêté par le signal 6 : Abandon 2023-04-14 17:00:53.756 CEST [874] DÉTAIL: Le processus qui a échoué exécutait : commit;

laurenz commented 1 year ago

That looks like the problem happens in oracleExecuteQuery.

I have added some more messages, can you try again?

philflorent commented 1 year ago

Yes asap. But notice something new. I built a 19c database with the very same table and I cannot reproduce.

dh00=# select count(1) from (select * from orcl2_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

dh00=*# commit;
COMMIT
dh00=# select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

dh00=*# commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

philflorent commented 1 year ago

KO execution 11.2:

--> checkerr: envlist is NULL --> checkerr: envlist is NULL --> oracleGetSession, add new envEntry = 0x55570cd840f0 --> oracleGetSession, add new srvEntry = 0x55570cdd3730 --> oracleGetSession, add new connEntry = 0x55570cdd39a0 --> oracleExecuteQuery: envlist is not NULL in the beginning --> checkerr: envlist is NULL --> oracleExecuteQuery: envlist is NULL after OCIStmtExecute --> checkerr: envlist is NULL --> oracleExecuteQuery: envlist is NULL at the end --> oracleFetchNext: envlist is NULL --> oracleCloseStatement: envlist is NULL --> freeHandle: envlist is NULL, handlep = 0x55570cdbdc98 --> oracleEndSubtransaction: envlist is NULL --> oracleEndSubtransaction: envlist is NULL --> oracleEndTransaction, connEntry = 0x55570cdd39a0 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 INSTRUCTION : commit; --> oracleEndTransaction, connEntry = 0x55570cdd39a0 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55570cdd39a0 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55570cdd39a0 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ATTENTION: AbortTransaction while in ABORT state --> oracleEndTransaction, connEntry = 0x55570cdd39a0 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-14 17:56:32.338 CEST [2152] postgres@dh00 PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-14 17:56:32.344 CEST [1503] LOG: processus serveur (PID 2152) a été arrêté par le signal 6 : Abandon 2023-04-14 17:56:32.344 CEST [1503] DÉTAIL: Le processus qui a échoué exécutait : commit;

philflorent commented 1 year ago

OK execution 19c :

--> checkerr: envlist is NULL --> checkerr: envlist is NULL --> oracleGetSession, add new envEntry = 0x55570cd84870 --> oracleGetSession, add new srvEntry = 0x55570cdd40a0 --> oracleGetSession, add new connEntry = 0x55570cdd41a0 --> oracleExecuteQuery: envlist is not NULL in the beginning --> oracleEndTransaction, connEntry = 0x55570cdd41a0 --> oracleEndTransaction, envlist has envEntry = 0x55570cd84870 --> oracleEndTransaction, envEntry has srvEntry = 0x55570cdd40a0 --> oracleEndTransaction, srvEntry has connEntry = 0x55570cdd41a0 --> oracleEndTransaction, connEntry = 0x55570cdd41a0 --> oracleEndTransaction, envlist has envEntry = 0x55570cd84870 --> oracleEndTransaction, envEntry has srvEntry = 0x55570cdd40a0 --> oracleEndTransaction, srvEntry has connEntry = 0x55570cdd41a0 --> closeSession, remove connEntry = 0x55570cdd41a0 --> disconnectServer, remove srvEntry = 0x55570cdd40a0 --> removeEnvironment, remove envEntry = 0x55570cd84870

philflorent commented 1 year ago

Coud be OCIStmtRelease that crashes ? Old stuff but there was something "new" in 11.2 about that. Code "allocate transaction handle", "start transaction", "do some work", "commit and free transaction handle", "release statement" is KO in 11.2 . OK code is "allocate transaction handle", "start transaction", "do some work", "commit and free transaction handle", "+++clear invalid transaction handle with occiattrset+++" then release statement

laurenz commented 1 year ago

No, the latest debug output shows that the stack is smashed after OCIStmtExecute.

My last straw is that valgrind shows something in your crash case. Stop PostgreSQL and start it in single user mode with

valgrind --log-file=valgrind.log --error-limit=no --leak-check=no \
  --track-origins=yes --free-fill=AB --malloc-fill=CD \
  /path/to/bin/postgres --single -D /path/to/datadir dbname

Then enter the commands (dbuser is your database user):

set session authorization dbuser
begin
select count(1) from (select * from orcl2_usr_dh00.t_clob limit 1) r
commit

End the session with CTRL+D.

I'd like to see the resulting valgrind.log. It will be too big to post, please send it by e-mail.

philflorent commented 1 year ago

Done (test with db & client orcl 11.2), sent valgrind.log by mail.

philflorent commented 1 year ago

Reran the crash case with a 19c client, also sent the result by mail.

laurenz commented 1 year ago

Thanks. I also tried with Instantclient 19.18, and it worked fine here :^/

I couldn't really nail down where the bug is from your Oracle 11 valgrind run, but the messages start looking odd early on, at latest here:

==3561== Conditional jump or move depends on uninitialised value(s)
==3561==    at 0x12ADC820: checkerr (oracle_utils.c:2939)
==3561==    by 0x12ADEB64: oracleDescribe (oracle_utils.c:956)
==3561==    by 0x12AD3460: getFdwState (oracle_fdw.c:2783)
==3561==    by 0x12AD8E4B: oracleGetForeignRelSize (oracle_fdw.c:819)
==3561==    by 0x485E7F: set_foreign_size (allpaths.c:912)
==3561==    by 0x485E7F: set_rel_size (allpaths.c:396)
==3561==    by 0x488048: set_base_rel_sizes (allpaths.c:326)
==3561==    by 0x488048: make_one_rel (allpaths.c:188)
==3561==    by 0x4AB772: query_planner (planmain.c:276)
==3561==    by 0x4B0CCB: grouping_planner (planner.c:1470)
==3561==    by 0x4B396F: subquery_planner (planner.c:1047)
==3561==    by 0x4855F4: set_subquery_pathlist (allpaths.c:2601)
==3561==    by 0x4855F4: set_rel_size (allpaths.c:425)
==3561==    by 0x488048: set_base_rel_sizes (allpaths.c:326)
==3561==    by 0x488048: make_one_rel (allpaths.c:188)
==3561==    by 0x4AB772: query_planner (planmain.c:276)
==3561==  Uninitialised value was created by a stack allocation
==3561==    at 0x131477AA: ztcrsgstk (in /opt/oracle/instantclient/libclntsh.so.11.1)

So the bug must be pretty early on. Perhaps the problem was really introduced by e8a3317e6c4f7e4947123e59f2b346b57bf945a3. One thing you could do is to check out the code at this commit and try to reproduce the problem. If that fails too, try with the previous commit 65073da6fcad09f248d1cbdeb32beee6c9f09032.

Your valgrind run with the Oracle 19 client crashes at a different place, while or immediately after it connects to Oracle. Can you confirm that? What does the log say?

philflorent commented 1 year ago

"Your valgrind run with the Oracle 19 client crashes at a different place, while or immediately after it connects to Oracle. Can you confirm that? What does the log say?" From the user point of view, it crashes at commit, client version makes no difference. Postgres log is exactly the same too.

_Perhaps the problem was really introduced by https://github.com/laurenz/oracle_fdw/commit/e8a3317e6c4f7e4947123e59f2b346b57bf945a3. One thing you could do is to check out the code at this commit and try to reproduce the problem. If that fails too, try with the previous commit https://github.com/laurenz/oracle_fdw/commit/65073da6fcad09f248d1cbdeb32beee6c9f09032._ I will use git bisect for that asap.

philflorent commented 1 year ago

Complicated, it's the "big" commit.

d679a73d02057fd674fc78fe5f25183e5c9a172b is the first bad commit commit d679a73d02057fd674fc78fe5f25183e5c9a172b Author: Laurenz Albe laurenz.albe@cybertec.at Date: Mon Mar 13 06:16:10 2023 +0100

Use the array interface instead of OCI prefetching

philflorent commented 1 year ago

I noticed a simple workaround to avoid the crash with the 11.2 db is to set prefetch=1 (no matter what lob_prefetch is)

postgres=# alter foreign table orcl_usr_dh00.t_clob  options(set prefetch '1');
ALTER FOREIGN TABLE
postgres=*# commit;
COMMIT
postgres=# select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

postgres=*# commit;
COMMIT
postgres=# alter foreign table orcl_usr_dh00.t_clob  options(set prefetch '2');
ALTER FOREIGN TABLE
postgres=*# commit;
COMMIT
postgres=# select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

postgres=*# commit;
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

Of course it's not a "good" workaround since the goal is to prefetch but perhaps it can give a hint.

laurenz commented 1 year ago

Thanks for your diligent investigation.

I have reason to believe that I may have found the root of the problem. Could you try pulling commit 6fe628d2bd66954a27b6c459e69df9865bdf690c and see if that fixes the problem?

philflorent commented 1 year ago

Hi Laurenz,

Still the same issue with 11.2 (still not reproductible with 19c)

postgres=# \d+ orcl_usr_dh00.t_clob
                                                           Table distante « orcl_usr_dh00.t_clob »
     Colonne      |              Type              | Collationnement | NULL-able | Par défaut | Options FDW  | Stockage | Cible de statistiques | Description
------------------+--------------------------------+-----------------+-----------+------------+--------------+----------+-----------------------+-------------
 c1               | numeric                        |                 | not null  |            | (key 'true') | main     |                       |
 val1             | text                           |                 |           |            |              | extended |                       |
 c2_date_creation | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 c3_date_maj      | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 val2             | text                           |                 |           |            |              | extended |                       |
Serveur : orcl_usr_dh00
Options FDW : (prefetch '2', schema 'USR_PRODIGE31', "table" 'T_CLOB', lob_prefetch '32768')

postgres=# select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

postgres=*# commit;
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

2023-04-17 08:38:59.385 CEST [33932] postgres@postgres ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-17 08:38:59.385 CEST [33932] postgres@postgres INSTRUCTION : commit; 2023-04-17 08:38:59.385 CEST [33932] postgres@postgres ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-17 08:38:59.385 CEST [33932] postgres@postgres ATTENTION: AbortTransaction while in ABORT state 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres ATTENTION: AbortTransaction while in ABORT state 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres ATTENTION: AbortTransaction while in ABORT state 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres ERREUR: oracleEndTransaction internal error: handle not found in cache 2023-04-17 08:38:59.386 CEST [33932] postgres@postgres PANIC: ERRORDATA_STACK_SIZE exceeded 2023-04-17 08:38:59.393 CEST [33796] LOG: processus serveur (PID 33932) a été arrêté par le signal 6 : Abandon 2023-04-17 08:38:59.393 CEST [33796] DÉTAIL: Le processus qui a échoué exécutait : commit;

Stack trace of the 24550:

DDE: Flood control is not active 2023-04-17T08:53:29.470461+02:00 Incident 265 created, dump file: /var/tmp/oradiag_postgres/diag/clients/user_postgres/host_1559512114_110/incident/incdir_265/ora_34417_140646438574656_i265.trc oci-24550 [6] [[si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x6900008671]] [] [] [] [] [] [] [] [] [] []

Starting a Diag Context default dump (level=3)

at 0x7ffded5e03e8 placed dbge.c@1317 ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- dbgc_dmp()+127 call kgdsdst() 7FFDED5DE8E0 000000003 7FFDED5D8C20 ? 7FFDED5D8D38 ? 000000000 000000083 ? dbgexPhaseII()+2092 call dbgc_dmp() 7FEAC027DF10 7FEAC027E1E0 7FFDED5D8C20 ? 7FFDED5D8D38 ? 000000000 ? 000000083 ? dbgexProcessError() call dbgexPhaseII() 557630257310 55763027DE30 +1871 7FFDED5DFB01 7FFDED5D8D38 ? 000000000 ? 000000083 ? dbgePostErrorDirect call dbgexProcessError() 557630257310 55763027DE30 VaList_int()+2100 000000000 000000000 000000000 ? 000000083 ? dbgePostErrorDirect call dbgePostErrorDirect 557630257310 7FEAC0D43A28 ()+798 VaList_int() 000005FE6 000000001 000000002 7FFDED5E0878 kpeDbgSignalHandler call dbgePostErrorDirect 557630257310 ? 7FEAC0D43A28 ? ()+299 () 000005FE6 ? 000000001 ? 000000002 ? 000000000 skgesig_sigactionHa call kpeDbgSignalHandler 000000001 557630257310 ndler()+258 () 000005FE6 ? 000000001 ? 000000002 ? 000000000 ? __sighandler() call skgesig_sigactionHa 000000020 7FFDED5E1FF0 ndler() 7FFDED5E1EC0 000000001 ? 000000002 ? 000000000 ? pthread_key_delete( signal __sighandler() 000008671 000008671 000000006 )+332 7FEACD6ABCCC ? 000000000 ? 000000000 ? 00007FFDED5E253C call pthread_key_delete( 000008671 ? 000008671 ? ) 000000006 ? 7FEACD6ABCCC ? 000000000 ? 000000000 ?

philflorent commented 1 year ago

Strace OK :

sendto(8, "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"..., 63, 0, NULL, 0) = 63 recvfrom(8, "Q\0\0\0(RELEASE pg_psql_temporary_s"..., 8192, 0, NULL, NULL) = 41 sendto(8, "C\0\0\0\fRELEASE\0Z\0\0\0\5T", 19, 0, NULL, 0) = 19 recvfrom(8, 0x5583d0fce380, 8192, 0, NULL, NULL) = -1 EAGAIN (Ressource temporairement non disponible) epoll_wait(4, [{events=EPOLLIN, data={u32=3512658168, u64=94024641734904}}], 1, -1) = 1 recvfrom(8, "Q\0\0\0*SAVEPOINT pg_psql_temporary"..., 8192, 0, NULL, NULL) = 43 sendto(8, "C\0\0\0\16SAVEPOINT\0Z\0\0\0\5T", 21, 0, NULL, 0) = 21 recvfrom(8, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13 times(NULL) = 1743273544 lseek(11, 0, SEEK_CUR) = 150613 lseek(11, 0, SEEK_CUR) = 150613 write(11, "[17-APR-2023 15:26:51:320] nioqs"..., 41) = 41 lseek(11, 0, SEEK_CUR) = 150654 lseek(11, 0, SEEK_CUR) = 150654 write(11, "[17-APR-2023 15:26:51:321] nioqs"..., 40) = 40 lseek(11, 0, SEEK_CUR) = 150694 lseek(11, 0, SEEK_CUR) = 150694 write(11, "[17-APR-2023 15:26:51:322] nioqr"..., 41) = 41 lseek(11, 0, SEEK_CUR) = 150735 lseek(11, 0, SEEK_CUR) = 150735 write(11, "[17-APR-2023 15:26:51:322] nsbas"..., 46) = 46 lseek(11, 0, SEEK_CUR) = 150781 lseek(11, 0, SEEK_CUR) = 150781 write(11, "[17-APR-2023 15:26:51:323] nsbas"..., 56) = 56 lseek(11, 0, SEEK_CUR) = 150837 lseek(11, 0, SEEK_CUR) = 150837 write(11, "[17-APR-2023 15:26:51:324] nttfp"..., 42) = 42 write(12, "\0\0\0$\6\0\0\0\0\0\21i\r\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\1\0\0"..., 36) = 36 lseek(11, 0, SEEK_CUR) = 150879 lseek(11, 0, SEEK_CUR) = 150879 write(11, "[17-APR-2023 15:26:51:324] nttfp"..., 67) = 67 lseek(11, 0, SEEK_CUR) = 150946 lseek(11, 0, SEEK_CUR) = 150946 write(11, "[17-APR-2023 15:26:51:325] nttfp"..., 41) = 41 lseek(11, 0, SEEK_CUR) = 150987 lseek(11, 0, SEEK_CUR) = 150987 write(11, "[17-APR-2023 15:26:51:326] nsbas"..., 52) = 52 lseek(11, 0, SEEK_CUR) = 151039 lseek(11, 0, SEEK_CUR) = 151039 lseek(11, 0, SEEK_CUR) = 151039 lseek(11, 0, SEEK_CUR) = 151039 lseek(11, 0, SEEK_CUR) = 151039 write(11, "[17-APR-2023 15:26:51:326] nsbas"..., 380) = 380 lseek(11, 0, SEEK_CUR) = 151419 lseek(11, 0, SEEK_CUR) = 151419 write(11, "[17-APR-2023 15:26:51:328] nsbas"..., 49) = 49 lseek(11, 0, SEEK_CUR) = 151468 lseek(11, 0, SEEK_CUR) = 151468 write(11, "[17-APR-2023 15:26:51:328] nsbas"..., 63) = 63 lseek(11, 0, SEEK_CUR) = 151531 lseek(11, 0, SEEK_CUR) = 151531 write(11, "[17-APR-2023 15:26:51:329] nttfp"..., 42) = 42 read(12, "\0\0\0\21\6 \0\0\0\0\t\5\0\0\0h\0", 8208) = 17 lseek(11, 0, SEEK_CUR) = 151573 lseek(11, 0, SEEK_CUR) = 151573 write(11, "[17-APR-2023 15:26:51:330] nttfp"..., 64) = 64 lseek(11, 0, SEEK_CUR) = 151637 lseek(11, 0, SEEK_CUR) = 151637 write(11, "[17-APR-2023 15:26:51:330] nttfp"..., 41) = 41 lseek(11, 0, SEEK_CUR) = 151678 lseek(11, 0, SEEK_CUR) = 151678 write(11, "[17-APR-2023 15:26:51:331] nsbas"..., 56) = 56 lseek(11, 0, SEEK_CUR) = 151734 lseek(11, 0, SEEK_CUR) = 151734 write(11, "[17-APR-2023 15:26:51:332] nsbas"..., 56) = 56 lseek(11, 0, SEEK_CUR) = 151790 lseek(11, 0, SEEK_CUR) = 151790 write(11, "[17-APR-2023 15:26:51:332] nsbas"..., 52) = 52 lseek(11, 0, SEEK_CUR) = 151842 lseek(11, 0, SEEK_CUR) = 151842 lseek(11, 0, SEEK_CUR) = 151842 write(11, "[17-APR-2023 15:26:51:333] nsbas"..., 228) = 228 lseek(11, 0, SEEK_CUR) = 152070 lseek(11, 0, SEEK_CUR) = 152070 write(11, "[17-APR-2023 15:26:51:334] nsbas"..., 73) = 73 lseek(11, 0, SEEK_CUR) = 152143 lseek(11, 0, SEEK_CUR) = 152143 write(11, "[17-APR-2023 15:26:51:334] nioqr"..., 40) = 40 sendto(8, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18 recvfrom(8, 0x5583d0fce380, 8192, 0, NULL, NULL) = -1 EAGAIN (Ressource temporairement non disponible) epoll_wait(4,

Strace KO :

lseek(41, 0, SEEK_CUR) = 135096 write(41, "[17-APR-2023 14:17:42:830] nsbas"..., 3572) = 3572 lseek(41, 0, SEEK_CUR) = 138668 lseek(41, 0, SEEK_CUR) = 138668 write(41, "[17-APR-2023 14:17:42:840] nsbas"..., 76) = 76 lseek(41, 0, SEEK_CUR) = 138744 lseek(41, 0, SEEK_CUR) = 138744 write(41, "[17-APR-2023 14:17:42:840] nioqr"..., 40) = 40 sendto(8, "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"..., 63, 0, NULL, 0) = 63 recvfrom(8, "Q\0\0\0(RELEASE pg_psql_temporary_s"..., 8192, 0, NULL, NULL) = 41 sendto(8, "C\0\0\0\fRELEASE\0Z\0\0\0\5T", 19, 0, NULL, 0) = 19 recvfrom(8, 0x5583d0fce380, 8192, 0, NULL, NULL) = -1 EAGAIN (Ressource temporairement non disponible) epoll_wait(4, 0x5583d15ee940, 1, -1) = -1 EINTR (Appel système interrompu) --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- kill(37324, SIGURG) = 0 rt_sigreturn({mask=[URG]}) = -1 EINTR (Appel système interrompu) recvfrom(8, 0x5583d0fce380, 8192, 0, NULL, NULL) = -1 EAGAIN (Ressource temporairement non disponible) epoll_wait(4, [{events=EPOLLIN, data={u32=3512658192, u64=94024641734928}}], 1, -1) = 1 read(9, "\27\0\0\0\0\0\0\0\0\0\0\0\314\221\0\0i\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 epoll_wait(4, [{events=EPOLLIN, data={u32=3512658168, u64=94024641734904}}], 1, -1) = 1 recvfrom(8, "Q\0\0\0*SAVEPOINT pg_psql_temporary"..., 8192, 0, NULL, NULL) = 43 sendto(8, "C\0\0\0\16SAVEPOINT\0Z\0\0\0\5T", 21, 0, NULL, 0) = 21 recvfrom(8, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.913 CEST [37"..., 205) = 205 sendto(8, "E\0\0\0|SERREUR\0VERROR\0CHV000\0Morac"..., 125, 0, NULL, 0) = 125 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.914 CEST [37"..., 127) = 127 sendto(8, "E\0\0\0|SERREUR\0VERROR\0CHV000\0Morac"..., 125, 0, NULL, 0) = 125 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.915 CEST [37"..., 105) = 105 sendto(8, "N\0\0\0gSATTENTION\0VWARNING\0C01000\0"..., 104, 0, NULL, 0) = 104 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.916 CEST [37"..., 127) = 127 sendto(8, "E\0\0\0|SERREUR\0VERROR\0CHV000\0Morac"..., 125, 0, NULL, 0) = 125 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.917 CEST [37"..., 105) = 105 sendto(8, "N\0\0\0gSATTENTION\0VWARNING\0C01000\0"..., 104, 0, NULL, 0) = 104 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.918 CEST [37"..., 127) = 127 sendto(8, "E\0\0\0|SERREUR\0VERROR\0CHV000\0Morac"..., 125, 0, NULL, 0) = 125 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.919 CEST [37"..., 105) = 105 sendto(8, "N\0\0\0gSATTENTION\0VWARNING\0C01000\0"..., 104, 0, NULL, 0) = 104 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.919 CEST [37"..., 127) = 127 sendto(8, "E\0\0\0|SERREUR\0VERROR\0CHV000\0Morac"..., 125, 0, NULL, 0) = 125 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 write(2, "2023-04-17 14:18:47.920 CEST [37"..., 93) = 93 sendto(8, "E\0\0\0PSPANIC\0VPANIC\0CXX000\0MERROR"..., 81, 0, NULL, 0) = 81 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 gettid() = 37324 getpid() = 37324 tgkill(37324, 37324, SIGABRT) = 0 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=37324, si_uid=105} --- +++ killed by SIGABRT +++

laurenz commented 1 year ago

Didn't you write further up that you also get crashes with Oracle client 19? If the problem happens only with one special version or Oracle, it might be an Oracle problem. Otherwise, it is almost certainly a problem in my code.

I just tried with Oracle Instant Client 11, and could not reproduce the problem.

I am afraid I'll have to give up on this. Memory problems on the stack are hard enough to debug if you can reproduce them. I just don't know what else to do.

Thanks for all the great help you provided so far.

philflorent commented 1 year ago

Laurenz, Instant Client 11.2.0.4 & Oracle Database 11.2.0.4 => KO Instant Client 19c & Oracle Database 11.2.0.4 => KO Instant Client 11.2.0.4 & Oracle Database 19c => OK Instant Client 19c & Oracle Database 19c => OK

=> Issue only occurs with Oracle 11.2 at DB side, no matter the client version. I know 11g can be a bit special in terms of OCI developing. I remembered I had problems with OCI code that worked OK with10.2 and failed with 11.2 I had hopes about deactivation of DDE features but no luck. I tried to patch Oracle with the latest patchset update I have, tried to deactivate the diagnosibility features, to deactivate encryption but everything failed.

I can install a 11gR2 XE (it is in fact a 11.2.0.2) since I know you have only access to XE editions. Notice that in this case 19c client cannot be used and a failure would be expected (19c requires at least 11.2.0.3 to communicate). 23c was also mentioned and I will test asap. If I had problems you can download & use it freely as a developer but it's less important since this version is "fresh", 11.2 & 19c are more important. Best regards, Phil

laurenz commented 1 year ago

I looked for Oracle 11g Express Edition, but it does not seem to be available for download.

philflorent commented 1 year ago

Will provide a link asap, it's not a problem with XE.

philflorent commented 1 year ago

Good news. I can reproduce with 11.2.0.2 XE.

Here t_clob is a 2 lines versions of the t_clob table with:

insert into t_clob values(1, dbms_random.string('L', 2000), sysdate-((100000-1)/86400), sysdate-((100000-1)/86400) + 1/1440, dbms_random.string('L', 6000));
insert into t_clob  values(2, dbms_random.string('L', 2000), sysdate-((100000-2)/86400), sysdate-((100000-2)/86400) + 1/1440, dbms_random.string('L', 6000));
commit;

(useless to run the recursive query and it will probably fail with XE)

postgres=*# select oracle_diag('orcl3_usr_dh00');
                                                     oracle_diag
----------------------------------------------------------------------------------------------------------------------
 oracle_fdw 2.6.0devel, PostgreSQL 15.2 (Debian 15.2-2.pgdg120+1), Oracle client 11.2.0.4.0, Oracle server 11.2.0.2.0
(1 ligne)

postgres=*# \d+ orcl3_usr_dh00.t_clob
                                                           Table distante « orcl3_usr_dh00.t_clob »
     Colonne      |              Type              | Collationnement | NULL-able | Par défaut | Options FDW  | Stockage | Cible de statistiques | Description
------------------+--------------------------------+-----------------+-----------+------------+--------------+----------+-----------------------+-------------
 c1               | numeric                        |                 | not null  |            | (key 'true') | main     |                       |
 val1             | text                           |                 |           |            |              | extended |                       |
 c2_date_creation | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 c3_date_maj      | timestamp(0) without time zone |                 |           |            |              | plain    |                       |
 val2             | text                           |                 |           |            |              | extended |                       |
Serveur : orcl3_usr_dh00
Options FDW : (schema 'USR_PRODIGE', "table" 'T_CLOB', prefetch '2')

postgres=*# select count(1) from (select * from orcl3_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

postgres=*# commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

philflorent commented 1 year ago

I did not run a complete git bisect but:

git checkout e8a3317
make
make install
pg_ctlcluster dh_p restart

postgres=# select count(1) from (select * from orcl3_usr_dh00.t_clob limit 1) r;
 count
-------
     1
(1 ligne)

postgres=*# commit;
COMMIT

=> 11.2.0.2 XE is as good as 11.2.0.4 EE to reproduce the issue if 11.2.0.4 client is used. OK & KO happen in the same cases.

laurenz commented 1 year ago

Thanks again for all the help! I'll see what I can do.

laurenz commented 1 year ago

I have run the transaction with Oracle 11.2 XE, and I got no crash.

There is one last thing I can and will do: go through d679a73d02057fd674fc78fe5f25183e5c9a172b line by line and see if I can spot anything that can cause a stack overrun.

philflorent commented 1 year ago

Hi Laurenz, It's noticed, will test immediately what you provide. Don't hesitate to make me test even if you think change has a 0.1% percent chance to work. It's easy to spot the line that causes the final crash but it's more difficult to spot the real cause. I ran a "stupid" test by commenting a few lines in oracleendtransaction and of course there is no crash but not freeing the handles is not a solution. The real cause, if the cause is in the code which is unsure, is earlier in the execution.
Currently: 11.2.0.4=>11.2..0..2 crash 11.2.0.4=>11.2..0..4 crash 11.2.0.4=>11.2..0..4 + latest PSU crash 19c =>11.2..0..4 crash 19c =>11.2..0..4 + latest PSU crash 11.2.0.4 => 19c no crash 19c => 19c no crash

Tested from 2 différent postgres env (Debian stable - Debian 11/PostgreSQL 13 - Debian sid Debian 12 / PostgreSQL 15)

If I revert to the pre-arrayprefetch commit, no crash with any postgres + orcl client-server combo.

Best regards, Phil

philflorent commented 1 year ago

I am puzzled by the nature of the query that crashes. A query that involves a similar plan does not crash:

postgres=# explain analyze select avg(c1) from (select * from orcl3_usr_dh00.t_clob limit 1) r;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.02..10010.03 rows=1 width=32) (actual time=3.421..3.423 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=85) (actual time=3.397..3.398 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..10040.00 rows=4 width=85) (actual time=3.394..3.395 rows=1 loops=1)
               Oracle query: SELECT /*f791544c1776485d*/ r1."C1" FROM "USR_PRODIGE"."T_CLOB" r1
 Planning Time: 74.759 ms
 Execution Time: 3.595 ms
(6 lignes)

postgres=*# commit;
COMMIT

postgres=# explain analyze select count(1) from (select * from orcl3_usr_dh00.t_clob limit 1) r;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.01..10010.02 rows=1 width=8) (actual time=0.807..0.808 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=112) (actual time=0.799..0.800 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..10040.00 rows=4 width=112) (actual time=0.797..0.797 rows=1 loops=1)
               Oracle query: SELECT /*6119503ffd9c8ead*/ '1' FROM "USR_PRODIGE"."T_CLOB" r1
 Planning Time: 73.444 ms
 Execution Time: 0.976 ms
(6 lignes)

postgres=*# commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache

postgres=# explain analyze select count(*) from orcl3_usr_dh00.t_clob;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10040.01..10040.02 rows=1 width=8) (actual time=2.869..2.870 rows=1 loops=1)
   ->  Foreign Scan on t_clob  (cost=10000.00..10040.00 rows=4 width=0) (actual time=0.749..2.860 rows=4 loops=1)
         Oracle query: SELECT /*6119503ffd9c8ead*/ '1' FROM "USR_PRODIGE"."T_CLOB" r1
 Planning Time: 78.781 ms
 Execution Time: 3.111 ms
(5 lignes)

postgres=*# commit;
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

In fact it's only the count(1) that crashes. I cannot reproduce the issue with any other kind of query.

philflorent commented 1 year ago

It's even more specific than that. A count(*) or count1) crash but not a count(c1).

postgres=# explain analyze select count(c1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.01..10010.02 rows=1 width=8) (actual time=0.811..0.813 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=112) (actual time=0.802..0.803 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=0.800..0.801 rows=1 loops=1)
               Oracle query: SELECT /*b67eb06688b838f9*/ r1."C1" FROM "USR_PRODIGE31"."T_CLOB" r1
 Planning Time: 69.787 ms
 Execution Time: 0.947 ms
(6 lignes)

postgres=*# commit;
COMMIT
postgres=# explain analyze select count(*) from (select * from orcl_usr_dh00.t_clob limit 1) r;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.01..10010.02 rows=1 width=8) (actual time=2.915..2.916 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=112) (actual time=2.908..2.910 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=2.907..2.908 rows=1 loops=1)
               Oracle query: SELECT /*42ca762f9a79ea44*/ '1' FROM "USR_PRODIGE31"."T_CLOB" r1
 Planning Time: 2.017 ms
 Execution Time: 2.980 ms
(6 lignes)

postgres=*# commit;
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

postgres=# explain analyze select count(1) from (select * from orcl_usr_dh00.t_clob limit 1) r;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.01..10010.02 rows=1 width=8) (actual time=0.782..0.783 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=112) (actual time=0.774..0.775 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=0.773..0.773 rows=1 loops=1)
               Oracle query: SELECT /*42ca762f9a79ea44*/ '1' FROM "USR_PRODIGE31"."T_CLOB" r1
 Planning Time: 1.793 ms
 Execution Time: 0.849 ms
(6 lignes)

postgres=*# commit;
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ATTENTION:  AbortTransaction while in ABORT state
ERREUR:  oracleEndTransaction internal error: handle not found in cache
ERREUR:  oracleEndTransaction internal error: handle not found in cache
PANIC:  ERRORDATA_STACK_SIZE exceeded
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

philflorent commented 1 year ago

With that in mind I ran other tests. In fact I can reproduce with 19c under certain conditions.

postgres=*# select extensions.oracle_diag('orcl2_usr_dh00');
                                                     oracle_diag
----------------------------------------------------------------------------------------------------------------------
 oracle_fdw 2.6.0devel, PostgreSQL 15.2 (Debian 15.2-2.pgdg120+1), Oracle client 11.2.0.4.0, Oracle server 19.0.0.0.0
(1 ligne)

postgres=# explain analyze select count(c1)  from orcl2_usr_dh00.t_clob ;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20002.50..20002.51 rows=1 width=8) (actual time=12074.720..12074.724 rows=1 loops=1)
   ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=32) (actual time=9.803..12038.552 rows=100000 loops=1)
         Oracle query: SELECT /*f791544c1776485d*/ r1."C1" FROM "USR_PRODIGE"."T_CLOB" r1
 Planning Time: 141.919 ms
 Execution Time: 12075.063 ms
(5 lignes)

postgres=*# commit;
COMMIT
postgres=# explain analyze select count(1) from (select * from orcl2_usr_dh00.t_clob limit 1) r;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10010.01..10010.02 rows=1 width=8) (actual time=1.780..1.783 rows=1 loops=1)
   ->  Limit  (cost=10000.00..10010.00 rows=1 width=112) (actual time=1.755..1.756 rows=1 loops=1)
         ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=112) (actual time=1.751..1.752 rows=1 loops=1)
               Oracle query: SELECT /*5360013bbc0205f9*/ '1' FROM "USR_PRODIGE"."T_CLOB" r1 FETCH FIRST 1 ROWS ONLY
 Planning Time: 3.157 ms
 Execution Time: 1.930 ms
(6 lignes)

postgres=*# commit;
COMMIT
postgres=# explain analyze select count(1)  from orcl2_usr_dh00.t_clob ;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20002.50..20002.51 rows=1 width=8) (actual time=11747.133..11747.136 rows=1 loops=1)
   ->  Foreign Scan on t_clob  (cost=10000.00..20000.00 rows=1000 width=0) (actual time=1.123..11708.967 rows=100000 loops=1)
         Oracle query: SELECT /*6119503ffd9c8ead*/ '1' FROM "USR_PRODIGE"."T_CLOB" r1
 Planning Time: 2.557 ms
 Execution Time: 11747.251 ms
(5 lignes)

postgres=*# commit;
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.

I's not 11.2 specific but it only occurs with count(1) or count(*). Seems in fact very very minor.

laurenz / oracle_fdw

Crash with 2.6devel (array_prefetch) #582