laurenz / oracle_fdw

PostgreSQL Foreign Data Wrapper for Oracle
http://laurenz.github.io/oracle_fdw/

Is it possible to speed up JOINs between (large r/o) Oracle table and small PG one? #685

Closed gnudiff closed 4 months ago

gnudiff commented 4 months ago

My typical use of oracle_fdw is to pull the data from large Oracle tables according to criteria that sometimes need to be specified in my PG tables -- I only have read-only access to Oracle DB.

So, a typical run is to pull the matching rows from Oracle into my PG table, like:

INSERT INTO my_pg_table (cargoid, update_date)
SELECT ORA_T.cargoid, ORA_T.update_date
FROM ora_table ORA_T
[number of Oracle joins]
LEFT OUTER JOIN my_pg_table PG_T
   ON PG_T.cargoid = ORA_T.cargoid
WHERE [various Oracle checks]
  AND (PG_T.cargoid IS NULL
       OR ORA_T.update_date > PG_T.update_date)

This query takes 10 minutes due to the join of ORA_T and PG_T. If I run the query directly against Oracle (without trying to match existing cargoids), it takes 10 seconds.

The next part is "ask Oracle for details on pulled IDs", which is essentially:

SELECT [columns]
FROM oracle_details ORA_DT
WHERE [Oracle expressions]
  AND ORA_DT.cargoid IN (SELECT cargoid FROM my_pg_table)

This takes in excess of 20 minutes, whereas it only takes seconds if I supply the list of cargoids as a literal expression, IN (1, 2, 3, 4, 5, ...).

Currently I have workarounds like exporting the cargoid list to CSV and splicing it into the query as a literal IN (...) list (sketched below). But that is ugly, it will cease to work once I reach 1000+ cargoids (Oracle's limit for an IN() list), and it means I am not actually using oracle_fdw for these queries!
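
A minimal sketch of what that kind of workaround looks like, keeping the placeholders from the query above (the ID values here are made up):

-- Hand-built variant that runs in seconds: the cargoids are spliced
-- in as literals, so the whole condition is shipped to Oracle.
-- Oracle rejects more than 1000 elements in a single IN list.
SELECT [columns]
FROM oracle_details ORA_DT
WHERE [Oracle expressions]
  AND ORA_DT.cargoid IN (101, 102, 103);  -- hypothetical IDs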

The EXPLAIN output of the actual first query looks like this (ead_cargos is the only PG table in the plan):

GroupAggregate  (cost=30078.55..30078.75 rows=1 width=616)
  Group Key: c.cargoid, cdet.....(ORACLE tables keys)
  ->  Incremental Sort  (cost=30078.55..30078.59 rows=2 width=716)
        Sort Key: c.cargoid, cdet.... (ORACLE tables keys)
        Presorted Key: c.cargoid
        ->  Nested Loop  (cost=20063.49..30078.54 rows=1 width=716)
              Join Filter: (c.cargoid = cdet.cargoid) ... (ORACLE tables)
              ->  Merge Semi Join  (cost=10063.49..10063.54 rows=1 width=762)
                    Merge Cond: (c.cargoid = ead_cargos.cargoid) ------------------- this is join with PG table
                    ->  Sort  (cost=10050.06..10050.07 rows=5 width=754)
                          Sort Key: c.cargoid
                          ->  Foreign Scan  (cost=10000.00..10050.00 rows=5 width=754)
                                Oracle query: SELECT /*6c6b92bed826a19e*/ r1."CARGOID", ... (ORACLE table joins and WHERE clauses)
                    ->  Sort  (cost=13.44..13.44 rows=1 width=8)
                          Sort Key: ead_cargos.cargoid 
                          ->  Seq Scan on ead_cargos  (cost=0.00..13.43 rows=1 width=8)
                                Filter: (run_id = '{{ run_id }}'::text)
              ->  Foreign Scan on v_cargo_det cdet  (cost=10000.00..20000.00 rows=1000 width=94)
                    Oracle query: SELECT /*8fc3a042f6bee6c2*/ r4."CARGOID", r4."LINEITEMID", .... (ORACLE table columns)

Is there anything I can do to speed things up using oracle_fdw?

laurenz commented 4 months ago

I cannot think of any way to improve oracle_fdw so that it handles that more efficiently: a join between a foreign table and a local table can never be pushed down to Oracle, so PostgreSQL has to fetch the candidate rows from Oracle and perform the join locally.

But why do you need to create a CSV file? I would write a PL/pgSQL function that first queries the Oracle table for the list of IDs and stores it in a temporary table or an array. Then the function eliminates the IDs you don't want and finally constructs a dynamic SQL statement with the literal numbers in it.

Something like

CREATE FUNCTION xyz() RETURNS SETOF foreign_table_b
   LANGUAGE plpgsql AS
$$DECLARE
   ids bigint[];
BEGIN
   /* collect the IDs we still need from Oracle */
   SELECT array_agg(a.id) INTO ids
   FROM foreign_table_a AS a
   WHERE /* condition */
   AND NOT EXISTS (SELECT FROM local_table AS l
                   WHERE l.id = a.id);

   /* %L renders the array as a quoted literal (e.g. '{1,2,3}'),
      so the IDs appear in the statement as constants */
   RETURN QUERY EXECUTE
      format(
         'SELECT * FROM foreign_table_b WHERE id = ANY (%L)',
         ids
      );
END;$$;
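
A usage sketch, assuming foreign_table_b is defined as a foreign table on the Oracle server: the function is called like any set-returning function, and since the EXECUTEd statement contains the IDs as constants, the intent is that they end up as literals in the remote Oracle query, as described above.

SELECT * FROM xyz();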
laurenz commented 4 months ago

Can we close the issue?

gnudiff commented 4 months ago

> Can we close the issue?

Sure. I'll see if I can work with the function. Thanks a lot for considering the case and sketching the idea!

(As regards the CSV export: the whole queries are part of a task-flow app, and I might need to rerun tasks from previous results.)