laurenz / oracle_fdw

PostgreSQL Foreign Data Wrapper for Oracle
http://laurenz.github.io/oracle_fdw/

Is it possible to speed up JOINs between (large r/o) Oracle table and small PG one? #685

Closed gnudiff closed 4 months ago

gnudiff commented 4 months ago

My typical use of oracle_fdw is to pull the data from large Oracle tables according to criteria that sometimes need to be specified in my PG tables -- I only have read-only access to Oracle DB.

So, a typical run is to pull the matching rows from Oracle into my PG table, like:

INSERT INTO my_pg_table (cargoid, update_date)
SELECT ORA_T.cargoid, ORA_T.update_date
FROM ora_table ORA_T
[number of Oracle joins]
LEFT OUTER JOIN my_pg_table PG_T
   ON PG_T.cargoid = ORA_T.cargoid
WHERE [various Oracle checks]
  AND (PG_T.cargoid IS NULL
       OR ORA_T.update_date > PG_T.update_date)

This query takes 10 minutes due to the join of ORA_T and PG_T. If I run the query directly against Oracle (without trying to match existing cargoids), it takes 10 seconds.

The next part is "ask Oracle for details on pulled IDs", which is essentially:

SELECT [columns]
FROM oracle_details ORA_DT
WHERE [Oracle expressions]
  AND ORA_DT.cargoid IN (SELECT cargoid FROM my_pg_table)

This takes in excess of 20 minutes, whereas it only takes seconds if I supply the list of cargoids as a literal expression, IN (1, 2, 3, 4, 5, ...).

Currently I have workarounds like exporting the cargoid list to CSV and splicing it into the query as a literal IN (...) list (sketched below). But that is ugly, it will cease to work once I reach 1000+ cargoids (Oracle's limit for an IN() list), and it means I am not actually using oracle_fdw for these queries!
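
A minimal sketch of what that kind of workaround looks like, keeping the placeholders from the query above (the ID values here are made up):

-- Hand-built variant that runs in seconds: the cargoids are spliced
-- in as literals, so the whole condition is shipped to Oracle.
-- Oracle rejects more than 1000 elements in a single IN list.
SELECT [columns]
FROM oracle_details ORA_DT
WHERE [Oracle expressions]
  AND ORA_DT.cargoid IN (101, 102, 103);  -- hypothetical IDs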

The EXPLAIN output of the actual first query looks like this (ead_cargos is the only PG table in the plan):

GroupAggregate  (cost=30078.55..30078.75 rows=1 width=616)
  Group Key: c.cargoid, cdet.....(ORACLE tables keys)
  ->  Incremental Sort  (cost=30078.55..30078.59 rows=2 width=716)
        Sort Key: c.cargoid, cdet.... (ORACLE tables keys)
        Presorted Key: c.cargoid
        ->  Nested Loop  (cost=20063.49..30078.54 rows=1 width=716)
              Join Filter: (c.cargoid = cdet.cargoid) ... (ORACLE tables)
              ->  Merge Semi Join  (cost=10063.49..10063.54 rows=1 width=762)
                    Merge Cond: (c.cargoid = ead_cargos.cargoid) ------------------- this is join with PG table
                    ->  Sort  (cost=10050.06..10050.07 rows=5 width=754)
                          Sort Key: c.cargoid
                          ->  Foreign Scan  (cost=10000.00..10050.00 rows=5 width=754)
                                Oracle query: SELECT /*6c6b92bed826a19e*/ r1."CARGOID", ... (ORACLE table joins and WHERE clauses)
                    ->  Sort  (cost=13.44..13.44 rows=1 width=8)
                          Sort Key: ead_cargos.cargoid 
                          ->  Seq Scan on ead_cargos  (cost=0.00..13.43 rows=1 width=8)
                                Filter: (run_id = '{{ run_id }}'::text)
              ->  Foreign Scan on v_cargo_det cdet  (cost=10000.00..20000.00 rows=1000 width=94)
                    Oracle query: SELECT /*8fc3a042f6bee6c2*/ r4."CARGOID", r4."LINEITEMID", .... (ORACLE table columns)

Is there anything I can do to speed things up using oracle_fdw?

laurenz commented 4 months ago

I cannot think of any way to improve oracle_fdw so that it handles that more efficiently: a join between a foreign table and a local table can never be pushed down to Oracle, so PostgreSQL has to fetch the candidate rows from Oracle and perform the join locally.

But why do you need to create a CSV file? I would write a PL/pgSQL function that first queries the Oracle table for the list of IDs and stores it in a temporary table or an array. Then the function eliminates the IDs you don't want and finally constructs a dynamic SQL statement with the literal numbers in it.

Something like

CREATE FUNCTION xyz() RETURNS SETOF foreign_table_b
   LANGUAGE plpgsql AS
$$DECLARE
   ids bigint[];
BEGIN
   /* collect the IDs we still need from Oracle */
   SELECT array_agg(a.id) INTO ids
   FROM foreign_table_a AS a
   WHERE /* condition */
   AND NOT EXISTS (SELECT FROM local_table AS l
                   WHERE l.id = a.id);

   /* %L renders the array as a quoted literal (e.g. '{1,2,3}'),
      so the IDs appear in the statement as constants */
   RETURN QUERY EXECUTE
      format(
         'SELECT * FROM foreign_table_b WHERE id = ANY (%L)',
         ids
      );
END;$$;
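
A usage sketch, assuming foreign_table_b is defined as a foreign table on the Oracle server: the function is called like any set-returning function, and since the EXECUTEd statement contains the IDs as constants, the intent is that they end up as literals in the remote Oracle query, as described above.

SELECT * FROM xyz();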
laurenz commented 4 months ago

Can we close the issue?

gnudiff commented 4 months ago

> Can we close the issue?

Sure. I'll see if I can work with the function. Thanks a lot for considering the case and sketching the idea!

(As regards the CSV export: the whole queries are part of a task-flow app, and I might need to rerun tasks from previous results.)