schemacrawler / SchemaCrawler

Free database schema discovery and comprehension tool
http://www.schemacrawler.com/

Retrieving system column data types is slow on postgres #229

Closed - pyrocks closed this issue 5 years ago

pyrocks commented 5 years ago

Issue

Hi, it's me again :) Now on a new front: I'm trying to use SchemaCrawler with PostgreSQL on RDS. For now everything works fine, except that "Retrieving system column data types" takes more than a minute for a database with just a single table (I'm still experimenting).

I don't know what to expect, really, but it seems like this shouldn't take so long (a few seconds at most is what I would expect). I'm not as experienced with PostgreSQL as I am with Oracle, so it might take me a while to analyze what's going on in the database.

Following is the command I used:

    ./schemacrawler.sh -server=postgresql -host=*** -port=*** -database=*** -schemas=*** -user=*** \
      -infolevel=standard -command=schema -outputformat=pdf -outputfile=./entities.pdf \
      -loglevel=info -noinfo

And here is the log:

    ...
    Nov 18, 2018 9:48:24 AM schemacrawler.crawl.SchemaCrawler crawlSchemas
    INFO: Retrieved 1 schemas
    Nov 18, 2018 9:48:24 AM schemacrawler.crawl.SchemaCrawler crawlColumnDataTypes
    INFO: Crawling column data types
    Nov 18, 2018 9:48:24 AM schemacrawler.crawl.SchemaCrawler lambda$crawlColumnDataTypes$0
    INFO: Retrieving system column data types
    Nov 18, 2018 9:49:37 AM schemacrawler.crawl.SchemaCrawler lambda$crawlColumnDataTypes$1
    INFO: Not retrieving user column data types, since this was not requested
    Nov 18, 2018 9:49:37 AM schemacrawler.crawl.SchemaCrawler crawlColumnDataTypes
    INFO: Total time taken for - 00:01:13.014 hours
    -100.0% - 00:01:13.013 -
    -0.0% - 00:00:00.001 -
    ...

Let me know what additional info I can provide.

Environment

SchemaCrawler 15.01.06 on macOS 10.13.6. The database is PostgreSQL 10.4 on Amazon RDS, using the embedded JDBC driver.

Thanks, Mor

adriens commented 5 years ago

And what happens if you run the command on a local copy of that database, please?

pyrocks commented 5 years ago

The problem is that I do not have a local PostgreSQL installation; all of my databases are on RDS. I was brave enough to run the same command with -loglevel=all. It first runs a query to get the available types:

    SELECT t.typname, t.oid
    FROM pg_catalog.pg_type t
    JOIN pg_catalog.pg_namespace n ON (t.typnamespace = n.oid)
    WHERE n.nspname != 'pg_toast'

Then it runs the following query in a loop, binding a different type name each time:

    SELECT typinput='array_in'::regproc, typtype
    FROM pg_catalog.pg_type
    LEFT JOIN (select ns.oid as nspoid, ns.nspname, r.r
               from pg_namespace as ns
               join (select s.r, (current_schemas(false))[s.r] as nspname
                     from generate_series(1, array_upper(current_schemas(false), 1)) as s(r)) as r
               using (nspname)) as sp
    ON sp.nspoid = typnamespace
    WHERE typname = $1
    ORDER BY sp.r, pg_type.oid DESC
    LIMIT 1

    FE=> Bind(stmt=null,portal=null,$1=<'regproc'>,type=VARCHAR)
    ...
    FE=> Bind(stmt=null,portal=null,$1=<'tid'>,type=VARCHAR)
    ...

This looks inefficient. I am fairly certain the same information could be retrieved in a single fetch and then parsed in code. I will take a look at how to fetch it all in one go.
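Something along these lines might fetch it all in one round trip; this is an untested sketch on my end, not a query the driver runs. PostgreSQL's DISTINCT ON keeps one row per type name, which should mirror the per-name ORDER BY ... LIMIT 1 semantics of the looped query:

    -- Untested sketch: collapse the per-type-name loop into a single query.
    -- DISTINCT ON (typname) keeps the first row per name according to the
    -- ORDER BY, so each name resolves the same way as with LIMIT 1.
    SELECT DISTINCT ON (t.typname)
           t.typname,
           t.typinput = 'array_in'::regproc AS is_array_type,
           t.typtype
    FROM pg_catalog.pg_type t
    LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r
               FROM pg_namespace AS ns
               JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname
                     FROM generate_series(1, array_upper(current_schemas(false), 1)) AS s(r)) AS r
               USING (nspname)) AS sp
      ON sp.nspoid = t.typnamespace
    ORDER BY t.typname, sp.r, t.oid DESC;

The driver could then answer each per-name lookup from this result set in memory instead of making another round trip to the server.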

adriens commented 5 years ago

For now, what you could do is grab an offline SchemaCrawler export and then work with it locally, just to make it possible for you to investigate and report efficiently.

pyrocks commented 5 years ago

I'm not entirely sure what you're aiming for. My use case is RDS. I found a query running in a loop which, if it could be run once, would improve performance substantially in such high-latency environments: pg_type holds a few hundred rows, so at a few hundred milliseconds per round trip to RDS, the loop by itself can plausibly account for the 73 seconds in the log. I do not see the point in running it locally; it would obviously be faster, but not necessarily more efficient.

sualeh commented 5 years ago

@pyrocks - SchemaCrawler relies on the PostgreSQL JDBC driver to provide type information. You are right that the PostgreSQL JDBC driver is inefficient in how it obtains type information. You can find the queries it runs in PgDatabaseMetaData.java. It seems that the driver does some post-processing after it retrieves data from the database. You are better off submitting a patch for the driver.

sualeh commented 5 years ago

@pyrocks - the inefficient loop takes place in PgDatabaseMetaData.java#L1138-L1139. That statement in turn calls TypeInfoCache.java#L204-L216, which is very inefficient.

pyrocks commented 5 years ago

@sualeh nice hearing from you again :) Thanks for the assistance; I had no idea the problem was in the JDBC driver. I will monitor the issue.

sualeh commented 5 years ago

Glad to hear from you, @pyrocks. Yes, please follow the issue I opened with the PostgreSQL team: https://github.com/pgjdbc/pgjdbc/issues/1342

conker84 commented 5 years ago

Wouldn't it be possible to implement a workaround in the PostgreSQL-specific submodule of SchemaCrawler, i.e. by manually querying those tables?

sualeh commented 5 years ago

@conker84 - yes, that would have been the ideal way to go. I looked at the PostgreSQL JDBC driver, and it looks like PostgreSQL does not provide a clean set of data dictionary or information schema tables for this purpose.
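A workaround would therefore have to start from the raw pg_catalog tables and redo the driver's post-processing. Just to illustrate the kind of starting point involved (a sketch of mine, not a query SchemaCrawler runs today):

    -- Sketch only: raw type data straight from the system catalogs.
    -- Mapping these values onto JDBC's getTypeInfo() columns (DATA_TYPE,
    -- PRECISION, and so on) is the interpretation work the driver
    -- currently does in Java.
    SELECT n.nspname AS schema_name,
           t.typname AS type_name,
           t.typtype,   -- 'b' base, 'c' composite, 'd' domain, 'e' enum, 'p' pseudo, 'r' range
           t.typnotnull,
           pg_catalog.format_type(t.oid, NULL) AS formatted_type
    FROM pg_catalog.pg_type t
    JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
    WHERE n.nspname <> 'pg_toast'
    ORDER BY n.nspname, t.typname;

The real effort would be in reproducing the driver's interpretation of these rows, not in the query itself.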