schemacrawler / SchemaCrawler

Free database schema discovery and comprehension tool
http://www.schemacrawler.com/

Performance issues when running on a DB with thousands of objects #79

Closed pyrocks closed 8 years ago

pyrocks commented 8 years ago

Hi Sualeh,

It's me again - sorry for not responding earlier, I've been out of work for the past 7 weeks. The former two issues seem resolved, but I still have a severe performance issue running SchemaCrawler on a large DB.

From the queries it looks like you're fetching the info one table at a time, one procedure/function at a time, etc., and since I have thousands of objects DB-wide, each such fetch takes a second or so and it all adds up. Would it be possible, when no filters are used for tables/procedures etc., to fetch all the information needed in a single call per object type and parse/correlate it in code?

The reason I'm asking is that I'm trying to automate generating an ERD for a schema in a large DB. If that won't finish in under 5 minutes (and it's far from that right now), I'm back to the drawing board.

Thanks again, Mor

schemacrawler commented 8 years ago

I can take a look at it, Mor. However, I would recommend using SchemaCrawler's powerful grep functionality to create multiple ERDs for different domains within your schema. That may make the ERDs more usable. Also, why are you retrieving stored procedures for an ERD?

pyrocks commented 8 years ago

Hi,

You're right. I actually tried to generate an HTML document describing the schema for the same purpose - and I had to kill the batch after it ran for 30 minutes. The command I used: call java -classpath ../../schemacrawler/lib/;lib/ schemacrawler.Main -server=oracle -host=** -port=** -database=** -user=** -password=** -schemas=** -infolevel=standard -command=schema -outputformat=html -outputfile=***.html %_

As far as I understood, grep only allows me to add filters - but what if I want the entire ERD/info in a single document? Or did I misunderstand grep's function?

Anyway, I just tried to generate an ERD using ... -infolevel=minimum -command=graph -outputformat=pdf -outputfile=***.pdf % and it still spends a lot of time on the PK and ALL_ARGUMENTS/ALL_PROCEDURES queries - not sure why so far; I will run an AWR report tomorrow to analyze exactly where the time goes.

Mor

pyrocks commented 8 years ago

One more thing - if you want, I'll open a new issue for it: (SELECT A.TYPECODE FROM ALL_TYPES A WHERE A.TYPE_NAME = ARG.DATA_TYPE) - you have to use the owner there as well (AND A.OWNER = :owner OR A.OWNER IS NULL, if we assume one cannot override a reserved type name). Theoretically, one might define two different types in two different schemas with the same name - your query would either get the first one it finds or error out.
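As a minimal sketch of the owner-qualified lookup being suggested (assuming ARG aliases ALL_ARGUMENTS and :owner is the bind for the schema being crawled - the full outer query is not quoted here, so both are assumptions):

    -- hypothetical owner-aware type lookup; parenthesized so the OR
    -- does not leak past the owner condition
    SELECT A.TYPECODE
      FROM ALL_TYPES A
     WHERE A.TYPE_NAME = ARG.DATA_TYPE
       AND (A.OWNER = :owner OR A.OWNER IS NULL)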

Mor

schemacrawler commented 8 years ago

Yes, please enter a new issue for that.

schemacrawler commented 8 years ago

Mor, please send me the latest debug logs from SchemaCrawler. Please run SchemaCrawler with -loglevel=ALL, and send the log file to sualeh@hotmail.com.

pyrocks commented 8 years ago

Hi,

Sorry for the delay. I will start collecting the logs today and send them once they are ready.

P.S. From a security perspective, does -loglevel=ALL collect anything about the target DB? Because if that is the case, I might have to edit the logs before sending.

-Mor

pyrocks commented 8 years ago

One more question (unrelated to the topic) - why aren't you bundling ojdbc7.jar (for Oracle RDBMS)?

pyrocks commented 8 years ago

Hi,

I was finally able to get a result (HTML) in a timely fashion (although it still took 6.5 minutes) by disabling the crawling of routines (routines=[nothing]).

Looking at the log, I've got a few questions about the way SchemaCrawler works:

  1. Why do you load all the DB connectors when only one is being used? I think you should parse the -server argument and only then load the relevant DB connector/driver.
  2. If the -schemas argument is given, why do you need to "INFO: Retrieving all schemas", parse them one by one and discard those that are not relevant? Why not a simple query on ALL_USERS with the schema filter included (see the sketch after this list)? INFO: Total time taken for "crawlSchemas" - 00:00:01.084 hours - that's not much, but still a wasted second...
  3. I see repeated lines in a row, like these:
     ??? 10, 2016 2:35:24 PM schemacrawler.crawl.MetadataResultSet close INFO: "retrieveForeignKeysUsingSql" results had 212 rows
     ??? 10, 2016 2:35:24 PM schemacrawler.crawl.MetadataResultSet close INFO: "retrieveForeignKeysUsingSql" results had 212 rows
     ??? 10, 2016 2:35:24 PM schemacrawler.schemacrawler.RegularExpressionRule test FINE: Including "SCHEMA_NAME"
     ??? 10, 2016 2:35:24 PM schemacrawler.schemacrawler.RegularExpressionRule test FINE: Including "SCHEMA_NAME"...
  4. Total times so far:
     INFO: Total time taken for "crawlSchemas" - 00:00:01.084 hours
     INFO: Total time taken for "crawlDatabaseInfo" - 00:00:00.026 hours
     INFO: Total time taken for "crawlColumnDataTypes" - 00:00:00.045 hours
     INFO: Total time taken for "crawlTables" - 00:06:35.941 hours (out of which 96.7% - 00:06:22.887 - is "retrieveIndexes")
     INFO: Total time taken for "crawlRoutines" - 00:00:00.003 hours (because I disabled it - with it enabled, I never got to see the end of it, not even after an hour)
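Regarding item 2, a minimal sketch of the kind of filtered schema lookup meant there, assuming the value of the -schemas argument could be pushed into the dictionary query (SchemaCrawler actually applies its inclusion rule as a regular expression in code, so the bind shown here is hypothetical):

    -- hypothetical: push the schema filter into the dictionary query itself
    SELECT USERNAME AS SCHEMA_NAME
      FROM ALL_USERS
     WHERE USERNAME LIKE :schema_pattern ESCAPE '/'
     ORDER BY USERNAME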

So - if we/you can reduce the time of index crawling, it can finish in no time. I took another peek at the indexes SQL itself - I do not see it running more than once per crawl, and the SQL takes 3 seconds on average, so I'm not sure where all the time is being spent. The relevant log lines are as follows (note the time jump between 2:35:24 and 2:41:44):
??? 10, 2016 2:35:24 PM schemacrawler.schemacrawler.RegularExpressionRule test FINE: Including "SCHEMA_NAME"
??? 10, 2016 2:35:24 PM schemacrawler.crawl.SchemaCrawler lambda$crawlTables$30 INFO: Retrieving primary keys and indexes
??? 10, 2016 2:41:44 PM schemacrawler.crawl.IndexRetriever retrieveIndexes INFO: Retrieving indexes, using SQL
??? 10, 2016 2:41:44 PM schemacrawler.utility.QueryUtility executeAgainstSchema FINE: Executing INDEXES: SELECT /*+ PARALLEL(AUTO) */ ...

-Mor

schemacrawler commented 8 years ago

Mor, please optimize your fixed object statistics - that may help.

pyrocks commented 8 years ago

Hi,

I accept your answers, since in the end those things seem to "waste" little time relative to the total (but they seemed relevant to ask about). It's not an issue of dictionary stats/fixed views - as I have mentioned, it's a huge DB with tens of thousands of objects.

I could not tell from the log what runs or consumes all the mentioned time, even though I used -loglevel=ALL. How come there is such a jump in time between two consecutive lines? I would expect every operation to be listed with its real timestamp, as in "running query X with bind Y - took x seconds, parse took y seconds", and then on to the next query. If the log "spills" all the lines at the exact same second after everything has already finished, it is missing its purpose in a way.

Anyway - I'll do a proper analysis with AWR (now that it finishes in a timely manner) and get back with my findings.

Mor

pyrocks commented 8 years ago

Hi,

From AWR I was able to see the following SQL as the most problematic:

SELECT NULL AS table_cat, c.owner AS table_schem, c.table_name, c.column_name, c.position AS key_seq, c.constraint_name AS pk_name
  FROM all_cons_columns c, all_constraints k
 WHERE k.constraint_type = 'P'
   AND k.table_name = :1
   AND k.owner LIKE :2 ESCAPE '/'
   AND k.constraint_name = c.constraint_name
   AND k.table_name = c.table_name
   AND k.owner = c.owner
 ORDER BY column_name

On the large DB: 366 executions, 1.16 sec each on average, for a total of over 7 minutes. On a much smaller (and less active) DB with a similar schema structure: 353 executions, 0.06 sec each on average, for a total of 22.9 seconds. I assume the number of executions depends on the number of objects that need to be crawled - the question is whether this can be achieved in fewer fetch calls, i.e. get all the info in one query and parse it in code.
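As a rough sketch of what a single-pass version might look like (simply dropping the per-table bind so one execution covers the whole schema; the column list is kept identical to the per-table query above, the :schema bind name is illustrative, and correlating rows back to their tables would then have to happen in code):

    SELECT NULL AS table_cat, c.owner AS table_schem, c.table_name, c.column_name, c.position AS key_seq, c.constraint_name AS pk_name
      FROM all_cons_columns c, all_constraints k
     WHERE k.constraint_type = 'P'
       AND k.owner LIKE :schema ESCAPE '/'
       AND k.constraint_name = c.constraint_name
       AND k.table_name = c.table_name
       AND k.owner = c.owner
     ORDER BY c.table_name, c.position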

The second query that took a lot of time is the one fetching FKs - but it only ran once and took 32 seconds on the larger DB and 21.34 seconds on the smaller DB, so I can live with that.

Let me know what you think.

-Mor

pyrocks commented 8 years ago

One more thing I noticed in the output HTML - for Oracle, you should omit the word "clustered" from the index descriptions, since there is no such thing as clustered/non-clustered indexes in Oracle (there are other types - B-tree (the default), reverse key, bitmap, IOT, etc.; I didn't check whether you report them correctly).
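For reference, Oracle's data dictionary exposes the actual index type directly, so the real types could be reported instead of a clustered/non-clustered flag. A minimal illustrative query (the :schema bind is just a placeholder):

    SELECT owner, index_name, table_name,
           index_type,   -- e.g. NORMAL, NORMAL/REV, BITMAP, FUNCTION-BASED NORMAL, IOT - TOP
           uniqueness
      FROM all_indexes
     WHERE owner = :schema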

schemacrawler commented 8 years ago

Mor, I will take a look at the primary key retrieval in SchemaCrawler, and try to save some time there.

pyrocks commented 8 years ago

Thanks. One more thing: is it possible to pass connection properties, or to use a JDBC thin URL (like Oracle's EZConnect) instead of host, port, and database? The reason I'm asking is that on our large DB, connections made without tnsnames are SHARED rather than DEDICATED, and this affects the total time due to the "virtual circuit wait" event (the user process waiting for a server process to become available to handle the request). I want to force it to use a dedicated connection.

schemacrawler commented 8 years ago

Absolutely - use the -url argument. Please run SchemaCrawler with -help for details.

pyrocks commented 8 years ago

I missed that, sorry. -url is working like a charm. It improved the run time, but only by a small margin and not something substantial - again because of that PK query loop.

Another idea I had - if you cannot gather all the info in a single query and parse it in code, perhaps open multiple connections (with the pool size configured by a parameter) and use all of them for parallel processing. This would require working with threads, I assume, but it would finish much faster than crawling serially. WDYT?

-Mor

schemacrawler commented 8 years ago

I can use a single query for primary keys. That should speed things up. Threading will come with its own set of problems (and bugs).

pyrocks commented 8 years ago

Great, please let me know as soon as you have something ready for testing. I'm sure it would reduce the times by more than half for the large DB.

I might not contribute code, but I hope my inputs are still valuable :) If this goes well and my managers approve adding this to our build process - you could add me and the company I work for to your "who uses SchemaCrawler" page :)

schemacrawler commented 8 years ago

Please use SchemaCrawler 14.09.02.

pyrocks commented 8 years ago

What an improvement! On the large DB it now takes 25 seconds in total, compared to 7+ minutes before the improvement - well done! The breakdown is pretty much this:

On the smaller DB times are now improved to 11 seconds. Finally something I can work with...

Thanks again for all the support and fast response.

schemacrawler commented 8 years ago

Great! I am so glad it worked for you.

cemeyer2 commented 7 years ago

@sualeh are there any plans to support parallelization in the future? I am running schemacrawler on a database schema which has over 50k tables and it takes several hours to complete. Thanks!

adriens commented 7 years ago

Hi @pyrocks , did you try to increase jvm memory options ?

pyrocks commented 7 years ago

Hi,

Well, I'm no longer working at the same place, hence I am not using SchemaCrawler any more, but I don't think I fiddled with JVM options. You can always try it yourself...

Pyrocks


schemacrawler commented 7 years ago

Charlie (@cemeyer2), no, there are no plans to support parallelization in the future. I would recommend that you filter what you are looking for using SchemaCrawler options. You can restrict both the number of objects retrieved and the amount of metadata retrieved for each object. You have not mentioned which database you are using, but SchemaCrawler is highly optimized for the major databases. You have to make sure that the correct plugin for that database is getting loaded, though. After all this, if you still need to retrieve a large number of objects, consider using SchemaCrawler's offline snapshot feature to make future access significantly faster.