Performance lack retrieving table information of large database

thomasjoscht commented 7 years ago

Hey Guys, I have a simple use case loading table information including table names, column names and data types of one database. One of my target databases is a PostgreSql with 900+ tables and 10k+ columns. I've tried:

Load information with DatabaseReader.ReadAll() without owner. This will take 3 minutes and more.
Load information with DatabaseReader.ReadAll() with owner. This will take up to 1 minute.
Load information with DatabaseReader.AllTables(). This will take up to 1 minute
Extending lib with custom method: Mainly based on ReaderAdapter.Tables(null), ReaderAdapater.Columns(null) and DatabaseReader.DataTypes(). This will take only few (3 - 5) seconds!

Does a general peformance lack exist in large databases? Have you any other suggestion for retrieving my necessary information? I could create a pull request if no other solution exists.

martinjw commented 7 years ago

Yes, performance will be bad on large databases, We're loading a lot of different things under the covers - primary keys, foreign keys, indexes - which is a lot of work.

There's an incremental technique mentioned here: https://github.com/martinjw/dbschemareader/wiki/More-schema-reading#reading-incrementally The DatabaseReader.TableList() is just reading the table names, so that's really quick. In the document, I'm loading the columns, constraints and indexes table by table, which will be slow for you (it works really well in a UI when the user manually drills down). If you just use the internal ReaderAdapter to get columns, you won't have constraints- so no primary keys etc. So, yes, table names and columns are pretty quick, but when you want more information, the time starts to add up. I think I'd want at least pks and foreign keys for a minimal but quick read. What do you think?

thomasjoscht commented 7 years ago

Sorry, next post will be a little bit long but contains lot of info ;-)

I'v done a little deeper investigation on my test database on postrgesql. The following results could be a hint for time consumption but is not representative in all circumstances. My database consists of:

948 Tables (1111 without owner)
9173 Columns (10717 without owner)
0 IdentityColumns
3621 CheckConstraints
943 PrimaryKeys
0 UniqueKeys
1834 ForeignKeys
0 DefaultConstraints
0 Triggers
738 TableDescriptions
845 ColumnDecscriptions
0 ComputedColumns
0 Indexes

I created 4 profiling runs: ReadAll, ReadAll with owner, AllTables, AllTables with owner. Here are the results for ReadAll:

ReadAll -- All Users: 0s -- All Tables: 150s -- All Views: 0s -- All SPs: 0s -- All Seqs: 0s -- Update Refs: 1s
ReadAll with owner -- All Users: 0s -- All Tables: 53s -- All Views: 0s -- All SPs: 0s -- All Seqs: 0s -- Update Refs: 0s

We see that time consuming is nearly only All Tables. Therefore all other could be ignored. Differences with owner is also big, which is clearly.

Next step is to analyze AllTables which will lead me to the following results:

AllTables -- Get Tables: 0s -- Get Columns: 3s -- Get IdentityColumns: 0s -- Get CheckConstraints: 1s -- Get PrimaryKeys: 28s -- Get UniqueKeys: 0s -- Get ForeignKeys: 60s -- Get DefaultConstraints: 0s -- Get Triggers: 0s -- Get TableDescriptions: 0s -- Get ColumnDescriptions: 1s -- Get ComputedColumns: 0s -- MergeIndexColumns: 0s -- Update DataTable objects: 45s
AllTables with owner -- Get Tables: 0s -- Get Columns: 2,5s -- Get IdentityColumns: 0s -- Get CheckConstraints: 1s -- Get PrimaryKeys: 2s -- Get UniqueKeys: 0s -- Get ForeignKeys: 10s -- Get DefaultConstraints: 0s -- Get Triggers: 0s -- Get TableDescriptions: 0s -- Get ColumnDescriptions: 1s -- Get ComputedColumns: 0s -- MergeIndexColumns: 0s -- Update DataTable objects: 40s

In my case exists 2 parts which consumes nearly all time: Determine PK's+FK's and update all DataTable objects in foreach (in TableBuilder.Execute()).

Update DataTable objects -- Update Columns: 0,5s -- Update Identities: 0s -- Update CheckConstraints: 0s -- Update Computed: 0s -- Update Constraints: 0s -- Merge Indexes: 40s -- Update Indexes: 0s -- Update Triggers: 0s -- Update Descriptions: 0s -- Post Processing: 0s

In update DataTable objects the merge index processing which reads indexes per table with ReaderAdapter.Indexes(...) takes long time although no indexes exist.

In the end: I must eliminate PK's+FK's part and computing indexes because I don't need full conditioned informations.

thomasjoscht commented 7 years ago

Solution 1 - Options

My suggestion will contain 2 steps:

First a new method ReadSchema() (or overload of ReadAll) with ReadSchemaOptions parameter. In ReadSchemaOptions every computing part could be turned on and off like
- AllUsers=false
- AllTables=true
- AllViews=false
- AllStoredProcedures=false
- AllSequences=false
- ReadTableOptions = new ReadTableOptions();
Second a new overload for TableBuilder.Execute with ReadTableOptions parameter. In ReadTableOptions could also each part turned on and off like
- [ReadTables=true]
- Columns=true
- IdentityColumns=false
- CheckConstraints=false
- PrimaryKeys=false
- UniqueKeys=false
- ForeignKeys=false
- DefaultConstraints=false
- Indexes=false
- Triggers=false
- TableDescriptions=false
- ColumnDescriptions=false

Solution 2 - New methods

An alternative to my suggestion could be to determine all things out of lib. But this will be problematic because a deep knowledge of processing must be exist and necessary information must be available during public methods. My main problems were:

TableList will not Update DatabaseSchema (therefore a later DataTypes() call will not work).
No ColumnsList method exist.
No DataTypes exists in _readerAdapter.Columns(null).

So in the moment I've extended the lib to address my needs with a custom method

public IList<DatabaseTable> TablesAndColumnsOnly()
{
    RaiseReadingProgress(SchemaObjectType.Tables);
    IList<DatabaseTable> tables;
    using (_readerAdapter.CreateConnection())
    {
        tables = _readerAdapter.Tables(null);
    }
    DatabaseSchema.Tables.Clear();
    DatabaseSchema.Tables.AddRange(tables);

    RaiseReadingProgress(SchemaObjectType.Columns);
    IList<DatabaseColumn> columns;
    using (_readerAdapter.CreateConnection())
    {
        columns = _readerAdapter.Columns(null);
    }

    foreach (var table in DatabaseSchema.Tables)
    {
        table.Columns.Clear();
        table.Columns.AddRange(
            columns.Where(x => string.Equals(x.TableName, table.Name, StringComparison.OrdinalIgnoreCase)
                                && string.Equals(x.SchemaOwner, table.SchemaOwner, StringComparison.OrdinalIgnoreCase)));
    }

    DataTypes();

    return DatabaseSchema.Tables;
}

This method takes only nearly 4 seconds without owner in my scenario described in comment before.

Solution 3 - Public ReaderAdapter

Another solution could be to make the ReaderAdapter of DatabaseReader public. In this case I could write an extension method which works like described in Solution 2. This could be theoretical the simplest modification of the lib. BUT: Problematic is ReaderAdapter class (and other parts) are internal and must be accessable also. Therefore this solution is not an option.

martinjw commented 7 years ago

Solution 1, with loads of options, seems like overkill. Solution 2, TablesAndColumnsOnly, is fine for your scenario. Solution 3, exposing internals, would make a larger, more complicated API surface.

If you want to put solution 2 into a PR, I'll take it.

thomasjoscht commented 7 years ago

Ok. I've created a PR for this. New method named "TablesQuickView".

PS: Currently my Visual Studio 2017 Enterprise Version 15.3.4 is crashing opening project DatabaseSchemaReader.csproj. It seems there is a problem with multi framework targeting. I'm not alone with this problem: see here Change TargetFrameworks property to only one (no matter which one) solves the problem temporarily.

martinjw / dbschemareader