m-kuhn opened 1 year ago
Hi @m-kuhn this is a very interesting proposal!
By coincidence, we got a related piece of work funded a few days ago: to make project loading parallel - for a client that uses project files with hundreds of map layers (mainly postgis and remote COGs). The basic idea is to first create the data providers in parallel in a thread, and then back on the main thread create map layer objects. I will try to prepare a more detailed spec soon when I get to it.
I have considered also an approach similar to yours, but decided not to go ahead because each provider would need special treatment and with raster layers using remote COGs we don't have much control as all the caching is done by GDAL.
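The approach described above can be sketched in plain Python (the provider and layer objects below are stand-ins, not the actual QGIS classes): open the providers concurrently in worker threads, then assemble the layer objects back on the main thread.

```python
# Sketch of the described approach, using plain-Python stand-ins
# for providers and layers (not actual QGIS API).
from concurrent.futures import ThreadPoolExecutor

def open_provider(uri):
    # Placeholder for the expensive part: network/file I/O, SQL queries, GDALOpen, ...
    return {"uri": uri, "metadata": {"extent": None}}

uris = ["postgres://.../roads", "/vsicurl/https://example.com/dem.tif"]

# Providers are created in parallel in worker threads ...
with ThreadPoolExecutor(max_workers=8) as pool:
    providers = list(pool.map(open_provider, uris))

# ... and back on the main thread the map layer objects are created.
layers = [{"name": p["uri"], "provider": p} for p in providers]
```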
Regarding your proposal:
Thanks for the feedback @wonder-sk
The basic idea is to first create the data providers in parallel in a thread, and then back on the main thread create map layer objects. I will try to prepare a more detailed spec soon when I get to it. I have considered also an approach similar to yours, but decided not to go ahead because each provider would need special treatment and with raster layers using remote COGs we don't have much control as all the caching is done by GDAL.
I understand that this is tempting when thinking about project files and desktop usage. For other scenarios (standalone apps, web services) it would be nice to have more control over the cache location. Something else is shared initialization (postgres version, optimizing queries to "where table name in (..., ...)" instead of "where table name = ..."), which is probably less clean to implement this way. Last but not least, the parallelization gain is limited by the thread pool size of the postgres, ogr etc. providers; that's where an external cache location could really shine. In short: I'd rather go for a future-proof API than a quick win here.
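The query-batching idea above can be sketched like this (the table names and the query against the PostGIS `geometry_columns` view are illustrative only; a real provider would use its own SQL):

```python
# Sketch: issue one batched metadata query instead of one query per table.
# The query text is illustrative; real providers would use provider-specific SQL.

def build_batched_query(table_names):
    """Build one "WHERE ... IN (...)" query instead of N "WHERE ... =" lookups."""
    placeholders = ", ".join("%s" for _ in table_names)
    sql = (
        "SELECT f_table_name, srid, type FROM geometry_columns "
        f"WHERE f_table_name IN ({placeholders})"
    )
    return sql, list(table_names)

sql, params = build_batched_query(["roads", "rivers", "parcels"])
# One server round trip for three layers instead of three round trips.
```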
~In this proposal in the case of COG, loading a project with lazy loading activated (QGIS desktop, the case of your client), it will call GDALOpen in a thread inside QgsProviderMetaDataLoaderGDALTask::cachableMetaData()
and GDAL will initialize its metadata there.~ Having GDAL metadata cachable in external caches will require work with upstream.
backwards compatibility - isn't this going to have a much larger impact?
- for core/gui - won't we need to deal with the case of some layers being still initialized in many places? (e.g. if I try to identify features or run some processing algs, and the needed layers are not initialized yet - do we wait or fail?)
This will be treated the same as an invalid data provider, a concept that already exists. Most things will simply fail. A wait concept can be added later on.
- for plugins - if I understand correctly, in the proposed solution, QgsProject will emit a signal that project loading is complete, yet even "valid" layers will still be getting initialized, so the plugins would need to modify the code when expecting some layers to be present?
If lazy loading is enabled, layers will initially be invalid and only become valid over time. Plugins cannot trust for all layers to be fully initialized when the project is loaded, correct. A signal could be added that is emitted when all layer loading tasks initiated by opening a QGIS project are finished.
If lazy loading is not enabled, projects will continue to initialize the way they currently do.
Correction: GDAL will require some work upstream to be compatible with external metadata. In any case, the call to GDALOpen will still need to happen somewhere. I'm wondering if this should rather be an "init on first use" approach similar to postgres/ogr with a connection pool.
GDAL will require some work upstream to be compatible with external metadata.
GDAL itself, or the QGIS GDAL & OGR providers? I'm not sure how that could be done in GDAL itself without changing each of its 250 drivers. Or maybe by using some GDAL dataset proxying mechanism, which exists at least on the raster side for the VRT driver: as a VRT file can reference thousands of source rasters, it doesn't reference them directly, but through a proxy dataset that opens the underlying dataset on demand.
I was thinking that in GDAL itself it should be possible to add this incrementally for specific drivers, while others would continue to always init on open (similar to this proposal here). Your idea to use a proxy dataset via VRT seems quite interesting as well. With that it would even be possible to avoid any HTTP calls already now -- purely by changing the local configuration, without any code changes on the QGIS side -- if I understand correctly? As for OGR, I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL vs. by QGIS and how much optimization is needed on this front.
This QEP focuses on the part of the code where QGIS itself does a lot of (network) I/O at loading time, and provides APIs to group requests, get them off the main thread, and cache where applicable.
I was thinking that in GDAL itself it should be possible to add this incrementally for specific drivers, while others will continue to always init on open
My intuition / gut feeling is that we don't want to mess in GDAL drivers themselves. Too much complication & work (even if done incrementally). Drivers should not have to deal with that kind of higher level optimization. A number of GDAL drivers already implement deferred loading of metadata when that's technically possible, that is, for information that isn't crucial to even establish that the dataset is valid, and where the information is obtained through a virtual method (like GDALDataset::GetSpatialRef()) and not a non-virtual method (like GDALDataset::GetXSize()).
Your idea to use a proxy dataset via VRT seems quite interesting as well. With that it would even be possible to avoid any HTTP calls already now
yes, try
gdal_translate /vsicurl/http://download.osgeo.org/gdal/data/gtiff/utm.tif out.vrt
CPL_VSIL_SHOW_NETWORK_STATS=YES python -c "from osgeo import gdal; gdal.Open('out.vrt')"
vs
CPL_VSIL_SHOW_NETWORK_STATS=YES python -c "from osgeo import gdal; gdal.Open('/vsicurl/http://download.osgeo.org/gdal/data/gtiff/utm.tif')"
(if you try with gdalinfo you'll see a network access, because GetOverviewCount() is called, and that's forwarded to the source)
As for OGR I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL
For GeoPackage, on the GDAL side itself, that is relatively cheap in "normal" situations where you don't have a big number of layers per dataset. What can be costly is if you have layers with the generic Geometry type, in which case opening will iterate over all features to discover their geometry types (an optimization was done recently to make that about 10x faster, though)
OGR_VRT can also be used in a similar way as the raster side.
That said VRT datasets are mostly read-only, so if you need to edit, the provider would have to open the "real" underlying dataset.
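For illustration, an OGR VRT wrapper that declares the geometry type and SRS up front could look like the following (dataset and layer names here are made up; the element names follow the OGR VRT driver format):

```python
# Sketch: an OGR VRT that declares geometry type and SRS up front, so opening
# it does not require scanning the source features. "data.gpkg" and "mylayer"
# are placeholder names.
vrt = """\
<OGRVRTDataSource>
  <OGRVRTLayer name="mylayer">
    <SrcDataSource>data.gpkg</SrcDataSource>
    <SrcLayer>mylayer</SrcLayer>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>EPSG:4326</LayerSRS>
  </OGRVRTLayer>
</OGRVRTDataSource>
"""
# Opening this VRT with OGR would report wkbPoint without probing the rows of
# data.gpkg -- but, as noted above, access through the VRT is read-only.
```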
I have no immediate plans for bigger refactorings in GDAL
As for OGR I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL
For GeoPackage, on the GDAL side itself, that is relatively cheap in "normal" situations where you don't have a big number of layers per dataset. What can be costly is if you have layers with the generic Geometry type, in which case opening will iterate over all features to discover their geometry types (an optimization was done recently to make that about 10x faster, though)
This sounds like the GeoJSON example in the QEP description. I must admit I just assumed it's QGIS doing this inspection and not GDAL, and did not check. Being able to inject this kind of knowledge into the dataset construction process would be nice.
OGR_VRT can also be used in a similar way as the raster side.
That said VRT datasets are mostly read-only, so if you need to edit, the provider would have to open the "real" underlying dataset.
That's interesting as well, and for many cases the read-only restriction is acceptable (presentation, processing source, ...)
That's a great proposal addressing a major bottleneck for QGIS Server or when working with massive databases. This is highly relevant to us, and I will be happy to test this in depth.
storing cachable metadata in QGIS project files - it feels like a can of worms :-) Not just the fact that cached data may become invalid over time, but also different versions of QGIS may need different versions of cached data as data providers evolve, and it could be a nightmare to keep compatibility between versions.
I was involved in the "trust project" option story, and I feel we are already in this can of worms. From what I understand, part of the layer metadata is already cached in the qgs file, i.e. layer extent, primary keys, row count estimate, etc. The misnamed "trust" option originally did this. But it seems we have had a lot of regressions - or hidden changes - over the years.
@m-kuhn could you specify what is wrong or missing in the current design? To me: we lack unit tests, the UI sucks, and we would need a per-layer option. All the best!
Thanks for the feedback and questions @haubourg
Off the top of my head, I see these limitations; I tried to separate them, but many items are interconnected:
Given that, here's what I think we should do and where this proposal is aimed:
Having worked on the trust option of the QGIS project, I thought of the QGIS project XML as a way to store provider metadata.
When the trust option is false, a lot of the data stored in the project XML is not used to build layers and providers. Extent, primary key field, geometry type, geometry field and geometry SRID are read from the project XML only if the trust option is true. Others, like field details, are never read from the project XML and are always checked by the provider at construction.
Thanks @m-kuhn for proposing a way to speed up QGIS loading times by caching provider metadata.
Thank you very much for this proposal, this looks like the good approach to tackle this issue.
Regarding the further considerations, I would be in favor of storing metadata in the project.
It could solve the QGIS Server first-GetCapabilities issue, which takes too long because of project loading (and loading in a separate thread won't change anything here). IMHO many (most?) users don't change metadata every day, so it makes sense to have an option "Don't reload metadata every time we open the project".
However it is unclear to me how we can control cache invalidation or how we should advise users to enable or disable such behavior. Enabling this when it's not appropriate can lead to unpleasant behavior. Giving this option to users can create the impression of a quick win, at the risk of usage without a complete understanding of the impact.
We could have a project option "Don't reload metadata every time we open the project" vs "Reload metadata every time we open the project" (which could default to the second if we want to preserve the current behavior). This option could be inherited by layers, and once you change it at project level, QGIS could ask "Do you want to apply these settings to all existing layers?". We could also add a "Reload metadata for this layer" action to the context menu in the browser.
And if we describe thoroughly what "metadata" means, I think users can fully understand the impact.
the coupling of "metadata" and "qgis project file" is IMHO not ideal: "user configuration" and "system information cache" are married in one location.
Could we not write that information in a separate file inside the qgz file, like it's done for auxiliary data?
Do you have a timeline for implementing this? Is it already funded? We also have clients interested in speeding up project loading, so we might help/participate on this if needed.
Thanks for the feedback @troopa81
I am still waiting to hear more about @wonder-sk's plans before moving this on; my plan is to start looking for interested parties mid-March if nothing moves before then. Also thanks for offering help, I'll get in touch with you in that case.
Regarding metadata baked into the project, I would prefer implementing that in a plugin. In the case of server, it's even more likely that projects have been produced with a different QGIS version on desktop, and this could open a can of worms. By delegating this to a plugin we can protect core QGIS from bogus issues.
Hi @m-kuhn, following the addition of the grant-2023 tag, could you please give an update on this project and your expectations for outside contributions?
The plan is generally unchanged. There is already some related work in the pipeline, as announced by @wonder-sk; many of the points in this proposal stay valid and are not affected by the ongoing work.
There will be room for outside collaboration, specifically on the implementation for various providers; some of this has already been discussed (off list) with @troopa81. These additions are best done by people with the required infrastructure (oracle, arcgis services, hana db, ...).
For what it's worth, this is the work on parallel loading of projects we have done: https://github.com/qgis/QGIS/pull/53069 A real-world big project with 300+ layers (remote COGs + PostGIS) goes from ~10 min to ~3 min project load time. There's definitely still room for improvement, but it's much faster. Especially the postgres provider looks like it could do fewer server requests (or at least batch them).
@m-kuhn what is still not completely clear to me is how a common QGIS user would be affected by this QEP - would there be some buttons to enable/disable caching of provider metadata? (A flag on project level?) And is the grant proposal aiming to only add the infrastructure or also implement caching for some data providers?
The deduplication of postgres provider queries is one of the main goals here (things like querying the pg version etc). If things go as planned (= stable), there should be no buttons involved for that part. I assume that by itself will be another big speed boost.
The grant proposal's original aim was to also do multithreaded loading (with a different approach than https://github.com/qgis/QGIS/pull/53069 as discussed earlier in here; I didn't actually expect this pull request to land, as I was still waiting for some more detailed specs as mentioned in https://github.com/qgis/QGIS-Enhancement-Proposals/issues/261#issuecomment-1373922106). But as it currently stands, it looks like we can actually go further and already build infrastructure for external persistent caching which can be leveraged by plugins.
I didn't actually expect this pull request to land, as I was still waiting for some more detailed specs as mentioned in https://github.com/qgis/QGIS-Enhancement-Proposals/issues/261#issuecomment-1373922106
Yeah sorry about that - I thought I would do a QEP once we fleshed out details, but it turned out there were no significant architectural changes required, so we went ahead with PR directly...
It remains to be seen if part of the work of https://github.com/qgis/QGIS/pull/53069 needs to be refactored, given the architectural limitations mentioned earlier in this discussion.
Hi @m-kuhn. I'm working on wrapping up the 2023 grant programme. How is the progress on this project? Can you give me an estimate for the completion and final report? As you probably know, I'm collecting all final reports and summarizing them so everyone can read up on the results of the programme.
Hi @anitagraser, no coding has been done on this so far. This will be done for QGIS 3.38.
Note: this was only accepted as a grant late in the year
Query: Will this work address the issue of slow loading for PostGIS views? Currently, non-materialised views are slow to add or load into a project, even with estimated metadata enabled, as the extent is calculated during the initial load of the layer.
Cachable provider metadata API -- or how to ⚡ the QGIS loading times
Date: 2023/01/01
Author: Matthias Kuhn (@m-kuhn)
Contact: matthias@opengis.ch
Maintainer: @m-kuhn
Version: QGIS 3.X
Summary
Problem statement
When QGIS loads a layer it does a lot of initialisation, in particular also reading information from the data storage. Among other things:
Some information (mostly data dependent) can be loaded deferred; e.g. the extent or feature count will be lazily initialized and can be manually invalidated.
Performance
The most prominent cost of this is a lot of expensive I/O during layer load, i.e. a lot of SQL queries or file accesses happen when creating a QGIS layer. This is done in a synchronous way and cannot currently be parallelized. Especially project load can take a lot of time (we've all been there, waiting for the progress bar to reach the right side ...)
Assumptions
The underlying assumption of this behavior is that what the data provider reports during initialisation will be valid during the lifetime of the layer. This is a good working assumption, but it might not always be true. We cannot prevent a sysadmin from revoking write privileges while a layer is loaded, another QGIS instance from changing the table structure, or a file from being replaced on disk.
There are even circumstances where the provider itself cannot store the required knowledge. An example of this are GeoJSON files, in which the geometry type is inferred from the data. If the GeoJSON file does not contain any data at project loading time, it will be reported as tabular data only, and any styling done in QGIS will be lost on subsequent project saving.
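To make the GeoJSON case concrete, here is a simplified stand-in for the type inference (the real logic lives in the provider/GDAL; this only illustrates why an empty file yields no geometry type):

```python
import json

def inferred_geometry_type(collection):
    """Simplified stand-in for geometry type inference from GeoJSON data."""
    types = {
        feature["geometry"]["type"]
        for feature in collection["features"]
        if feature.get("geometry")
    }
    # With no (or mixed) geometries there is nothing to infer:
    # the layer would be reported as tabular only.
    return types.pop() if len(types) == 1 else None

empty = json.loads('{"type": "FeatureCollection", "features": []}')
assert inferred_geometry_type(empty) is None  # reported as tabular data only

point = json.loads(
    '{"type": "FeatureCollection", "features": '
    '[{"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}, '
    '"properties": {}}]}'
)
assert inferred_geometry_type(point) == "Point"
```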
Especially in a service context we could make use of this. In long-running processes like QGIS Server we can invalidate and update caches with additional knowledge that lives in application logic. In very short-running processes (think: event-driven/lambda) we can benefit from improved bootstrap performance.
Proposed Solution
The lifetime and content of cachable metadata of QgsDataProvider should be made more accessible through API. The API should be extended as follows:
Example usage
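A hypothetical sketch of how such usage could look, in plain Python. The names (a skipInit-style flag, isInitialized(), injecting cached metadata) follow this QEP's text; nothing here is actual QGIS API.

```python
# Hypothetical sketch of the proposed provider lifecycle; not QGIS API.

class SketchProvider:
    def __init__(self, uri, skip_init=False):
        self.uri = uri
        self._metadata = None
        if not skip_init:
            # Current behavior: expensive I/O happens in the constructor.
            self._metadata = self._load_metadata()

    def _load_metadata(self):
        # Placeholder for the expensive queries (extent, fields, srid, ...).
        return {"extent": (0, 0, 1, 1), "geometry_type": "Point"}

    def is_initialized(self):
        # Mirrors the proposed isInitialized(): false until metadata is present.
        return self._metadata is not None

    def set_cached_metadata(self, metadata):
        # Metadata injected from a background task or an external cache.
        self._metadata = metadata

# Lazy path: cheap construction now, initialization later (e.g. from a task).
provider = SketchProvider("dbname=... table=roads", skip_init=True)
assert not provider.is_initialized()
provider.set_cached_metadata({"extent": (0, 0, 1, 1), "geometry_type": "Point"})
assert provider.is_initialized()
```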
QGIS project property "lazy load" (optional)
A new project property "Lazy initialize layers" will be added. If enabled, the project will skip initializing all layers at project load time and instead schedule tasks for this to parallelize loading.
GUI implications
A new layer indicator that depicts layers that have not been initialized is added. By clicking on it, layer.reloadCachableMetadata() is triggered.

Performance Implications
Can result in a massive speedup of loading times
Further Considerations/Improvements
Incompatible metadata
There is a risk that incompatible metadata is fed into a provider. However, the same behavior can already be "achieved" nowadays by changing the datasource while QGIS is running, as explained in the introduction. Therefore, QGIS needs to be able to deal with such situations in any case, and the related maintenance effort will also help QGIS in other scenarios.
Incomplete metadata
If incomplete metadata is fed into the provider (i.e. missing cache keys or invalid options), the provider shall continue to work if a cache key is not strictly required / has safe default values, or report this to the user and continue to return false from isInitialized(). Providers which have not (yet) been ported to this API can continue to initialize their data as part of the constructor and immediately return true for isInitialized().

Storing cachable meta data in the QGIS project file
An evident addition would be the storage of cachable metadata in the QGIS project file. However it is unclear to me how we can control cache invalidation or how we should advise users to enable or disable such behavior. Enabling this when it's not appropriate can lead to unpleasant behavior. Giving this option to users can create the impression of a quick win, at the risk of usage without a complete understanding of the impact. As such, I am unsure whether we should add this to QGIS core or whether this is better suited as a plugin, at least in the beginning. Open for discussion.
isValid vs. isInitialized
This difference is best explained with an appropriate representation in the layer tree. Users have gotten used to seeing layers flagged as invalid in the layer tree. The initial state for lazily loaded layers should be depicted with a different icon. Therefore a layer is uninitialized as long as it has not even tried to connect its data provider. Once it has tried and failed, its state shall be changed to invalid. Alternative proposal: we could have an enum State instead of multiple boolean flags.

GUI implications for lazy loading
When lazy loading layers, the "handle bad layers" dialog will only show up when all layers have been "tried".
Authentication (or other) information may be needed when connecting to a provider. QgsCredentials takes care of multithreading. In fact, it will even be more pleasurable to use than currently (compare the strange cursor behavior when entering a postgres password). If providers are not prepared for that, some ground work will need to be done for full compatibility.

Backwards Compatibility
Fully backwards compatible. A phased implementation is possible (and likely) where some providers will continue to ignore the skipInit flag and always initialize in the constructor.

Votes
(required)