m-kuhn opened 1 year ago
Hi @m-kuhn this is a very interesting proposal!
By coincidence, we got a related piece of work funded a few days ago: to make project loading parallel - for a client that uses project files with hundreds of map layers (mainly postgis and remote COGs). The basic idea is to first create the data providers in parallel in a thread, and then back on the main thread create map layer objects. I will try to prepare a more detailed spec soon when I get to it.
I have considered also an approach similar to yours, but decided not to go ahead because each provider would need special treatment and with raster layers using remote COGs we don't have much control as all the caching is done by GDAL.
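The approach described above can be sketched in plain Python (the provider and layer objects below are stand-ins, not the actual QGIS classes): open the providers concurrently in worker threads, then assemble the layer objects back on the main thread.

```python
# Sketch of the described approach, using plain-Python stand-ins
# for providers and layers (not actual QGIS API).
from concurrent.futures import ThreadPoolExecutor

def open_provider(uri):
    # Placeholder for the expensive part: network/file I/O, SQL queries, GDALOpen, ...
    return {"uri": uri, "metadata": {"extent": None}}

uris = ["postgres://.../roads", "/vsicurl/https://example.com/dem.tif"]

# Providers are created in parallel in worker threads ...
with ThreadPoolExecutor(max_workers=8) as pool:
    providers = list(pool.map(open_provider, uris))

# ... and back on the main thread the map layer objects are created.
layers = [{"name": p["uri"], "provider": p} for p in providers]
```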
Regarding your proposal:
Thanks for the feedback @wonder-sk
The basic idea is to first create the data providers in parallel in a thread, and then back on the main thread create map layer objects. I will try to prepare a more detailed spec soon when I get to it. I have considered also an approach similar to yours, but decided not to go ahead because each provider would need special treatment and with raster layers using remote COGs we don't have much control as all the caching is done by GDAL.
I understand that this is tempting when thinking about project files and desktop usage. For other scenarios (standalone apps, web services) it would be nice to have more control over the cache location. Something else is shared initialization (postgres version, optimizing queries to "where table name in (..., ...)" instead of "where table name = ..."), which is probably less clean to implement this way. Last but not least, the parallelization gain is limited by the thread pool size of the postgres, ogr etc. providers; that's where an external cache location could really shine. In short: I'd rather go for a future-proof API than a quick win here.
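The query-batching idea above can be sketched like this (the table names and the query against the PostGIS `geometry_columns` view are illustrative only; a real provider would use its own SQL):

```python
# Sketch: issue one batched metadata query instead of one query per table.
# The query text is illustrative; real providers would use provider-specific SQL.

def build_batched_query(table_names):
    """Build one "WHERE ... IN (...)" query instead of N "WHERE ... =" lookups."""
    placeholders = ", ".join("%s" for _ in table_names)
    sql = (
        "SELECT f_table_name, srid, type FROM geometry_columns "
        f"WHERE f_table_name IN ({placeholders})"
    )
    return sql, list(table_names)

sql, params = build_batched_query(["roads", "rivers", "parcels"])
# One server round trip for three layers instead of three round trips.
```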
~In this proposal in the case of COG, loading a project with lazy loading activated (QGIS desktop, the case of your client), it will call GDALOpen in a thread inside QgsProviderMetaDataLoaderGDALTask::cachableMetaData()
and GDAL will initialize its metadata there.~ Having GDAL metadata cachable in external caches will require work with upstream.
backwards compatibility - isn't this going to have a much larger impact?
- for core/gui - won't we need to deal with the case of some layers being still initialized in many places? (e.g. if I try to identify features or run some processing algs, and the needed layers are not initialized yet - do we wait or fail?)
This will be treated the same as an invalid data provider, a concept that already exists. Most things will simply fail. A wait concept can be added later on.
- for plugins - if I understand correctly, in the proposed solution, QgsProject will emit a signal that project loading is complete, yet even "valid" layers will still be getting initialized, so the plugins would need to modify the code when expecting some layers to be present?
If lazy loading is enabled, layers will initially be invalid and only become valid over time. Plugins cannot trust for all layers to be fully initialized when the project is loaded, correct. A signal could be added that is emitted when all layer loading tasks initiated by opening a QGIS project are finished.
If lazy loading is not enabled, projects will continue to initialize the way they currently do.
Correction: GDAL will require some work upstream to be compatible with external metadata. In any case, the call to GDALOpen will still need to happen somewhere. I'm wondering if this should rather be an "init on first use" approach similar to postgres/ogr with a connection pool.
GDAL will require some work upstream to be compatible with external metadata.
GDAL itself, or the QGIS GDAL & OGR providers? I'm not sure how that could be done in GDAL itself without changing each of its 250 drivers. Or maybe by using some GDAL dataset proxying mechanism, which exists at least on the raster side for the VRT driver: as a VRT file can reference thousands of source rasters, it doesn't reference them directly, but through a proxy dataset that opens the underlying dataset on demand.
I was thinking that in GDAL itself it should be possible to add this incrementally for specific drivers, while others would continue to always init on open (similar to this proposal here). Your idea to use a proxy dataset via VRT seems quite interesting as well. With that it would even be possible to avoid any HTTP calls already now -- purely by changing the local configuration, without any code changes on the QGIS side -- if I understand correctly? As for OGR, I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL vs. by QGIS and how much optimization is needed on this front.
This QEP focuses on the part of the code where QGIS itself does a lot of (network) I/O at loading time, and provides APIs to group requests, get them off the main thread, and cache where applicable.
I was thinking that in GDAL itself it should be possible to add this incrementally for specific drivers, while others will continue to always init on open
My intuition / gut feeling is that we don't want to mess in GDAL drivers themselves. Too much complication & work (even if done incrementally). Drivers should not have to deal with that kind of higher level optimization. A number of GDAL drivers already implement deferred loading of metadata when that's technically possible, that is, for information that isn't crucial to even establish that the dataset is valid, and where the information is obtained through a virtual method (like GDALDataset::GetSpatialRef()) and not a non-virtual method (like GDALDataset::GetXSize()).
Your idea to use a proxy dataset via VRT seems quite interesting as well. With that it would even be possible to avoid any HTTP calls already now
yes, try
gdal_translate /vsicurl/http://download.osgeo.org/gdal/data/gtiff/utm.tif out.vrt
CPL_VSIL_SHOW_NETWORK_STATS=YES python -c "from osgeo import gdal; gdal.Open('out.vrt')"
vs
CPL_VSIL_SHOW_NETWORK_STATS=YES python -c "from osgeo import gdal; gdal.Open('/vsicurl/http://download.osgeo.org/gdal/data/gtiff/utm.tif')"
(if you try with gdalinfo you'll see a network access, because GetOverviewCount() is called, and that's forwarded to the source)
As for OGR I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL
For GeoPackage, on the GDAL side itself, that is relatively cheap in "normal" situations where you don't have a big number of layers per dataset. What can be costly is if you have layers with the generic Geometry type, in which case opening will iterate over all features to discover their geometry types (an optimization was done recently to make that about 10x faster, though)
OGR_VRT can also be used in a similar way as the raster side.
That said VRT datasets are mostly read-only, so if you need to edit, the provider would have to open the "real" underlying dataset.
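For illustration, an OGR VRT wrapper that declares the geometry type and SRS up front could look like the following (dataset and layer names here are made up; the element names follow the OGR VRT driver format):

```python
# Sketch: an OGR VRT that declares geometry type and SRS up front, so opening
# it does not require scanning the source features. "data.gpkg" and "mylayer"
# are placeholder names.
vrt = """\
<OGRVRTDataSource>
  <OGRVRTLayer name="mylayer">
    <SrcDataSource>data.gpkg</SrcDataSource>
    <SrcLayer>mylayer</SrcLayer>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>EPSG:4326</LayerSRS>
  </OGRVRTLayer>
</OGRVRTDataSource>
"""
# Opening this VRT with OGR would report wkbPoint without probing the rows of
# data.gpkg -- but, as noted above, access through the VRT is read-only.
```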
I have no immediate plans for bigger refactorings in GDAL
As for OGR I haven't checked how many SQL queries are automatically sent to e.g. a gpkg by GDAL
For GeoPackage, on the GDAL side itself, that is relatively cheap in "normal" situations where you don't have a big number of layers per dataset. What can be costly is if you have layers with the generic Geometry type, in which case opening will iterate over all features to discover their geometry types (an optimization was done recently to make that about 10x faster, though)
This sounds like the GeoJSON example in the QEP description. I must admit I just assumed it's QGIS doing this inspection and not GDAL, and did not check. Being able to inject this kind of knowledge into the dataset construction process would be nice.
OGR_VRT can also be used in a similar way as the raster side.
That said VRT datasets are mostly read-only, so if you need to edit, the provider would have to open the "real" underlying dataset.
That's interesting as well, and for many cases the read-only restriction is acceptable (presentation, processing source, ...)
That's a great proposal addressing a major bottleneck for QGIS Server or when working with massive databases. This is highly relevant to us, and I will be happy to test this in depth.
storing cachable metadata in QGIS project files - it feels like a can of worms :-) Not just the fact that cached data may become invalid over time, but also different versions of QGIS may need different versions of cached data as data providers evolve, and it could be a nightmare to keep compatibility between versions.
I was involved in the "trust project" option story, and I feel we are already in this can of worms. From what I understand, part of the layer metadata is already cached in the qgs file, i.e. layer extent, primary keys, row count estimate, etc. The misnamed "trust" option originally did this. But it seems we have had a lot of regressions - or hidden changes - over the years.
@m-kuhn could you specify what is wrong or missing in the current design? To me: we lack unit tests, the UI sucks, and we would need a per-layer option. All the best!
Thanks for the feedback and questions @haubourg
Off the top of my head, I see these limitations; I tried to separate them, but many items are interconnected:
Given that, here's what I think we should do and where this proposal is aimed:
Having worked on the trust option of the QGIS project, I thought of the QGIS project XML as a way to store provider metadata.
When the trust option is false, a lot of the data stored in the project XML is not used to build layers and providers. Extent, primary key field, geometry type, geometry field and geometry SRID are read from the project XML only if the trust option is true. Others, like field details, are never read from the project XML and are always checked by the provider at construction.
Thanks @m-kuhn for proposing a way to speed up QGIS loading times by caching provider metadata.
Thank you very much for this proposal, this looks like the good approach to tackle this issue.
Regarding the further considerations, I would be in favor of storing metadata in the project.
It could solve the QGIS Server first-GetCapabilities issue, which takes too long because of project loading (and loading in a separate thread won't change anything here). IMHO many (most?) users don't change metadata every day, so it makes sense to have an option "Don't reload metadata every time we open the project".
However it is unclear to me how we can control cache invalidation or how we should advise users to enable or disable such behavior. Enabling this when it's not appropriate can lead to unpleasant behavior. Giving this option to users can create the impression of a quick win, at the risk of usage without a complete understanding of the impact.
We could have a project option "Don't reload metadata every time we open the project" vs "Reload metadata every time we open the project" (which could default to the second if we want to preserve the current behavior). This option could be inherited by layers, and once you change it at project level, QGIS could ask "Do you want to apply these settings to all existing layers?". We could also add a "Reload metadata for this layer" action to the context menu in the browser.
And if we describe thoroughly what "metadata" means, I think users can fully understand the impact.
the coupling of "metadata" and "qgis project file" is IMHO not ideal: "user configuration" and "system information cache" are married in one location.
Could we not write that information in a separate file inside the qgz file, like it's done for auxiliary data?
Do you have a timeline for implementing this? Is it already funded? We also have clients interested in speeding up project loading, so we might help/participate on this if needed.
Thanks for the feedback @troopa81
I am still waiting to hear more about @wonder-sk's plans before moving this on; my plan is to start looking for interested parties mid-March if nothing moves before then. Also thanks for offering help, I'll get in touch with you in that case.
Regarding metadata baked into the project, I would prefer implementing that in a plugin. In the case of server, it's even more likely that projects have been produced with a different QGIS version on desktop, and this could open a can of worms. By delegating this to a plugin we can protect core QGIS from bogus issues.
Hi @m-kuhn, following the addition of the grant-2023 tag, could you please give an update on this project and your expectations for outside contributions?
The plan is generally unchanged. There is already some related work in the pipeline, as announced by @wonder-sk; many of the points in this proposal stay valid and are not affected by the ongoing work.
There will be room for outside collaboration, specifically on the implementation for various providers; some of this has already been discussed (off list) with @troopa81. These additions are best done by people with the required infrastructure (oracle, arcgis services, hana db, ...).
For what it's worth, this is the work on parallel loading of projects we have done: https://github.com/qgis/QGIS/pull/53069 A real-world big project with 300+ layers (remote COGs + PostGIS) goes from ~10 min to ~3 min project load time. There's definitely still room for improvement, but it's much faster. Especially the postgres provider looks like it could do fewer server requests (or at least batch them).
@m-kuhn what is still not completely clear to me is how a common QGIS user would be affected by this QEP - would there be some buttons to enable/disable caching of provider metadata? (A flag on project level?) And is the grant proposal aiming to only add the infrastructure or also implement caching for some data providers?
The deduplication of postgres provider queries is one of the main goals here (things like querying the pg version etc). If things go as planned (= stable), there should be no buttons involved for that part. I assume that by itself will be another big speed boost.
The grant proposal's original aim was to also do multithreaded loading (with a different approach than https://github.com/qgis/QGIS/pull/53069 as discussed earlier in here; I didn't actually expect this pull request to land, as I was still waiting for some more detailed specs as mentioned in https://github.com/qgis/QGIS-Enhancement-Proposals/issues/261#issuecomment-1373922106). But as it currently stands, it looks like we can actually go further and already build infrastructure for external persistent caching which can be leveraged by plugins.
I didn't actually expect this pull request to land, as I was still waiting for some more detailed specs as mentioned in https://github.com/qgis/QGIS-Enhancement-Proposals/issues/261#issuecomment-1373922106
Yeah sorry about that - I thought I would do a QEP once we fleshed out details, but it turned out there were no significant architectural changes required, so we went ahead with PR directly...
It remains to be seen if part of the work of https://github.com/qgis/QGIS/pull/53069 needs to be refactored, given the architectural limitations mentioned earlier in this discussion.
Hi @m-kuhn. I'm working on wrapping up the 2023 grant programme. How is the progress on this project? Can you give me an estimate for the completion and final report? As you probably know, I'm collecting all final reports and summarizing them so everyone can read up on the results of the programme.
Hi @anitagraser, no coding has been done on this so far. This will be done for QGIS 3.38.
Note: this was only accepted as a grant late in the year
Query: Will this work address the issue of slow loading for PostGIS views? Currently, non-materialised views are slow to add or load into a project, even with estimated metadata enabled, as the extent is calculated during the initial load of the layer.
Cachable provider metadata API -- or how to ⚡ the QGIS loading times
Date: 2023/01/01
Author: Matthias Kuhn (@m-kuhn)
Contact: matthias@opengis.ch
Maintainer: @m-kuhn
Version: QGIS 3.X
Summary
Problem statement
When QGIS loads a layer it does a lot of initialisation, in particular also reading information from the data storage. Among other things:
Some information (mostly data dependent) can be loaded deferred; e.g. the extent or feature count will be lazily initialized and can be manually invalidated.
Performance
The most prominent cost of this is a lot of expensive I/O during layer load, i.e. a lot of SQL queries or file accesses happen when creating a QGIS layer. This is done in a synchronous way and cannot currently be parallelized. Especially project load can take a lot of time (we've all been there, waiting for the progress bar to reach the right side ...)
Assumptions
The underlying assumption of this behavior is that what the data provider reports during initialisation will be valid during the lifetime of the layer. This is a good working assumption, but it might not always be true. We cannot prevent a sysadmin from revoking write privileges while a layer is loaded, another QGIS instance from changing the table structure, or a file from being replaced on disk.
There are even circumstances where the provider itself cannot store the required knowledge. An example of this are GeoJSON files, in which the geometry type is inferred from the data. If the GeoJSON file does not contain any data at project loading time, it will be reported as tabular data only, and any styling done in QGIS will be lost on subsequent project saving.
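To make the GeoJSON case concrete, here is a simplified stand-in for the type inference (the real logic lives in the provider/GDAL; this only illustrates why an empty file yields no geometry type):

```python
import json

def inferred_geometry_type(collection):
    """Simplified stand-in for geometry type inference from GeoJSON data."""
    types = {
        feature["geometry"]["type"]
        for feature in collection["features"]
        if feature.get("geometry")
    }
    # With no (or mixed) geometries there is nothing to infer:
    # the layer would be reported as tabular only.
    return types.pop() if len(types) == 1 else None

empty = json.loads('{"type": "FeatureCollection", "features": []}')
assert inferred_geometry_type(empty) is None  # reported as tabular data only

point = json.loads(
    '{"type": "FeatureCollection", "features": '
    '[{"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}, '
    '"properties": {}}]}'
)
assert inferred_geometry_type(point) == "Point"
```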
Especially in a service context we could make use of this. In long-running processes like QGIS Server we can invalidate and update caches with additional knowledge that lives in application logic. In very short-running processes (think: event-driven/lambda) we can benefit from improved bootstrap performance.
Proposed Solution
The lifetime and content of cachable metadata of QgsDataProvider should be made more accessible through API. The API should be extended as follows:
Example usage
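A hypothetical sketch of how such usage could look, in plain Python. The names (a skipInit-style flag, isInitialized(), injecting cached metadata) follow this QEP's text; nothing here is actual QGIS API.

```python
# Hypothetical sketch of the proposed provider lifecycle; not QGIS API.

class SketchProvider:
    def __init__(self, uri, skip_init=False):
        self.uri = uri
        self._metadata = None
        if not skip_init:
            # Current behavior: expensive I/O happens in the constructor.
            self._metadata = self._load_metadata()

    def _load_metadata(self):
        # Placeholder for the expensive queries (extent, fields, srid, ...).
        return {"extent": (0, 0, 1, 1), "geometry_type": "Point"}

    def is_initialized(self):
        # Mirrors the proposed isInitialized(): false until metadata is present.
        return self._metadata is not None

    def set_cached_metadata(self, metadata):
        # Metadata injected from a background task or an external cache.
        self._metadata = metadata

# Lazy path: cheap construction now, initialization later (e.g. from a task).
provider = SketchProvider("dbname=... table=roads", skip_init=True)
assert not provider.is_initialized()
provider.set_cached_metadata({"extent": (0, 0, 1, 1), "geometry_type": "Point"})
assert provider.is_initialized()
```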
QGIS project property "lazy load" (optional)
A new project property "Lazy initialize layers" will be added. If enabled, the project will skip initializing all layers at project load time and instead schedule tasks for this to parallelize loading.
GUI implications
A new layer indicator that depicts layers that have not been initialized is added. By clicking on it, layer.reloadCachableMetadata() is triggered.

Performance Implications
Can result in a massive speedup of loading times
Further Considerations/Improvements
Incompatible metadata
There is a risk that incompatible metadata is fed into a provider. However, the same behavior can already be "achieved" nowadays by changing the datasource while QGIS is running, as explained in the introduction. Therefore, QGIS needs to be able to deal with such situations in any case, and the related maintenance effort will also help QGIS in other scenarios.
Incomplete metadata
If incomplete metadata is fed into the provider (i.e. missing cache keys or invalid options), the provider shall continue to work if a cache key is not strictly required / has safe default values, or report this to the user and continue to return false from isInitialized(). Providers which have not (yet) been ported to this API can continue to initialize their data as part of the constructor and immediately return true for isInitialized().

Storing cachable meta data in the QGIS project file
An evident addition would be the storage of cachable metadata in the QGIS project file. However it is unclear to me how we can control cache invalidation or how we should advise users to enable or disable such behavior. Enabling this when it's not appropriate can lead to unpleasant behavior. Giving this option to users can create the impression of a quick win, at the risk of usage without a complete understanding of the impact. As such, I am unsure whether we should add this to QGIS core or whether this is better suited as a plugin, at least in the beginning. Open for discussion.
isValid vs. isInitialized
This difference is best explained with an appropriate representation in the layer tree. Users have gotten used to seeing layers flagged as invalid in the layer tree. The initial state for lazily loaded layers should be depicted with a different icon. Therefore a layer is uninitialized as long as it has not even tried to connect its data provider. Once it has tried and failed, its state shall be changed to invalid. Alternative proposal: we could have an enum State instead of multiple boolean flags.

GUI implications for lazy loading
When lazy loading layers, the "handle bad layers" dialog will only show up when all layers have been "tried".
Authentication (or other) information may be needed when connecting to a provider. QgsCredentials takes care of multithreading. In fact, it will even be more pleasurable to use than currently (compare the strange cursor behavior when entering a postgres password). If providers are not prepared for that, some ground work will need to be done for full compatibility.

Backwards Compatibility
Fully backwards compatible. A phased implementation is possible (and likely) where some providers will continue to ignore the skipInit flag and always initialize in the constructor.

Votes
(required)