qgis / QGIS

QGIS is a free, open source, cross platform (lin/win/mac) geographical information system (GIS)
https://qgis.org
GNU General Public License v2.0
10.59k stars 3k forks source link

Loading a Large VRT with thousands of sources takes forever to load #58601

Open ar-siddiqui opened 2 months ago

ar-siddiqui commented 2 months ago

What is the bug or the crash?

I have a VRT with 70k sources, gdalinfo on it takes 3 4 seconds, when loading the same VRT on QGIS it can take up to 20 minutes (this is the time to load layer in the project, the time to fetch data is separate which is dependent on how zoomed out you are). Once the VRT is loaded, and working zoomed in, it takes a reasonable amount of time to fetch data. It is only at the load of a layer that the QGIS takes unreasonable time.

Steps to reproduce the issue

Unfortunately, my VRT has sources in a private S3 bucket so it can't be shared for reproducibility but here is a public VRT that shows similar behavior.

/vsicurl/https://rockyweb.usgs.gov/vdelivery/Datasets/Staged/Elevation/13/TIFF/USGS_Seamless_DEM_13.vrt

Time to perform gdalinfo:

$ time gdalinfo /vsicurl/https://rockyweb.usgs.gov/vdelivery/Datasets/Staged/Elevation/13/TIFF/USGS_Seamless_DEM_13.vrt
Driver: VRT/Virtual Raster
.....

real    0m1.002s
user    0m0.093s
sys     0m0.029s

Loading the same layer in QGIS takes 30 seconds. Then there are additional second or two to fetch actual data, which is reasonable. (Please work at high zoom level with this data or QGIS will try to fetch data for large areas)

Versions

QGIS 3.38.2 GDAL 3.9.2

Supported QGIS version

New profile

Additional context

No response

rouault commented 2 months ago

Loading the same layer in QGIS takes 30 seconds

yes, because QGIS zooms out on the layer, and this VRT has no external overviews. There is nothing QGIS can do about that. This is pretty much a "defect" on the data producer side. They should run "gdaladdo" on their VRT

ar-siddiqui commented 2 months ago

Loading the same layer in QGIS takes 30 seconds

yes, because QGIS zooms out on the layer, and this VRT has no external overviews. There is nothing QGIS can do about that. This is pretty much a "defect" on the data producer side. They should run "gdaladdo" on their VRT

Thanks for the reply. Is there a reason for QGIS to zoom out on the layer and read whole raster, when you are loading it after zooming in on a small area?

And is that efficient?

jakimowb commented 2 months ago

It timed how long it takes to open the VRT sources using a QgsRasterLayer an a gdal.Dataset. Even with "loadDefaultStyle=False", the creation of the QgsRasterLayer object for a single VRT source takes much more time compared to the QgsRaterLayer object. It looks like the creation of QgsRasterLayer already starts a time-consuming operation.


import datetime
import os.path
import re
from pathlib import Path
from qgis.core import QgsRasterLayer
from osgeo import gdal

path_vrt = 'USGS_Seamless_DEM_13.vrt'
sources = []
rx = re.compile(r'<SourceFilename relativeToVRT="0">(?P<source>.*)</SourceFilename>')
with open(path_vrt) as f:
    for line in f.readlines():
        if match := rx.search(line):
            sources.append( match.group('source'))

n = len(sources)
for i, src in enumerate(sources):
    print(f'Test {i+1}/{n}: {os.path.basename(src)} ', end='')
    t1 = datetime.datetime.now()
    lyr = QgsRasterLayer(src, options=QgsRasterLayer.LayerOptions(loadDefaultStyle=False))
    t2 = datetime.datetime.now()
    ds = gdal.Open(src)
    assert isinstance(ds, gdal.Dataset)
    t3 = datetime.datetime.now()
    print(f'QgsRasterLayer {(t2-t1).total_seconds()}s\tgdal.Dataset {(t3-t2).total_seconds()}s')

Test 1/1432: USGS_13_n06e162.tif QgsRasterLayer 10.141198s gdal.Dataset 0.0s Test 2/1432: USGS_13_n06e163.tif QgsRasterLayer 9.349241s gdal.Dataset 0.00052s Test 3/1432: USGS_13_n07e134.tif QgsRasterLayer 10.528733s gdal.Dataset 0.000998s Test 4/1432: USGS_13_n07e151.tif QgsRasterLayer 10.345912s gdal.Dataset 0.0s Test 5/1432: USGS_13_n07e152.tif QgsRasterLayer 10.822737s gdal.Dataset 0.002001s Test 6/1432: USGS_13_n07e158.tif QgsRasterLayer 10.244739s gdal.Dataset 0.002s Test 7/1432: USGS_13_n08e134.tif QgsRasterLayer 10.787935s gdal.Dataset 0.000995s Test 8/1432: USGS_13_n08e151.tif QgsRasterLayer 10.344938s gdal.Dataset 0.002752s Test 9/1432: USGS_13_n08e152.tif QgsRasterLayer 9.913716s gdal.Dataset 0.0s Test 10/1432: USGS_13_n08e158.tif QgsRasterLayer 9.050872s gdal.Dataset 0.0s Test 11/1432: USGS_13_n09e134.tif QgsRasterLayer 8.897478s gdal.Dataset 0.001545s Test 12/1432: USGS_13_n10e138.tif QgsRasterLayer 9.423416s gdal.Dataset 0.000994s

jakimowb commented 2 months ago

@rouault I breakpointed the loading times down to a GDALOpenEx in qgsgdalproviderbase.cpp , Line 314 (QgsGdalProviderBase::gdalOpen):

grafik

and I can reproduce it in python:

from osgeo import gdal
gdal.UseExceptions()
# long loading VRT
uri1 = r'/vsicurl/https://rockyweb.usgs.gov/vdelivery/Datasets/Staged/Elevation/13/TIFF/USGS_Seamless_DEM_13.vrt'
# same VRT, but downloaded
uri2 = r'USGS_Seamless_DEM_13.vrt' # local VRT
# 1st VRT source
uri3 = r'/vsicurl/https://prd-tnm.s3.amazonaws.com/StagedProducts/Elevation/13/TIFF/current/n39w105/USGS_13_n39w105.tif'
for i, uri in enumerate([uri1, uri2, uri3]):
    t0 = datetime.datetime.now()
    ds = gdal.OpenEx(uri, 0, [], [], [] )
    print(f'uri{i+1}: {datetime.datetime.now() - t0}')

Output:

uri1: 0:05:22.600787 uri2: 0:00:00.046641 uri3: 0:00:00.944662