qgis / QGIS

QGIS is a free, open source, cross platform (lin/win/mac) geographical information system (GIS)
https://qgis.org
GNU General Public License v2.0

Processing algorithm "Join by location" on selected features is way slower than usual #44890

Open tschmetzer opened 3 years ago

tschmetzer commented 3 years ago

What is the bug or the crash?

Running the processing algorithm "Join by location" on selected features of a larger layer is far slower than exporting the selected features into separate layers and running the algorithm on those layers. On the selection, processing takes more than an hour, whereas on the exported layers it takes about 3 seconds. Note that both runs process the same data and produce the same results.

(screenshot: 2021-08-30 12_52_41-Mellilla_join_selection_by_location_ausschnitt)

Steps to reproduce the issue

  1. Load layers
     a. Pop_Density_per_sqkm_Mollweide_Points_groesser0.sqlite (indexed, 3.5 GB unzipped, 25,965,154 point features) https://drive.google.com/file/d/1AJffraneCiLTq9oF041M3rQADUzdLG6x/view?usp=sharing (550 MB)
     b. GHS_SMOD_groesser0_raster2points_ohneNoData.gpkg (indexed, 13.6 GB unzipped, 145,665,553 point features) https://drive.google.com/file/d/1TP5EwCew2K2ecOJHSqPfKgAK_i7rs6sO/view?usp=sharing (450 MB)

  2. Graphically select any ~1,600 point features from layer Pop_Density_per_sqkm_Mollweide_Points_groesser0 and ~6,000 point features from GHS_SMOD_groesser0_raster2points_ohneNoData (screenshots: Mellilla_selection_Pop, Mellilla_selection_SMOD)

  3. Save the selected features of each layer for later comparison. I called them Melilla because I selected data around the Spanish town of Melilla in North Africa (screenshot: smod_save_selected)

  4. Run the "Join by location" processing algorithm on the selection of the original layers Pop_Density_per_sqkm_Mollweide_Points_groesser0 and GHS_SMOD_groesser0_raster2points_ohneNoData, configured as in the screenshot Mellilla_join_selection_by_location

  5. Observe that it takes a long time. There is no need to wait until it finishes to note the difference; it can be cancelled after a few minutes, once the progress bar is seen to be moving forward (screenshot: Mellilla_join_selection_by_location_forever)

  6. Compare with the execution time on the previously saved layers of the selection (screenshot: Mellilla_layers_join_by_location)

(screenshot: Mellilla_layers_join_by_location_results)

Versions

QGIS version: 3.20.2-Odense
QGIS code revision: 9f59a156
Qt version: 5.15.2
Python version: 3.9.5
GDAL/OGR version: 3.3.1
PROJ version: 8.1.0
EPSG Registry database version: v10.027 (2021-06-17)
GEOS version: 3.9.1-CAPI-1.14.2
SQLite version: 3.35.2
PDAL version: 2.3.0
PostgreSQL client version: 13.0
SpatiaLite version: 5.0.1
QWT version: 6.1.3
QScintilla2 version: 2.11.5
OS version: Windows 10 Version 2009

Active Python plugins: GroupStats, QuickOSM, db_manager, MetaSearch, processing

tschmetzer commented 3 years ago

Could it be the case that processing on selections doesn't make use of the index?
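To make the question concrete, here is a minimal sketch in plain Python (not QGIS code) of why bypassing a spatial index would hurt. It assumes a hypothetical uniform grid index (`build_grid_index`) and compares an indexed lookup with a scan that walks every selected feature for each query. Both return the same matches; only the number of features examined differs. Whether QGIS actually behaves like the scan when a selection is active is exactly the open question.

```python
from collections import defaultdict

CELL = 10.0  # grid cell size, arbitrary for this illustration


def build_grid_index(points):
    """Map each grid cell to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for fid, (x, y) in points.items():
        index[(int(x // CELL), int(y // CELL))].append(fid)
    return index


def query_indexed(index, points, bbox, selected):
    """Visit only the grid cells overlapping bbox, then apply the selection."""
    xmin, ymin, xmax, ymax = bbox
    hits, examined = [], 0
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            for fid in index.get((cx, cy), []):
                examined += 1
                x, y = points[fid]
                if fid in selected and xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(fid)
    return sorted(hits), examined


def query_scan(points, bbox, selected):
    """Walk every selected feature and test it against bbox (no index)."""
    xmin, ymin, xmax, ymax = bbox
    hits, examined = [], 0
    for fid in selected:
        examined += 1
        x, y = points[fid]
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits.append(fid)
    return sorted(hits), examined


# 100 x 100 regular lattice of points, all of them "selected".
points = {i * 100 + j: (float(i), float(j)) for i in range(100) for j in range(100)}
selected = set(points)
index = build_grid_index(points)
bbox = (12.0, 12.0, 14.0, 14.0)

indexed_hits, indexed_examined = query_indexed(index, points, bbox, selected)
scan_hits, scan_examined = query_scan(points, bbox, selected)
```

If one such query is issued per feature of the other layer, the cost of the scan variant grows with the product of both feature counts, while the indexed variant only touches features near each query.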

gioman commented 3 years ago

> On the selection the processing requires more than an hour

@tschmetzer this is how long it took here on master/linux

Execution completed in 136.11 seconds (2 minutes 16 seconds)

uclaros commented 3 years ago

I've been facing the same issue but never took the time to document it. Performance is hugely affected when a selection is used for the base layer, while a selection on the join layer makes no noticeable difference for me. A more obvious way to test is to run the tool on a dataset as a reference, then select all base features, tick "Selected features only" and rerun. Execution time should be the same, but it increases greatly depending on the base layer size.

tschmetzer commented 3 years ago

> A more obvious way to test is to run the tool on a dataset as a reference, then select all base features, tick "Selected features only" and rerun. Execution time should be the same, but it increases greatly depending on the base layer size.

@uclaros Do you mean using the same layer as both the base layer and the join layer? If not, can you post a screenshot for reproduction?

tschmetzer commented 3 years ago

> > On the selection the processing requires more than an hour
>
> @tschmetzer this is how long it took here on master/linux
>
> Execution completed in 136.11 seconds (2 minutes 16 seconds)

That's a different order of magnitude, and it may vary with the setup (processor, RAM, storage connection, etc.). In my case I was using network storage. I have now tested with the data on a local hard drive and I get:

Execution completed in 361.71 seconds (6 minutes 2 seconds)

I suggest focusing on the difference between runs with and without a selection, since differences in hardware configuration alone can lead to significant differences in execution time. @gioman Have you compared processing the selected data with processing an export of the selected data? The latter was far faster in my case.
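A hardware-independent comparison can be expressed as a slowdown factor rather than absolute seconds. The sketch below is plain Python; `timed` and `slowdown` are hypothetical helper names, and in QGIS the callable passed to `timed` would wrap the "Join by location" run, once on the selection and once on the exported layers. The reference of about 3 seconds is taken from the original report and is an assumption here, since it may have been measured on network storage.

```python
import time


def timed(run):
    """Return the wall-clock seconds taken by calling run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def slowdown(reference_seconds, candidate_seconds):
    """How many times slower the candidate run is than the reference run."""
    return candidate_seconds / reference_seconds


# Figures reported in this thread: roughly 3 s on the exported layers
# (assumed reference) versus 361.71 s on the selection (local hard drive).
factor = slowdown(3.0, 361.71)
print(f"selection run is roughly {factor:.0f}x slower")
```

Two machines with very different absolute timings can then still be compared by this factor.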

uclaros commented 3 years ago

The next three runs should have equal execution time:

image Execution completed in 1.92 seconds

image Execution completed in 342.30 seconds (5 minutes 42 seconds)

image Execution completed in 2.06 seconds

The next three runs should have equal execution time:

image Execution completed in 2.85 seconds

image Execution completed in 3.00 seconds

image Way too long, didn't wait!

The next three runs should have equal execution time:

image Execution completed in 4.75 seconds

image Way too long, didn't wait!

image Execution completed in 4.93 seconds

gioman commented 3 years ago

> Execution completed in 342.30 seconds (5 minutes 42 seconds)

this is kind of puzzling, as for the other cases the common thing seems to be the selection on the polygon layer.

tschmetzer commented 3 years ago

> > Execution completed in 342.30 seconds (5 minutes 42 seconds)
>
> this is kind of puzzling, as for the other cases the common thing seems to be the selection on the polygon layer.

Indeed a thrilling case.

@gioman What further feedback is required? @uclaros Thanks for the depiction! Can you attach your sample data for the layers POINT and POLYGON so that the issue can easily be reproduced and debugged?

gioman commented 3 years ago

> @uclaros Thanks for the depiction! Can you attach your sample data for the layers POINT and POLYGON so that the issue can easily be reproduced and debugged?

@tschmetzer I guess it is not really about a specific dataset.

tschmetzer commented 3 years ago

> > @uclaros Thanks for the depiction! Can you attach your sample data for the layers POINT and POLYGON so that the issue can easily be reproduced and debugged?
>
> @tschmetzer I guess it is not really about a specific dataset.

I know. I just thought this would make reproduction and debugging easier :) Never mind.

uclaros commented 3 years ago

> @tschmetzer I guess it is not really about a specific dataset.

Indeed. Just run the "Random points in extent" algorithm. The more points you generate, the more apparent the issue.
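For experiments outside QGIS, test sets of growing size can be sketched in plain Python. This is only a stand-in for the algorithm mentioned above, not its implementation; the function name, the extent and the sizes are arbitrary assumptions.

```python
import random


def random_points_in_extent(count, extent, seed=0):
    """Return `count` (x, y) tuples drawn uniformly inside extent."""
    xmin, ymin, xmax, ymax = extent
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(count)]


extent = (0.0, 0.0, 1000.0, 1000.0)
# One dataset per size, so timings can be compared as the point count grows.
datasets = {n: random_points_in_extent(n, extent) for n in (1_000, 10_000, 100_000)}
```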

roya0045 commented 2 years ago

That's strange. I'm unsure whether this is inherent to QgsProcessingFeatureSource or to the algorithm itself. My first guess would be that the returned feature count is wrong and simply reports the whole layer rather than the selection, so the layers get ordered the wrong way. But I'm not sure how much of the problem that explains.

I'm curious to know whether maintaining the 'order' of the datasets and inverting it (by keeping more or fewer features selected in the bigger dataset than in the smaller one) has an effect. The geometry type may also have something to do with it, I'm unsure.

The optimisation steps may need to be reworked depending on the findings. If someone has some time to test these theories, that would help. I can look into some options soon-ish if I don't forget.
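The feature-count guess can be illustrated with a small plain-Python model. Everything here is hypothetical: `Source`, `plan_join` and the flag `count_ignores_selection` are invented for illustration and do not mirror QGIS classes. The model assumes a planner that iterates over the source reporting fewer features and probes the other one; if the reported count ignores the selection, the planner picks the opposite order.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for a layer that may have an active selection."""
    name: str
    total: int
    selected: int
    use_selection: bool = False
    count_ignores_selection: bool = False  # models the suspected bug

    def feature_count(self):
        # A correct count reflects the selection when one is in use.
        if self.use_selection and not self.count_ignores_selection:
            return self.selected
        return self.total


def plan_join(a, b):
    """Iterate the source that reports fewer features, probe the other one."""
    outer, inner = (a, b) if a.feature_count() <= b.feature_count() else (b, a)
    return outer.name, inner.name


join = Source("join", total=6_000, selected=6_000)
base_ok = Source("base", total=1_000_000, selected=1_600, use_selection=True)
base_bug = Source("base", total=1_000_000, selected=1_600,
                  use_selection=True, count_ignores_selection=True)

correct_order = plan_join(base_ok, join)   # base reports 1,600 features
buggy_order = plan_join(base_bug, join)    # base reports 1,000,000 features
```

With these made-up numbers the two plans differ, which is the kind of effect the guess describes. It does not show that QGIS behaves this way.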

roya0045 commented 2 years ago

I tested with small datasets. In all cases, using a selection to subset the dataset reduced the time; analysis with polygons took more time (they have more vertices, so the computation takes longer, as expected). I can't quite replicate this with my datasets on 3.22 on Windows. If someone could provide a small dataset to replicate it, that would be appreciated.