qgis / QGIS

QGIS is a free, open source, cross platform (lin/win/mac) geographical information system (GIS)
https://qgis.org
GNU General Public License v2.0
10.35k stars 2.98k forks source link

Text search is always case sensitive for Non-ASCII letters if layer stored in SQLite (GPKG, SpatiaLite, FDO SQLite Spatial) #50863

Closed phidrho closed 1 day ago

phidrho commented 1 year ago

What is the bug or the crash?

When I tried to search for attribute inside GPKG layer which contains letters used in Croatian alphabet (š,đ,č,ć,ž) via Locator with use of af modulator - I did not get any results if entered different case (uppercase or lowercase) than what is stored, I tried to do the same thing for SHP layer and it works as expected (it's not case sensitive).

I also noticed small UI bug - in form for Locator Configuration (Setting>Options>Locator>Configuration) there is a widget (QSpinBox) to set up Maximum number of results which accepts values up to 200, but value cannot be above 99.

Both bugs are shown in video demo at youtube.

Steps to reproduce the issue

Download a sample project with GPKG and SHP variant of same layer - qgis_locator.zip

  1. Use a Locator with af prefix with any of Croatian letters - š, đ, č, ć, ž, Š, Đ, Č, Ć, Ž

  2. for UI widget bug see end of video noted above

Versions

QGIS version 3.28.0-Firenze QGIS code revision ed3ad0430f3 Qt version 5.15.3 Python version 3.9.5 Compiled against GDAL/OGR 3.5.2 Running against GDAL/OGR 3.5.3 PROJ version 9.1.0 EPSG Registry database version v10.074 (2022-08-01) GEOS version 3.10.3-CAPI-1.16.1 SQLite version 3.39.4 PDAL version 2.4.3 PostgreSQL client version unknown SpatiaLite version 5.0.1 QWT version 6.1.6 QScintilla2 version 2.13.1 OS version Windows 10 Version 2009

Active Python plugins actions_for_relations 1.2.0 document_management_system 0.3.1 EquirectangularViewer 1.4 GroupStats 2.2.6 plugin_reloader 0.9.2 QGIS3-getWKT 1.4 QuickWKT 3.1 SelectByRelationship 0.3.3 db_manager 0.1.20 processing 2.12.99

Supported QGIS version

New profile

Additional context

No response

agiudiceandrea commented 1 year ago

The second reported issue may be related to the PR https://github.com/qgis/QGIS/pull/38628 by @3nids.

krzysztof-adamski commented 1 year ago

I think the problem with searching is not even Croatian alphabet, I have same with Polish in filtering (GPKG layer)

Example query = " column_name in ('światłowodowe') " f_request = QgsFeatureRequest().setFilterExpression(query) features = layer.getFeatures(f_request) ids = [feature.id() for feature in features] Always is zero results, but in that column value "światłowodowe" exist. This behavior regarding custom plugin, but if I use python console in QGIS, and add above line : # -- coding: utf-8 -- searching work well.

Is it a bug or something else ?

I trying somthing like: query = f" ({query}) COLLATE Polish_CI_AS " - it wont work

phidrho commented 1 year ago

Hi @krzysztof-adamski,

yes, it's not only Croatian alphabet, I just used this for example, it's probably all letters outside ASCII symbols table as I noted in title. Developers are aware of that, it's common problem, and when bug gets fixed, it will fix all worldwide letters.

Edit:

This behavior regarding custom plugin, but if I use python console in QGIS, and add above line : # -- coding: utf-8 -- searching work well.

This is probably Windows and Qt-related, when you write your script and save it to disk, it probably defaults to System default encoding (for Croatia it's Windows-1250, possibly the same for Polish), but when you run your script it's read from disk as UTF-8 encoded, and your letters are misinterpreted - you think that you are sending query like:

SELECT * FROM table WHERE column_name in ('światłowodowe');

but software actually sends it to database as:

SELECT * FROM table WHERE column_name in ('�wiat�owodowe');

and that's the reason why you don't get any results.

When you entered # -- coding: utf-8 -- at the beginning of script IDE probably forced UTF-8 encoding when saving to disk.

This is another bug/feature, I suggest that you propose to add settings option for default encoding inside IDE.

krzysztof-adamski commented 1 year ago

@phidrho I am not super duper in QGIS, sql, python etc, I figure out this problem just like that

def replace_polish_letter(self, column_name):
    polish_chars = ["ą", "ó", "ę", "ł", "ś", "ć", "ż", "ź", "ń"]

    string = "upper(#COLUMN#)"
    prefix = suffix = ""
    for polish_char in polish_chars:
        prefix += f"replace("
        suffix += f", '{polish_char}', '{unidecode(polish_char)}')"

    concat = prefix + string + suffix
    concat = concat.replace("#COLUMN#", column_name)
query = query.replace(column_name, "#COLUMN#")
query = query.upper()
query = query.replace("#COLUMN#", self.replace_polish_letter(column_name))

I know it is weird, but enough for yet.

phidrho commented 1 year ago

@jef-n, @agiudiceandrea, @DelazJ, @lcosicnus

I've found the source of this bug and fix. In my opinion, this should be fixed in deployment (distribution of SQLite with QGIS), not in the source code of QGIS.

This is a pretty severe bug, in my opinion, it affects almost the whole world and it should be backported to LTS because GPGK is a default file format for QGIS. It also affects SQLite and SpatiaLite layers. The problem happens when QGIS calls DB (GPKG in this case) with a query that uses lower and upper built-in SQL functions that don't work on non-ASCII letters by default on SQLite. It does not affect only Locator but everywhere in QGIS where queries are sent to DB with "upper" and "lower" functions SQLite by default - for example in Select Features By Value and Execute SQL, see image below:

select by values

An example of a query that is sent from Locator is:

SELECT m."fid", m."test", m."test_expression" FROM "test" m WHERE (lower("test") LIKE lower('%Š%') ESCAPE '\' OR lower("test_expression") LIKE lower('%Š%') ESCAPE '\')

My proposed solutions are

1) to enable ICU at compile time for SQLite distributed with QGIS.

2) redefine SQLite's upper and lower functions after creating DB connection to SQLite

3) to enable the loading of SQLite extensions - so that we can load ICU extension ourselves via query, e.g. SELECT load_extension('icu');.

4) the third alternative is to rewrite all calls from QGIS to call a lower and upper in C++ before constructing SQL query to DB. (this would fix current issues but the bug may reappear in the future if some developer is unaware of default SQLite limitations, and also this would cause problems with plugins that are not aware of SQLite limitations).

Steps to temporary bugfix:

How I tested and got the correct behavior (got the bug fixed on my machine):

1) determine what were compile time options of sqlite3.exe in OSGeoW Shell: SELECT pragma compile_options;

outputs:

ATOMIC_INTRINSICS=0
COMPILER=msvc-1929
DEFAULT_AUTOVACUUM
DEFAULT_CACHE_SIZE=-2000
DEFAULT_FILE_FORMAT=4
DEFAULT_JOURNAL_SIZE_LIMIT=-1
DEFAULT_MMAP_SIZE=0
DEFAULT_PAGE_SIZE=4096
DEFAULT_PCACHE_INITSZ=20
DEFAULT_RECURSIVE_TRIGGERS
DEFAULT_SECTOR_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_AUTOCHECKPOINT=1000
DEFAULT_WAL_SYNCHRONOUS=2
DEFAULT_WORKER_THREADS=0
ENABLE_COLUMN_METADATA
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS4
ENABLE_FTS5
ENABLE_RTREE
ENABLE_UPDATE_DELETE_LIMIT
MALLOC_SOFT_LIMIT=1024
MAX_ATTACHED=10
MAX_COLUMN=2000
MAX_COMPOUND_SELECT=500
MAX_DEFAULT_PAGE_SIZE=8192
MAX_EXPR_DEPTH=1000
MAX_FUNCTION_ARG=127
MAX_LENGTH=1000000000
MAX_LIKE_PATTERN_LENGTH=50000
MAX_MMAP_SIZE=0x7fff0000
MAX_PAGE_COUNT=1073741823
MAX_PAGE_SIZE=65536
MAX_SQL_LENGTH=1000000000
MAX_TRIGGER_DEPTH=1000
MAX_VARIABLE_NUMBER=32766
MAX_VDBE_OP=250000000
MAX_WORKER_THREADS=8
MUTEX_W32
SYSTEM_MALLOC
TEMP_STORE=1
THREADSAFE=1

2) install MSYS2 and use MINGW64 shell

pacman -Syu # update packages
# restart MINGW64
pacman -Su #upgrade packages
#install packages
pacman -S base-devel mingw-w64-x86_64-toolchain

3) download ICU source code, extract to home/user folder of msys64

4) compile ICU

cd
cd icu4c-73_2-src/icu/source/
./configure
make
make install

5) download SQLite (amalgamation version) source code, extract to home/user folder of msys64

6) check if all links (dependencies) can be found that are needed for SQLite compilation

echo $(pkg-config --libs --cflags icu-uc icu-io)

7) compile sqlite.exe

cd
cd sqlite-amalgamation-3420000/
gcc -Os -I. \
-DSQLITE_DEFAULT_AUTOVACUUM \
-DSQLITE_DEFAULT_CACHE_SIZE=-2000 \
-DSQLITE_DEFAULT_FILE_FORMAT=4 \
-DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=-1 \
-DSQLITE_DEFAULT_MMAP_SIZE=0 \
-DSQLITE_DEFAULT_PAGE_SIZE=4096 \
-DSQLITE_DEFAULT_PCACHE_INITSZ=20 \
-DSQLITE_DEFAULT_RECURSIVE_TRIGGERS \
-DSQLITE_DEFAULT_SECTOR_SIZE=4096 \
-DSQLITE_DEFAULT_SYNCHRONOUS=2 \
-DSQLITE_DEFAULT_WAL_AUTOCHECKPOINT=1000 \
-DSQLITE_DEFAULT_WAL_SYNCHRONOUS=2 \
-DSQLITE_DEFAULT_WORKER_THREADS=0 \
-DSQLITE_ENABLE_COLUMN_METADATA \
-DSQLITE_ENABLE_FTS3 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS \
-DSQLITE_ENABLE_FTS4 \
-DSQLITE_ENABLE_FTS5 \
-DSQLITE_ENABLE_RTREE \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_MALLOC_SOFT_LIMIT=1024 \
-DSQLITE_MAX_ATTACHED=10 \
-DSQLITE_MAX_COLUMN=2000 \
-DSQLITE_MAX_COMPOUND_SELECT=500 \
-DSQLITE_MAX_DEFAULT_PAGE_SIZE=8192 \
-DSQLITE_MAX_EXPR_DEPTH=1000 \
-DSQLITE_MAX_FUNCTION_ARG=127 \
-DSQLITE_MAX_LENGTH=1000000000 \
-DSQLITE_MAX_LIKE_PATTERN_LENGTH=50000 \
-DSQLITE_MAX_MMAP_SIZE=0x7fff0000 \
-DSQLITE_MAX_PAGE_COUNT=1073741823 \
-DSQLITE_MAX_PAGE_SIZE=65536 \
-DSQLITE_MAX_SQL_LENGTH=1000000000 \
-DSQLITE_MAX_TRIGGER_DEPTH=1000 \
-DSQLITE_MAX_VARIABLE_NUMBER=32766 \
-DSQLITE_MAX_VDBE_OP=250000000 \
-DSQLITE_MAX_WORKER_THREADS=8 \
-DSQLITE_MUTEX_W32 \
-DSQLITE_SYSTEM_MALLOC \
-DSQLITE_TEMP_STORE=1 \
-DSQLITE_THREADSAFE=1 \
-DSQLITE_ENABLE_ICU \
shell.c sqlite3.c -lm -lreadline -lncurses $(pkg-config --libs --cflags icu-uc icu-io) -o sqlite3

7) compile sqlite.dll

gcc -shared -Os -I. \
-DSQLITE_DEFAULT_AUTOVACUUM \
-DSQLITE_DEFAULT_CACHE_SIZE=-2000 \
-DSQLITE_DEFAULT_FILE_FORMAT=4 \
-DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=-1 \
-DSQLITE_DEFAULT_MMAP_SIZE=0 \
-DSQLITE_DEFAULT_PAGE_SIZE=4096 \
-DSQLITE_DEFAULT_PCACHE_INITSZ=20 \
-DSQLITE_DEFAULT_RECURSIVE_TRIGGERS \
-DSQLITE_DEFAULT_SECTOR_SIZE=4096 \
-DSQLITE_DEFAULT_SYNCHRONOUS=2 \
-DSQLITE_DEFAULT_WAL_AUTOCHECKPOINT=1000 \
-DSQLITE_DEFAULT_WAL_SYNCHRONOUS=2 \
-DSQLITE_DEFAULT_WORKER_THREADS=0 \
-DSQLITE_ENABLE_COLUMN_METADATA \
-DSQLITE_ENABLE_FTS3 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS \
-DSQLITE_ENABLE_FTS4 \
-DSQLITE_ENABLE_FTS5 \
-DSQLITE_ENABLE_RTREE \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_MALLOC_SOFT_LIMIT=1024 \
-DSQLITE_MAX_ATTACHED=10 \
-DSQLITE_MAX_COLUMN=2000 \
-DSQLITE_MAX_COMPOUND_SELECT=500 \
-DSQLITE_MAX_DEFAULT_PAGE_SIZE=8192 \
-DSQLITE_MAX_EXPR_DEPTH=1000 \
-DSQLITE_MAX_FUNCTION_ARG=127 \
-DSQLITE_MAX_LENGTH=1000000000 \
-DSQLITE_MAX_LIKE_PATTERN_LENGTH=50000 \
-DSQLITE_MAX_MMAP_SIZE=0x7fff0000 \
-DSQLITE_MAX_PAGE_COUNT=1073741823 \
-DSQLITE_MAX_PAGE_SIZE=65536 \
-DSQLITE_MAX_SQL_LENGTH=1000000000 \
-DSQLITE_MAX_TRIGGER_DEPTH=1000 \
-DSQLITE_MAX_VARIABLE_NUMBER=32766 \
-DSQLITE_MAX_VDBE_OP=250000000 \
-DSQLITE_MAX_WORKER_THREADS=8 \
-DSQLITE_MUTEX_W32 \
-DSQLITE_SYSTEM_MALLOC \
-DSQLITE_TEMP_STORE=1 \
-DSQLITE_THREADSAFE=1 \
-DSQLITE_ENABLE_ICU \
shell.c sqlite3.c $(pkg-config --libs --cflags icu-uc icu-io) -o sqlite3.dll

8) backup old version of sqlite3.exe, DLL and install new one with it's dependencies

# rename old files
mv C:/OSGeo4W/bin/sqlite3.dll C:/OSGeo4W/bin/sqlite3.dll.backup
mv C:/OSGeo4W/bin/sqlite3.exe C:/OSGeo4W/bin/sqlite3.exe.backup
#install new files
cp C:/msys64/home/user/sqlite-amalgamation-3420000/sqlite3.exe C:/OSGeo4W/bin/sqlite3.exe
cp C:/msys64/home/user/sqlite-amalgamation-3420000/sqlite3.dll C:/OSGeo4W/bin/sqlite3.dll
cp C:/msys64/mingw64/bin/ICUIN73.DLL C:/OSGeo4W/bin/ICUIN73.DLL
cp C:/msys64/mingw64/bin/ICUUC73.DLL C:/OSGeo4W/bin/ICUUC73.DLL
cp C:/msys64/mingw64/bin/ICUDT73.DLL C:/OSGeo4W/bin/ICUDT73.DLL

9) run QGIS and test

EDIT: I'm notifying the rest of QGIS Packaging team: @kalxas @daniviga @sebastic @kyngchaos @PeterPetrik

EDIT2: I removed the "icu-devel" package in step no. 2 from the instructions as it's not needed

sebastic commented 1 year ago

FWIW sqlite3 in Debian is no longer built with ICU support, see: #584506 - libsqlite3-0: Include Unicode support.

Related bugreport: #536662 - sqlite3: CLI not UTF-8 clean

jgrocha commented 1 day ago

Using GDAL compiled from master, it works in QGIS. It works in the command line too.

I have two features in QGIS:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> wkt_geom fid name
Point (-7.74873431926051559 37.09479248838015764) 2 Šimun
Point (-7.74289089247152074 37.08442040582968957) 3 Maša

Searching in QGIS works.

image

The following query also works on the command line:

ogr2ogr -sql "SELECT m."fid", m."name" FROM "test" m WHERE lower("name") LIKE lower('%Š%') ESCAPE '\'" -f CSV /vsistdout/ bratislava.gpkg 
name
Šimun
Maša
jgrocha commented 1 day ago

The GDAL query works, not because lower or upper works. These functions are not working for diacritics.

The GDAL SQL works because the LIKE operator does an insensitive case comparison.

This also works, without the lower function applied to both sides of the comparison:

ogr2ogr -sql "SELECT m."fid", m."name", lower(m."name"), upper(m."name") FROM "test" m WHERE "name" LIKE '%Š%' ESCAPE '\'" -f CSV /vsistdout/ bratislava.gpkg 
name,lower(m.name),upper(m.name)
Šimun,Šimun,ŠIMUN
Maša,maša,MAšA
rouault commented 1 day ago

The GDAL SQL works because the LIKE operator does an insensitive case comparison.

and because since GDAL 3.9, GDAL overloads the base LIKE operator to support case-insentitive UTF-8 comparisons, but restricted to a few character-set (mostly European ones): https://gdal.org/en/latest/user/sql_sqlite_dialect.html#like-operator

phidrho commented 1 day ago

Hi @rouault,

I also tested now on my local machine and can confirm it works. I compiled latest GDAL, and in ogr2ogr works as expected:

GDAL 3.4.1, released 2021/12/27 (Ubuntu 22.04 official repository)

$ ogr2ogr -sql "SELECT m."fid", m."name" FROM "test" m WHERE lower("name") LIKE lower('%Š%') ESCAPE '\'" -f CSV /vsistdout/ /home/phidrho/Desktop/test.gpkg
GPKG: GeoPackage v1.2.0
GDAL: GDALOpen(/home/phidrho/Desktop/test.gpkg, this=0x5bb8bcb461a0) succeeds as GPKG.
GDAL: GDALDriver::Create(CSV,/vsistdout/,0,0,0,Unknown,(nil))
name,
Šimun
GPKG: finalize 0x5bb8bcb4c638
GDALVectorTranslate: 1 features written in layer 'SELECT'
GDAL: GDALClose(/home/phidrho/Desktop/test.gpkg, this=0x5bb8bcb461a0)
GDAL: GDALClose(/vsistdout/, this=0x5bb8bce69d70)
GDAL: In GDALDestroy - unloading GDAL shared library.

GDAL 3.10.0dev-25eb9f6ee7

$ LD_LIBRARY_PATH=. CPL_DEBUG=ON ./apps/ogr2ogr -sql "SELECT m."fid", m."name" FROM "test" m WHERE lower("name") LIKE lower('%Š%') ESCAPE '\'" -f CSV /vsistdout/ /home/phidrho/Desktop/test.gpkg
GPKG: GeoPackage v1.2.0
GDAL: GDALOpen(/home/phidrho/Desktop/test.gpkg, this=0x571e68570ff0) succeeds as GPKG.
GDAL: GDALDriver::Create(CSV,/vsistdout/,0,0,0,Unknown,(nil))
OGR2OGR: Using WriteArrowBatch()
GPKG: finalize 0x571e68570558
name
Maša
Šimun
GDAL: GDALClose(/home/phidrho/Desktop/test.gpkg, this=0x571e68570ff0)
GDAL: GDALClose(/vsistdout/, this=0x571e685b8ce0)
GDAL: In GDALDestroy - unloading GDAL shared library.

I would like to close this bug report, as it is a matter of time for fix to be applied, and it depends on packaging, not anymore on development.

I'm only notifying packaging team: @jef-n, @kyngchaos, @PeterPetrik, @kalxas, @daniviga, @sebastic