rogerbinns / apsw

Another Python SQLite wrapper
https://rogerbinns.github.io/apsw/
Other
739 stars 97 forks source link

VFS Null filename opens #506

Closed abdelouahabb closed 10 months ago

abdelouahabb commented 10 months ago

I try to retrieve data using SELECT something from product LIMIT ?,? from an obfuscated db, with 500 pagination skip, the problem I get an error when I reach 5500 (skip=6000) : Is there any limit should be raised for the cursor ?

The error :

Traceback (most recent call last):
  File "C:\Users\abdelouahab\AppData\Local\Programs\Python\Python311\Lib\site-packages\tornado\web.py", line 1790, in _execute
    result = await result
             ^^^^^^^^^^^^
  File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 1005, in get
    cur.execute('''
  File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 55, in xOpen
    return ObfuscatedVFSFile(self.base_vfs, name, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 61, in __init__
    apsw.VFSFile.__init__(self, inheritfromvfsname, filename, flags)
TypeError: filename should be a string
Windows 10 64bits,
APSW 3.44.2.0,
Python Python 3.11.7
rogerbinns commented 10 months ago

APSW doesn't change SQLite's behaviour when it comes to executing SQL.

TypeError: filename should be a string

That is the error you are getting. You should print out what the value of filename actually is, as well as the flags. augmented tracebacks will also contain the same information. My guess is SQLite is spilling from memory to disk.

If you want further help you'll need to include the APSW version, as well as enough for me to reproduce the problem.

abdelouahabb commented 10 months ago

APSW doesn't change SQLite's behaviour when it comes to executing SQL.

TypeError: filename should be a string

That is the error you are getting. You should print out what the value of filename actually is, as well as the flags. augmented tracebacks will also contain the same information. My guess is SQLite is spilling from memory to disk.

If you want further help you'll need to include the APSW version, as well as enough for me to reproduce the problem.

Sorry, here is what I got :

 File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 1008, in get
    cur.execute('''
    contenu = 'Année Passée'
    exc = TypeError('filename should be a string')
    fin = 500
    nombre = (8099,)
    now = datetime.datetime(2024, 1, 5, 18, 16, 43, 611, tzinfo=zoneinfo.ZoneInfo(key='Africa/Algiers'))
    pagination = 16
    self = <__main__.Recette object at 0x000001EB7FED2390>
    start = 6000
    url = '/admin/recette?q=l'
    what = 'l&skip'

  File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 57, in xOpen
    return ObfuscatedVFSFile(self.base_vfs, name, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    flags = [1054, 0]
    name = None
    self = <__main__.ObfuscatedVFS object at 0x000001EB7F8932A0>

  File "c:\Users\abdelouahab\Desktop\stocks\application\app.py", line 63, in __init__
    apsw.VFSFile.__init__(self, inheritfromvfsname, filename, flags)
    filename = None
    flags = [1054, 0]
    inheritfromvfsname = ''
    self = <__main__.ObfuscatedVFSFile object at 0x000001EB104A7F70>

TypeError: filename should be a string


Windows 10 64bits,
APSW 3.44.2.0,
Python Python 3.11.7

Seems it loses the filename argument after a certain cursor limit ?

rogerbinns commented 10 months ago

This is a bug in APSW.

The open flags are SQLITE_OPEN_CREATE | SQLITE_OPEN_DELETEONCLOSE | SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_READWRITE | SQLITE_OPEN_TRANSIENT_DB

The SQLite doc does say NULL is allowed. I vaguely recall in the past that a zero length string was passed.

This exact combination only occurs for OP_OpenAutoindex / OP_OpenEphemeral in the source

The reason you are hitting this is because as you add larger offsets to your LIMIT , more data has to be kept in memory to be sorted until memory limits are hit and disk is used instead.

Once I have a reproducer I'll let you know of a workaround.

abdelouahabb commented 10 months ago

This is a bug in APSW.

The open flags are SQLITE_OPEN_CREATE | SQLITE_OPEN_DELETEONCLOSE | SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_READWRITE | SQLITE_OPEN_TRANSIENT_DB

The SQLite doc does say NULL is allowed. I vaguely recall in the past that a zero length string was passed.

This exact combination only occurs for OP_OpenAutoindex / OP_OpenEphemeral in the source

The reason you are hitting this is because as you add larger offsets to your LIMIT , more data has to be kept in memory to be sorted until memory limits are hit and disk is used instead.

Once I have a reproducer I'll let you know of a workaround.

Thank you very much ^^

abdelouahabb commented 10 months ago

Reproduced it with the code :

import apsw
import uuid
import time
import sys
from apsw.ext import print_augmented_traceback

def obfuscate(data):
    if not data: return data
    return bytes([x ^ 0xcc for x in data])

class ObfuscatedVFS(apsw.VFS):

    def __init__(self, vfsname="obfuscated", basevfs=""):
        self.vfs_name = vfsname
        self.base_vfs = basevfs
        apsw.VFS.__init__(self, self.vfs_name, self.base_vfs)

    def xOpen(self, name, flags: int):
        return ObfuscatedVFSFile(self.base_vfs, name, flags)

class ObfuscatedVFSFile(apsw.VFSFile):

    def __init__(self, inheritfromvfsname, filename, flags):
        apsw.VFSFile.__init__(self, inheritfromvfsname, filename, flags)

    def xRead(self, amount, offset):
        return obfuscate(super().xRead(amount, offset))

    def xWrite(self, data, offset):
        super().xWrite(obfuscate(data), offset)

obfuvfs = ObfuscatedVFS()

open_flags = apsw.SQLITE_OPEN_READWRITE | apsw.SQLITE_OPEN_CREATE
open_flags |= apsw.SQLITE_OPEN_URI

conn = apsw.Connection("my.db", vfs=obfuvfs.vfs_name)

cur = conn.cursor()

cur.execute('''
            CREATE TABLE IF NOT EXISTS "hello" (
                "a" CLOB NOT NULL,
                "b" CLOB NOT NULL,
                "c" CLOB NOT NULL,
                "d" CLOB NOT NULL,
                "e" CLOB NOT NULL
            )''')

with conn:
    for i in range(100000):
        cur.execute('''
        insert into hello (`a`, `b`, `c`, `d`, `e`) VALUES (?, ?, ?, ?, ?)
        ''', (str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4())))

for i in range(9000):
    try:
        print("from {0} to {1}".format( i*500, i*500+500) )
        print("reached {} characters".format((i+1)*500*36*5))
        cur.execute("select * from hello order by a limit ?, ?", (i*500, (i*500+500)))
        print(cur.fetchall())
        time.sleep(1)
    except Exception as exc:
        print_augmented_traceback(*sys.exc_info())
        break

It stops at :

from 4000 to 4500
reached 810000 characters

Even if I have 32 GB of ram on 64bits OS ?

Traceback (most recent call last):

  File "c:\Users\abdelouahab\Desktop\New folder\db_tester.py", line 64, in <module>
    cur.execute("select * from hello order by a limit ?, ?", (i*500, (i*500+500)))
    ObfuscatedVFS = <class '__main__.ObfuscatedVFS'>
    ObfuscatedVFSFile = <class '__main__.ObfuscatedVFSFile'>
    __annotations__ = {}
    __builtins__ = <module 'builtins' (built-in)>
    __cached__ = None
    __doc__ = None
    __file__ = 'c:\\Users\\abdelouahab\\Desktop\\New folder\\db_tester.py'
    __loader__ = <_frozen_importlib_external.SourceFileLoader object at 0x00000232AD8DBBF0>
    __name__ = '__main__'
    __package__ = None
    __spec__ = None
    apsw = <module 'apsw' from 'C:\\Users\\abdelouahab\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\apsw\\__init__.cp312-win_amd64.pyd'>
    conn = <apsw.Connection object at 0x00000232ADBB5F30>
    cur = <apsw.Cursor object at 0x00000232AE028040>
    exc = TypeError('filename should be a string')
    i = 8
    obfuscate = <function obfuscate at 0x00000232AD8CA3E0>
    obfuvfs = <__main__.ObfuscatedVFS object at 0x00000232ADBE74D0>
    open_flags = 70
    print_augmented_traceback = <function print_augmented_traceback at 0x00000232AE0142C0>
    sys = <module 'sys' (built-in)>
    time = <module 'time' (built-in)>
    uuid = <module 'uuid' from 'C:\\Users\\abdelouahab\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\uuid.py'>

  File "c:\Users\abdelouahab\Desktop\New folder\db_tester.py", line 19, in xOpen
    return ObfuscatedVFSFile(self.base_vfs, name, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    flags = [1054, 0]
    name = None
    self = <__main__.ObfuscatedVFS object at 0x00000232ADBE74D0>

  File "c:\Users\abdelouahab\Desktop\New folder\db_tester.py", line 25, in __init__
    apsw.VFSFile.__init__(self, inheritfromvfsname, filename, flags)
    filename = None
    flags = [1054, 0]
    inheritfromvfsname = ''
    self = <__main__.ObfuscatedVFSFile object at 0x00000232ADB83E30>

TypeError: filename should be a string

Edit : tried it on Linux (PopOS 64bits) same error at the same iteration

rogerbinns commented 10 months ago

I was unable to find a workaround such as increasing the cache size. I am about to commit the fix. It will be part of the next release which is soon.

abdelouahabb commented 10 months ago

It worked now, performance is a hit (takes more time after that limitation, but at least it works ! ) Here is the result on Linux

2024-01-06_16-21

abdelouahabb commented 10 months ago

I measured performances with this :

for i in range(90):
    try:
        t1 = time.time()
        print("from {0} to {1}".format( i*500, i*500+500) )
        cur.execute("select * from hello order by a limit ?, ?", (i*500, (i*500+500)))
        # print(cur.fetchall())
        print("reached {} characters".format((i+1)*500*36*5))
        perfor.append(time.time() - t1)
        print("performance is {}".format(perfor))
        time.sleep(1)
    except Exception as exc:
        print_augmented_traceback(*sys.exc_info())
        break

And it seems it's getting bigger and bigger !

2024-01-06_16-33

rogerbinns commented 10 months ago

You are measuring one thing:

select * from hello order by a limit ?, ?

To do the order by SQLite has to do a sort. The sort time is proportional to the number and size of items being sorted as you have found.

Turn on logging and SQLite will tell you if it is creating an automatic index.

Use EXPLAIN QUERY PLAN to see how SQLite executes your query. There is good documentation on the principles behind query execution.

You can do this to get a shell in the right place that knows how to format the query plan:

apsw.shell.Shell(db=conn).process_sql("explain query plan select * from hello order by a limit ?, ?", (i*500, (i*500+500)))

With your code above I get this query plan:

  id  parent  detail
   9       0  SCAN hello
  24       0  USE TEMP B-TREE FOR ORDER BY

SCAN means it is going through the whole table start to finish. The TEMP B-TREE is SQLite having to sort everything on demand.

I did this:

create index hello_a on hello(a)

Then the query plan becomes:

  id  parent  detail
  10       0  SCAN hello USING INDEX hello_a

It is instant because SQLite just has to seek to the right location in the index to produce the results and doesn't need to do a sort.

You also don't need that separate cursor and it isn't doing anything for you other than providing a potential source of bugs in the future when you use it while using it. All the cur.execute can and should be replaced with conn.execute.

abdelouahabb commented 10 months ago

You are measuring one thing:

select * from hello order by a limit ?, ?

To do the order by SQLite has to do a sort. The sort time is proportional to the number and size of items being sorted as you have found.

Turn on logging and SQLite will tell you if it is creating an automatic index.

Use EXPLAIN QUERY PLAN to see how SQLite executes your query. There is good documentation on the principles behind query execution.

You can do this to get a shell in the right place that knows how to format the query plan:

apsw.shell.Shell(db=conn).process_sql("explain query plan select * from hello order by a limit ?, ?", (i*500, (i*500+500)))

With your code above I get this query plan:

  id  parent  detail
   9       0  SCAN hello
  24       0  USE TEMP B-TREE FOR ORDER BY

SCAN means it is going through the whole table start to finish. The TEMP B-TREE is SQLite having to sort everything on demand.

I did this:

create index hello_a on hello(a)

Then the query plan becomes:

  id  parent  detail
  10       0  SCAN hello USING INDEX hello_a

It is instant because SQLite just has to seek to the right location in the index to produce the results and doesn't need to do a sort.

You also don't need that separate cursor and it isn't doing anything for you other than providing a potential source of bugs in the future when you use it while using it. All the cur.execute can and should be replaced with conn.execute.

Wow ! this just amazing ! Indexing the ordering reduced not only the execute time, but even the bug disapeared without the patch, the patch will improve even better ! Thank you so much !