michalc / sqlite-s3-query

Python functions to query SQLite files stored on S3
MIT License
251 stars 15 forks source link

Do not copy undefined data #4

Closed mumbleskates closed 3 years ago

mumbleskates commented 3 years ago

do not trust the s3 service to return the exact right amount of data. we do not even require that the url have a secure scheme, so we definitely should not be implicitly trusting that the correct amount of data came back from the remote service before copying the presumed amount of memory.

michalc commented 3 years ago

Thanks for pointing this out... (after pondering for a bit) I think I agree it should handle the case of the server not returning the right amount of bytes.

Small thing: I'm not sure about the check on type(data) is not bytes... a well behaving HTTP client shouldn't return undefined as the body, e.g. if there are no bytes in the body, it would be the empty byte string b'', rather than undefined (looking at https://www.python-httpx.org/api/#response it does seem to suggest content is always bytes)?

Also, I think I'm fairly pro sticking with the lambda as-is, pro not using another exception, and also pro being quite explicit about when the error codes are returned. So something like:

try:
    data = get_range(i_ofst, i_ofst + i_amt - 1)
except Exception:
    return SQLITE_IOERR_SHORT_READ

if len(data) != amt:
    return SQLITE_IOERR_SHORT_READ

memmove(p_out, data, i_amt)
return SQLITE_OK
mumbleskates commented 3 years ago

I'm not sure what kinds of HTTP client you anticipate being used here. If the contract is that the content attribute can only ever contain a bytes object, never an interpretation of the response's encoding, then that's probably fine. It seems sensible and is very inexpensive to at least check.

As for the lambda, get_range is the only lambda in the entire VFS. Additionally seems to stand to reason that its responsibility is to actually fetch the byte range from the remote web service. If it gets a different number of bytes, clearly something went wrong, just as much as it has if it returns a 403 or a 5XX error. Leaking concerns about whether or not get_range actually worked or whether it has produced a totally different result because of remote behavior you need to think about seems bad. Wrong-sized data is an exceptional case, so just raise an exception.

mumbleskates commented 3 years ago

So anyway, feel free to take whatever option you wanted with most of this but the library is not safe to use as-is.

michalc commented 3 years ago

Understood. Will merge in + make changes

mumbleskates commented 3 years ago

Thanks! Looks like it's undergone significant changes now anyway