Closed: datbth closed this issue 4 days ago
Hi, and thank you for raising this requirement. It makes sense to me, indeed: being able to handle results the same way regardless of whether they are empty or non-empty. We'll look into this (no timeline estimated).
So, I had a word with the team, and for now we won't change our approach, mostly because if there's no resultset, then there's no schema to create the Arrow record with.
So if your code seems to need error handling when a resultset is empty (the result being an empty array rather than an arrow.Record), perhaps changing the processing approach could help the fastest. Here's a sketch:
batches, err := result.GetArrowBatches()
if err != nil {
    // handle error
}
for _, batch := range batches {
    records, err := batch.Fetch() // fetch this batch's Arrow records
    if err != nil {
        // handle error
    }
    for _, record := range *records {
        // do something with record; this body simply never runs
        // when the batch contains no records
    }
}
And here's an actual working example of how to process Arrow batches with GetArrowBatches.
If you modify the generator to GENERATOR(ROWCOUNT=>0), then you can emulate an empty resultset, and you can also observe that the code still runs successfully to completion, as it iterates over however many records are in the batch.
Hope this helps.
Mostly because if there's no resultset, then there's no schema to create the Arrow record with.
I think this is more of an implementation issue. Any schemaful database would output empty results with a proper schema.
Here is an empty result on Snowflake console:
When using the snowflake clients, the JSON response of an empty result also has a schema.
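To illustrate that point with a stand-in schemaful database (SQLite from Python's standard library, not Snowflake): a query matching zero rows still reports its column metadata.

```python
import sqlite3

# Stand-in example (SQLite, not Snowflake): a schemaful database reports
# column metadata even when the query matches zero rows.
conn = sqlite3.connect(":memory:")
cur = conn.execute("SELECT 1 AS a WHERE 1 = 0")

print([d[0] for d in cur.description])  # ['a'] -- the schema is known
print(cur.fetchall())                   # []    -- even though there are no rows
```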
So if your code seems to need error handling when a resultset is empty
It's because my application needs to do downstream processing. It should be able to process any valid resultset as a normal relation/table/Arrow record: it expects to process 1 resultset (empty or not), not 0 resultsets. Otherwise, for example, would you output a blank CSV, or a Parquet file with no schema, when downloading from an empty Snowflake table?
I can understand if you say it's a technical limitation/oversight of the API (either the server side or the client side).
But I really doubt your database team would say this is "by design" that "an empty Arrow result does not have a schema".
@sfc-gh-pfus do you have any insight which might be worth sharing here ?
I just tried the python connector and it outputs a proper empty result:
import snowflake.connector
ctx = snowflake.connector.connect(
user=user,
password=password,
account=account,
warehouse=warehouse,
database=database,
login_timeout=3,
)
cur = ctx.cursor()
sql = "select 1 a where 1 = 0"
cur.execute(sql)
df = cur.fetch_pandas_all()
print(df.to_string())
print("--\n")
print("num columns: " + str(len(df.columns)))
print("size:" + str(df.size))
print("-----\n\n")
import pyarrow as pa
table = pa.Table.from_pandas(df)
print(table)
Output:
Empty DataFrame
Columns: [A]
Index: []
--
num columns: 1
size:0
-----
pyarrow.Table
A: int64
----
A: [[]]
Hi @datbth ! Let me summarize a couple of things.
SELECT 1, 256, 32768
produces three INTEGER database columns, but in Arrow they may be described as int8, int16 and int32 columns. Furthermore, another example: if you have a table which contains one INTEGER database column, and it contains values from 1 to, let's say, 1000000 (incrementing by one), and you observe the Arrow batch schemas, you may see that the first record has int8 (supposing that the first batch contains only values below 128), then a couple of records contain int16 (up to the int16 limit), and only then will you have int32 records. I hope that with this knowledge I have convinced you why we don't want to create an empty record. For
SELECT 1
you should retrieve an int8 column, or at least this is what is returned from the backend.
Thank you for the response.
If you use Arrow batches, please have in mind, that this is not a standard SQL anymore.
My point was to explain that my query returns a schemaful relation, so I expect to be able to retrieve that schema, especially when Arrow is also schemaful.
I'm trying to use Arrow to benefit from its performance values (minimal serialization cost, columnar processing, etc.).
I think it's reasonable to expect that, from the client perspective, Arrow is an interchangeable schemaful data format.
If you have a table which contains one INTEGER database type, and it contains values from 1 to let's say 1000000 (incrementing by one), and you observe Arrow batches schema, you can see that for instance first record may have int8 (supposing that first batch contains only values below 128), then a couple of records containing int16 (up to the int16 limit) and only then you will have int32 records.
I agree that this makes some sense. But then how about using the minimal data type (e.g. int8 in your example) when the result table is empty?
Based on what you showed it seems that in Python you firstly create pandas, which is schema aware, and then you convert it to arrow
I also don't know whether it is possible to retrieve Arrow records directly or not, but at least the library provides a natively supported way to retrieve the schema. Could you try discussing this matter with the other client teams or the database team?
For me, currently, I'm able to construct the Arrow schema from the snowflake result schema. So this issue is not urgent to me.
However, I hope you could agree that having a schema is important (for further processing) and that the need for the schema is valid.
The technical limitation/optimization that you described is a valid reason to me, but I also hope that it could be improved.
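For anyone hitting the same issue, the workaround mentioned above (constructing the Arrow schema from the Snowflake result schema) can be sketched roughly like this. The type names, mapping rules, and the helper itself are hypothetical illustrations, not gosnowflake's actual mapping:

```python
# Hypothetical sketch: derive an Arrow-style type name from Snowflake
# result metadata (type name, precision, scale). The mapping rules are
# illustrative assumptions, not gosnowflake's actual implementation.
def arrow_type_name(sf_type, precision=None, scale=None):
    if sf_type == "FIXED":  # Snowflake's internal name for NUMBER types
        if scale == 0:
            return "int64"
        return f"decimal128({precision}, {scale})"
    simple = {"TEXT": "utf8", "REAL": "float64", "BOOLEAN": "bool"}
    return simple.get(sf_type, "utf8")

print(arrow_type_name("FIXED", 27, 9))  # decimal128(27, 9)
print(arrow_type_name("FIXED", 18, 0))  # int64
print(arrow_type_name("TEXT"))          # utf8
```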
I see your point @datbth . Nevertheless, we currently have 7 native drivers. We try to minimize code duplication and keep consistency across our codebases, especially if something may be implemented on the backend side. And the second big point is: we don't create Arrow data in the driver, we just propagate what the backend sends to us, and we don't want to change that approach (we have had this discussion more than once before). My suggestion is to work with your AM/SE and file a backend request to fill in the schema on that end, and hopefully someone will implement it :) 🤞 🤞 🤞
My suggestion is to work with your AM/SE and file a backend request to fill the schema on that end and hopefully someone will implement it
I see. I was hoping this gets propagated to your backend team.
But I understand. Will try to file a request when I get the time.
Thank you.
I think it may be faster if it is an external requirement :)
Can we close the issue now?
Yeah ok. Thanks
If you have a table which contains one INTEGER database type, and it contains values from 1 to let's say 1000000 (incrementing by one), and you observe Arrow batches schema, you can see that for instance first record may have int8 (supposing that first batch contains only values below 128), then a couple of records containing int16 (up to the int16 limit) and only then you will have int32 records.
Such varying schemas in a single resultset would be hard to deal with.
And to produce such varying schemas, the server side would have to compute the min/max of each column of each batch, which can be costly.
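For what it's worth, that scan can be sketched in a few lines (illustrative stdlib Python, not the backend's actual algorithm): picking the smallest signed integer type that holds every value in a batch requires exactly one min/max pass per column.

```python
# Illustrative sketch (not Snowflake's code): choose the smallest signed
# integer type able to hold every value in a batch via one min/max scan.
def minimal_int_type(values):
    values = list(values)
    if not values:
        return "int8"  # an empty batch forces nothing wider
    lo, hi = min(values), max(values)
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        if -(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1)):
            return name
    raise OverflowError("batch values exceed the int64 range")

print(minimal_int_type(range(1, 128)))    # int8  (all values below 128)
print(minimal_int_type(range(1, 40000)))  # int32 (39999 exceeds int16's 32767)
print(minimal_int_type([]))               # int8
```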
I just went ahead and tried this query:
select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789.123456789 FROM table (generator(rowcount => 10000))
It yields 2 batches, containing 8192 rows (in 1 Arrow Record) and 11808 rows (in 2 Arrow Records). But both batches (all Arrow records) have this same schema:
arrow.Decimal128Type {Precision: 27, Scale: 9}
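That Decimal128 with precision 27 and scale 9 is consistent with simple type-union arithmetic: the widest literal needs 18 integer digits plus 9 fractional digits. A quick stdlib sketch of that arithmetic (my own illustration, not Snowflake's typing code):

```python
from decimal import Decimal

# Sketch (not Snowflake's code): the precision/scale needed to hold a set
# of decimal literals is max integer digits plus max fractional digits.
def union_decimal_type(literals):
    scale = 0
    int_digits = 1
    for s in literals:
        t = Decimal(s).as_tuple()
        scale = max(scale, -t.exponent)
        int_digits = max(int_digits, len(t.digits) + t.exponent)
    return int_digits + scale, scale

print(union_decimal_type(["1", "123456789123456789.123456789"]))  # (27, 9)
```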
Tried another one:
select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789 FROM table (generator(rowcount => 10000))
and also found that all the result Arrow records have this same schema:
arrow.Int64Type{}
So I think there has been a mistake in your statement.
My guess is that it is because you just select static data, but I'm not 100% sure. I think the server does not compute it at select time, but at storage time; this is an optimization to save disk space. I'm not an expert on how the backend compresses integers or what the rules behind it are. I only know that we have already had customers who were unhappy with this compression and the schema variations, and there is an ongoing task to implement a parameter to disable this behaviour on the backend side.
Thank you for the answer.
there is ongoing task to implement a parameter to disable this behaviour on the backend side
Is there any way I can get notified when such a parameter is released?
What is the current behavior?
When the query returns an empty result (e.g. SELECT 1 WHERE 1 = 0), GetArrowBatches returns an empty array.
What is the desired behavior?
An empty result is still a valid result. GetArrowBatches should return an empty Arrow Record. Otherwise, downstream processing would have to treat empty results as errors, and that does not make sense.
How would this improve gosnowflake?
It would allow downstream processing/usage to consume empty results properly.
References, Other Background
This could be a server issue if the server is not responding with any "chunks".