great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.8k stars 1.51k forks source link

multiple sample_using methods fail for BigQuery datasource #6135

Closed rbertsche closed 2 weeks ago

rbertsche commented 1 year ago

Describe the bug When using the sampling_methods:

sample_using_random

There was a previous issue opened for this, but it has since been closed without resolution: https://github.com/great-expectations/great_expectations/issues/4429 i.e. This is the SQL generated by the Data Assistant. You can see that :

SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `bq_project.schema.table` 
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` ORDER BY random()
 LIMIT @`param_4`) AS `great_expectations_sub_selection`

BQ throws the following Error message:

Function not found: random; Did you mean rand?

Based on this information, it seems this can be fixed by recognizing when the sample method is using BQ, and pass in a BQ dialect specific rand function, and not the generic SQL random().

The line of code where this issue is occurring is: https://github.com/great-expectations/great_expectations/blob/b3761729b156367229b5cd8895d225cb13d3267a/great_expectations/execution_engine/split_and_sample/sqlalchemy_data_sampler.py#L174

Using mod

There are similar SQL dialect issues when usingsample_using_mod. This method injects the mod symbol % into the SQL. This is not a supported function in the BQ dialect of SQL. B! requires mod to use the function MOD(param1, param2)

Using md5

There are a couple of errors with the md5 function. Firstly, the doc describes the function name as sample_using_hash. However, that is not a valid function for SQLAlchemy based runs. Instead, the correct name in sample_using_md5. I believe the correct action would be to change the function name to sample_using_hash to align with the other execution engines, as well as the documentation.

There is also an issue with execution on BQ. The problem lies in the generated function. In this example, I replaced the hash_digits = 1, and hash_value = 'f'. I manually replaced them in the query for clarity:

SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `project.schema.table` 
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` AND right(md5(CAST(`column_name` AS STRING)), 1) = 'f') AS `great_expectations_sub_selection`

The error thrown is: No matching signature for operator = for argument types: BYTES, STRING. STRING and BYTES are different types that are not directly comparable.

This error is directly related to the return types of BQ in certain functions. The md5() function returns type BYTES, and the right() method will return type BYTES or STRING depending on the input. Since md5() always returns BYTES, then right() always returns BYTES. The 'f' parameter that is being passed into the query is of type STRING. BigQuery does not allow = comparisons between BYTES and STRINGS. Therefore, the query returns an error. https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions#md5 https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#right

Fixing this error may be possible by wrapping the right statement in a cast to string function or the hash_value into bytes. https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions#cast_as_bytes

To Reproduce Steps to reproduce the behavior:

  1. Create a BQ datasource connection that uses the sampling_method: sample_using_random
  2. Run the DataAssistant on that new Data Connection
  3. See error

Expected behavior This sample_using_random should recognize that this is a Bigquery connection, and substitute the appropriate random function for the bq dialect. In this case it is rand. If that were done, the DataAssistant would run and execute correctly.

This doesn't just apply to the DataAssistant, but all expectations run on a sample_using_random configured BigQuery Dataset.

Environment (please complete the following information):

Additional context Add any other context about the problem here.

AFineDayFor commented 1 year ago

Howdy @rbertsche :wave: thank you for raising this with us and being a part of our Community :bow:

We'll review this wiht the team, thank you so very much for bringing this to our attention and your patience

rdodev commented 1 year ago

Hey @rbertsche apologies for drop in comms. Is this issue still relevant? Were you able to get around it?

billdirks commented 1 year ago

This reproduces in the latest Great Expectations. Thanks for filing it!

molliemarie commented 2 weeks ago

Hello @rbertsche. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗