sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0

ISO 8601 format check fails on valid dates. #2098

Open. Ingmarvdg opened this issue 3 months ago

Ingmarvdg commented 3 months ago

When checking whether a datetime string is ISO 8601 compliant, some valid datetimes fail the check.

Dates in the 10th month.

The date 2020-10-11 fails, while 2020-09-11 and 2020-11-12 pass. This seems to be caused by the section ?((0[0-9]|1[12]) of the regular expression, which should be ?((0[0-9]|1[0-2]).

Dates before 1900 or after 2099

The dates 2100-01-01 and 1899-01-01 fail due to the section *(19|20)[[:digit:]][[:digit:]] of the regular expression, which only admits years 1900 through 2099.
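
For illustration, a small standalone check of those two fragments (a sketch using Python's re module, not Soda's actual validation code):

import re

# Month alternation: "0[0-9]" covers 01-09 and "1[12]" covers only 11 and 12,
# so the month "10" falls through and 2020-10-11 is flagged as invalid.
month_current = re.compile(r"^(0[0-9]|1[12])$")
month_fixed = re.compile(r"^(0[0-9]|1[0-2])$")
print(month_current.match("10") is not None)  # False
print(month_fixed.match("10") is not None)    # True

# Year fragment: "(19|20)" followed by two digits only admits 1900-2099,
# so 1899 and 2100 are rejected.
year_current = re.compile(r"^(19|20)[0-9][0-9]$")
print(year_current.match("1899") is not None)  # False
print(year_current.match("2100") is not None)  # False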

tools-soda commented 3 months ago

SAS-3692

pholser commented 3 months ago

The same goes for any valid format value that involves the year or year4 regexes: date inverse, for example.

m1n0 commented 2 months ago

Hi, thanks for reporting. Could you verify whether this is fixed by https://github.com/sodadata/soda-core/pull/2128?

pholser commented 2 months ago

@Ingmarvdg would you be willing to try out the above fix for your case?

pholser commented 2 months ago

@m1n0 @Ingmarvdg I can confirm that #2128 corrects the problem. However, I also noticed that ISO 8601 datetimes should accept 24-hour times, and currently they don't. I am opening another PR to correct that.
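
For example (again a standalone sketch with Python's re module, using the hour fragment (0[0-9]|1[012]) that appears in the current pattern):

import re

# The hour alternation tops out at 12, so a 24-hour time such as 13:45 is rejected.
hour_current = re.compile(r"^(0[0-9]|1[012])$")
hour_24 = re.compile(r"^([01][0-9]|2[0-3])$")  # one way to cover 00-23
print(hour_current.match("13") is not None)  # False
print(hour_24.match("13") is not None)       # True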

pholser commented 2 months ago

See also #2133

Ingmarvdg commented 2 months ago

Hi @pholser, I am using the spark-df version of soda-core, and it seems the problem is not yet corrected:

from soda.scan import Scan

scan = Scan()
scan.set_scan_definition_name("tmp")
scan.add_spark_session(spark, data_source_name="tmp")
scan.set_data_source_name("tmp")  # must match the data_source_name registered above

scan.add_sodacl_yaml_str(
    """
checks for tmp:
    - invalid_count(datetime_string) = 0:
        valid format: date iso 8601
"""
)

df = spark.createDataFrame(
    [("1623-10-11T10:10:10.0000+01:00",)], schema=["datetime_string"] # This should pass
)
df.createOrReplaceTempView("tmp")

result_code = scan.execute()
results = scan.get_logs_text()
print(results)

Worse yet, the error occurs during scan execution instead of resulting in a failed check:

[19:05:02] Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100

[19:05:02] Error occurred while executing scan.
  | 'DataFrame' object has no attribute 'offset'
INFO   | Soda Core 3.3.10
INFO   | Using DefaultSampler
ERROR  | Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100 | 
ERROR  | Error occurred while executing scan. | 'DataFrame' object has no attribute 'offset'
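
For reference, the old pattern can also be evaluated directly with rlike, outside the scan machinery, to separate the regex question from the scan-execution error above (a sketch that reuses the spark df from the snippet, with the pattern copied from the log):

# Old pattern as a Python raw string (single backslashes; the log shows them
# doubled because they are embedded in a SQL string literal).
old_pattern = r"^ *(19|20)\d\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\d(-?[1-7])?|[0-3]\d\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$"

df.select(df["datetime_string"].rlike(old_pattern).alias("matches_old_pattern")).show()
# 1623 falls outside (19|20)\d\d, so this shows false under the old pattern.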

pholser commented 2 months ago

@Ingmarvdg -- OK, I didn't know about the spark-df version of soda-core. Has it incorporated the above change? I'm satisfied that the ISO 8601 date check is improved with the change in soda-core itself.

pholser commented 2 months ago

The regex being fed to the query still appears to be the old incorrect one. I don't believe you've incorporated the soda-core update above into your setup.

pholser commented 2 months ago

@Ingmarvdg what version are you using? I doubt the fix is in a released version yet.

Ingmarvdg commented 2 months ago

I pulled the most recent version of the main branch, then ran 'pip install .' from the spark-df folder. That should include your changes, right?

pholser commented 2 months ago

Perhaps I changed something that didn't affect your bug.

pholser commented 2 months ago

Try also https://github.com/sodadata/soda-core/pull/2133

I added a format test for "1623-10-11T10:10:10.0000+01:00", and it seems to pass when the core tests are run for spark-df: https://github.com/sodadata/soda-core/actions/runs/9962160324/job/27525299481