Open Ingmarvdg opened 3 months ago
SAS-3692
Same goes for any valid format
value that would involve the year
or year4
regexes: date inverse
, for example.
hi thanks for reporting, could you verify if this is not fixed by https://github.com/sodadata/soda-core/pull/2128?
@Ingmarvdg would you be willing to try out the above fix for your case?
@m1n0 @Ingmarvdg I can confirm that #2128 corrects the problem. However, I noticed also that ISO 8601 dates should accept 24-hr times, and currently they don't. I am issuing another PR to correct.
See also #2133
Hi @pholser I am using the spark-df version of soda-core, and it seems the problem is not yet corrected:
from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("tmp")
scan.add_spark_session(spark, data_source_name="tmp")
scan.set_data_source_name("y")
scan.add_sodacl_yaml_str(
"""
checks for tmp:
- invalid_count(datetime_string) = 0:
valid format: date iso 8601
"""
)
df = spark.createDataFrame(
[("1623-10-11T10:10:10.0000+01:00",)], schema=["datetime_string"] # This should pass
)
df.createOrReplaceTempView("tmp")
result_code = scan.execute()
results = scan.get_logs_text()
print(results)
Worse yet, the error has to do with the scan execution instead of resulting in a fail:
[19:05:02] Query execution error in y.a.failed_rows[invalid_count]:
SELECT * FROM y
WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$'))
LIMIT 100
[19:05:02] Error occurred while executing scan.
| 'DataFrame' object has no attribute 'offset'
INFO | Soda Core 3.3.10
INFO | Using DefaultSampler
ERROR | Query execution error in y.a.failed_rows[invalid_count]:
SELECT * FROM y
WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$'))
LIMIT 100 |
ERROR | Error occurred while executing scan. | 'DataFrame' object has no attribute 'offset'
@Ingmarvdg -- ok, didn't know about the spark-df version of soda-core. Has it incorporated the above change? I'm satisfied that the iso 8601 date check is improved with the change in soda-core itself.
The regex being fed to the query still appears to be the old incorrect one. I don't believe you've incorporated the soda-core update above into your setup.
@Ingmarvdg what version are you using? I doubt the fix is in a released version yet.
I pulled the most recent version of the main branch, then 'pip install .' from the spark-df folder. That should give your changes right?
Perhaps I changed something that didn't affect your bug.
Try also https://github.com/sodadata/soda-core/pull/2133
I added a format test for "1623-10-11T10:10:10.0000+01:00", and it seems to pass when the core tests are run for spark-df: https://github.com/sodadata/soda-core/actions/runs/9962160324/job/27525299481
When checking if a datetime string is ISO 8601 compliant, some valid datetimes fail.
Dates in the 10th month.
The date
2020-10-11
fails while the dates2020-09-11
and2020-11-12
are fine. It seems this is caused by this section in the regular expression?((0[0-9]|1[12])
that should be?((0[0-9]|1[0-2])
.Dates before 1900 or after 2099
The dates
2100-01-01
and1899-01-01
fail due to this section in the regular expression*(19|20)[[:digit:]][[:digit:]]
.