opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Value validation of `Field` classes needs to be improved #1601

Open Ariana-B opened 2 weeks ago

Ariana-B commented 2 weeks ago

The ability to validate a field value is inconsistent between Field classes, as is the point at which a field's value is evaluated (and, by extension, the source of the resulting error). PgDocField and its subclasses have a parse_value method. When non-string fields (e.g. IntDocField, DateDocField, but not SimpleDocField) override it to cast the value to a specific type, it raises a ValueError or similar, depending on the library used. NativeField, however, has no such method. Unless external logic calls parse_value explicitly, as datacube-explorer does when handling URL queries, it is not invoked until extract or evaluate is called. For all Fields, this means that an invalid value often surfaces as a SQLAlchemy error. The time field is typically an outlier in this regard, as it is more often passed to some sort of datetime function before it has the chance to be extracted or evaluated via DateDocField.
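The split described above can be illustrated with a simplified stand-in model. The classes below are hypothetical and are not the datacube-core implementations; their names and the try_parse helper are assumptions made for illustration. They only mirror the behaviour described: a string-like doc field whose parse_value passes values through, typed subclasses that override it to cast, and a native field with no parse_value at all.

```python
from datetime import datetime


class SketchDocField:
    """Stand-in for a string-like doc field: parse_value does no validation."""

    def __init__(self, name):
        self.name = name

    def parse_value(self, value):
        return value


class SketchIntDocField(SketchDocField):
    """Stand-in for a typed doc field: casting makes bad input fail early."""

    def parse_value(self, value):
        return int(value)  # raises ValueError for input such as "asdf"


class SketchDateDocField(SketchDocField):
    """Stand-in for a date doc field, using the standard library parser."""

    def parse_value(self, value):
        return datetime.fromisoformat(value)  # raises ValueError for "asdf"


class SketchNativeField:
    """Stand-in for a native field: there is no parse_value hook."""

    def __init__(self, name):
        self.name = name


def try_parse(field, value):
    """Report what happens when a caller tries to validate a value up front."""
    parse = getattr(field, "parse_value", None)
    if parse is None:
        return "no parse_value: validation deferred"
    try:
        parse(value)
    except (TypeError, ValueError) as exc:
        return f"rejected early: {type(exc).__name__}"
    return "accepted without a validation error"


print(try_parse(SketchIntDocField("cloud_cover"), "asdf"))
print(try_parse(SketchDocField("dataset_maturity"), 1))
print(try_parse(SketchNativeField("id"), "asdf"))
```

In this model only the typed fields reject a bad value before any query is built; the string-like field accepts an integer without complaint, and the native field offers nothing to call.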

These discrepancies can be highlighted by comparing results when calling dc.find_datasets and dc.index.datasets.search with invalid values for fields of different types.

import datacube
dc = datacube.Datacube()
dc.find_datasets(product="ga_ls_wo_3", limit=1, time="asdf")
> ParserError: Unknown string format: asdf present at position 0

(This is raised via pandas_to_datetime(t, utc=True, infer_datetime_format=True).to_pydatetime() in Query.) Compare this to the same query when calling search directly:

list(dc.index.datasets.search(product="ga_ls_wo_3", limit=1, time="asdf"))
> DataError: (psycopg2.errors.InvalidDatetimeFormat) invalid input syntax for type timestamp: "asdf"

We see similar results with spatial fields such as lat. For most other fields though, both find_datasets and search raise the same error:

dc.find_datasets(product="ga_ls_wo_3", limit=1, dataset_maturity=1)
> ProgrammingError: (psycopg2.errors.UndefinedFunction) operator does not exist: text = integer
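The error above comes from PostgreSQL rather than from Python: the integer 1 reaches the query and is compared against a text value. One way such a mismatch could be caught earlier is a type check before any query is built. The helper below is a hypothetical sketch and not part of datacube-core; the function name and the error message are assumptions.

```python
def check_string_field_value(field_name, value):
    """Reject non-string values for a string-typed field before querying."""
    if not isinstance(value, str):
        raise ValueError(
            f"Field {field_name!r} expects a string, "
            f"got {type(value).__name__}: {value!r}"
        )
    return value


# A string passes through unchanged; an integer is rejected in Python,
# before it could reach the database.
check_string_field_value("dataset_maturity", "final")
try:
    check_string_field_value("dataset_maturity", 1)
except ValueError as exc:
    print(exc)
```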

Providing a Range for a SimpleDocField also produces differing outcomes:

dc.find_datasets(product="ga_ls_wo_3", limit=1, dataset_maturity=["asdf", "asdf"])
> NotImplementedError: Simple field between expression

list(dc.index.datasets.search(product="ga_ls_wo_3", limit=1, dataset_maturity=["asdf", "asdf"]))
> []
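The comparisons above can be collected side by side with a small helper that runs each call and records either the exception type or the returned value. This is a minimal sketch using only the standard library; the helper names are assumptions, and the two stand-in functions below merely imitate the last pair of outcomes so that the example runs without a database.

```python
def outcome(call):
    """Run a zero-argument callable and summarise what happened."""
    try:
        result = call()
    except Exception as exc:  # broad on purpose: the aim is to record the error type
        return f"{type(exc).__name__}: {exc}"
    return f"returned {result!r}"


def compare(calls):
    """Map each label to the outcome of its callable."""
    return {label: outcome(call) for label, call in calls.items()}


# Stand-ins that imitate the two behaviours reported for a range on a simple field.
def fake_find_datasets():
    raise NotImplementedError("Simple field between expression")


def fake_search():
    return []


report = compare({"find_datasets": fake_find_datasets, "search": fake_search})
for label, result in report.items():
    print(f"{label}: {result}")
```

Against a real index, the callables would wrap the calls shown in this issue, for example lambda: dc.find_datasets(product="ga_ls_wo_3", limit=1, time="asdf") and lambda: list(dc.index.datasets.search(product="ga_ls_wo_3", limit=1, time="asdf")).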