omad opened 3 months ago
Interesting that the error is coming from a "LegacyCursorResult" object; I wonder if sqlalchemy has changed.
That's what it sounds like to me. Or ODC has changed...
@Ariana-B or @SpacemanPaul: I know that there's been work on updating to SQLAlchemy 2 in ODC. Is that only in ODC 1.9+ or was it in 1.8 as well?
I think this might even be a SQLAlchemy 1.4 thing (which I believe Explorer does use): https://stackoverflow.com/a/58660606 And from the docs:
> Within the scope of the 1.x series of SQLAlchemy, Core SQL results in version 1.4 return an instance of LegacyCursorResult which takes the place of the CursorResult class used for the 1.3 series and previously. This object returns rows as LegacyRow objects, which maintains Python mapping (i.e. dictionary) like behaviors upon the object itself. Going forward, the Row._mapping attribute should be used for dictionary behaviors.
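If I'm reading that right, the fix pattern is to materialise the rows first (so `dict()` iterates plain tuples) rather than passing the result object itself; under 1.4 that would be roughly `dict(result.fetchall())`, or going through `Result.mappings()`. A stdlib-only sketch using sqlite3, since the point is dict-from-row-tuples rather than SQLAlchemy itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two (region_code, count)-style rows, built with a VALUES table expression.
rows = conn.execute(
    "SELECT * FROM (VALUES ('au-1', 3), ('au-2', 5))"
).fetchall()

# fetchall() returns a plain list of 2-tuples, which dict() consumes by
# iterating pairs -- no mapping protocol involved.
print(dict(rows))  # {'au-1': 3, 'au-2': 5}
```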
I agree that force-refresh is the correct move here — it should replace everything Explorer knows about the product. Will look into the crash, but to answer the other Qs:
What can we do to make Explorer resilient to Database modifications? Ideally we need to fix ODC to support mutating the database, but there's going to be a few assumptions about datasets not being allowed to be deleted that will have flow on impacts.
The issue is that ODC's tables let us efficiently see recently added/updated/archived rows in the ODC schema, but not deleted rows (as they disappear completely).
To handle deletions efficiently (ie, incrementally, without scanning the whole table) I believe we'd need to extend ODC's tables to add a "deleted row log" table that's populated via a trigger. Explorer could do that if we really want it.
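A rough sketch of what such a trigger-populated "deleted row log" could look like. This uses SQLite purely for illustration; ODC's actual schema is PostgreSQL, and the table and column names here (`dataset`, `deleted_dataset_log`) are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, product TEXT);
    CREATE TABLE deleted_dataset_log (
        id INTEGER,
        deleted_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- The trigger records every deleted dataset id, so an incremental
    -- refresh can find deletions without scanning the whole table.
    CREATE TRIGGER log_dataset_delete AFTER DELETE ON dataset
    BEGIN
        INSERT INTO deleted_dataset_log (id) VALUES (old.id);
    END;
""")
conn.execute("INSERT INTO dataset VALUES (1, 'ls8'), (2, 'ls9')")
conn.execute("DELETE FROM dataset WHERE id = 1")
print(conn.execute("SELECT id FROM deleted_dataset_log").fetchall())  # [(1,)]
```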
The "ideal" way to do it at the moment without code changes is to archive the datasets first, run cubedash-gen so it will see those datasets as recently archived, and then we're free to delete them from ODC. This was never seen as a large priority, as we used to say prod datasets would only ever be archived (so the provenance record remains), but that's idealistic :)
Does this SQL query make any sense? Should a 'QA' feature like this be included in Explorer?
It does, though it's a broad brush to use. The raw product count is one part of ~three. A more thorough query could be put on the audit pages of Explorer, checking, for instance, the spatial table counts (which should have deletions when a dataset is removed/archived), the combined year counts, and how recent the summary is compared to the most-recently-changed datasets in the underlying ODC.
SQLAlchemy 2 is only in datacube-1.9.
1.8 should still be on SQLAlchemy 1.4, but the later 1.4.x versions have a lot of changes to help users prepare for the migration to 2.0; it might be one of those.
I don't know if it is related to this, but after fixing the missing lxml_html_clean dependency (#584), many tests on the current HEAD of the develop branch also crash with a TypeError on LegacyCursorResult:
```
...
        region_counts = Counter(
>           dict(
                self._engine.execute(
                    select(
                        [
                            DATASET_SPATIAL.c.region_code.label("region_code"),
                            func.count(),
                        ]
                    )
                    .where(where_clause)
                    .group_by("region_code")
                )
            )
        )
E       TypeError: 'LegacyCursorResult' object is not subscriptable

cubedash/summary/_summarise.py:170: TypeError
```
The LegacyCursorResult type errors were first seen on develop after @omad merged #580 on 18 March, and it looks like commit e0aa69f3d34903d7feb722a4e209525e5f975cad introduced the error (commit a6c10790966fe1271424124885a6aa6364c1384d just before was also red, but the integration tests passed for that).
Had a second look; the change in https://github.com/opendatacube/datacube-explorer/pull/580#discussion_r1554719430 isn't obvious to me.
Good catch @pjonsson!
That auto-modernise does break it. Here's a simple reproduction that throws the same exception:
```python
from sqlalchemy import create_engine, select
from collections import Counter

engine = create_engine('sqlite:///:memory:')
result2 = Counter(
    dict(
        engine.execute(
            select(
                [
                    select(1).label("item"),
                    select(2).label("count"),
                ]
            )
        )
    )
)
```
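I believe the underlying mechanism is CPython's `dict()` constructor: when its argument has a `.keys()` method, it takes the mapping path and subscripts the object, and 1.4's `LegacyCursorResult` has `.keys()` but no `__getitem__`. A stdlib-only stand-in (the class name is made up) that raises the same error:

```python
class FakeResult:
    """Stand-in for SQLAlchemy 1.4's LegacyCursorResult: it is iterable
    and has .keys(), but does not support indexing."""
    def keys(self):
        return ["region_code", "count"]
    def __iter__(self):
        return iter([("au-1", 3), ("au-2", 5)])

# dict() sees .keys() and tries result[key] instead of iterating pairs:
try:
    dict(FakeResult())
except TypeError as e:
    print(e)  # 'FakeResult' object is not subscriptable
```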
But that's not the only bad part of the auto-modernise: it's also changing the `== None` lines in sqlalchemy queries to `is None`. They have different semantics, as the `is` operator can never be overridden to be part of sqlalchemy's API (but `__eq__` can). This could drastically change some logic. (The modern spelling would be `.is_(None)`, which didn't exist when Explorer was first written.)

This might be a good excuse to move from shed over to ruff, as we have with other repositories. Formatting is fine and we don't need the "modernisation" aspects.
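To make the semantic difference concrete, here's a toy stand-in (not SQLAlchemy itself): `==` dispatches to `__eq__`, which an ORM can override to build an `IS NULL` expression, while `is` is a fixed identity check that Python never delegates:

```python
class Column:
    """Toy stand-in for a SQLAlchemy column: __eq__ builds an expression."""
    def __eq__(self, other):
        if other is None:
            return "col IS NULL"  # SQLAlchemy returns an IS NULL expression here
        return f"col = {other!r}"

col = Column()
print(col == None)  # 'col IS NULL' -- the overridable path the query relies on
print(col is None)  # False -- plain identity check, so the NULL filter is lost
```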
Reverting those changes in #587
I also just switched to Ruff from black/isort/flake8 and have good experiences so far.
I don't know anything about shed, but I know Ruff lint has a modernisation category that can upgrade various things, because it was eager to switch from `Optional[t]` to `t | None` on Python 3.10.
While managing the DEA ODC databases, we've had to do some manual deletions and updates to the `agdc` database tables, and I suspect this has made Explorer unhappy.

We've used a 'QA' SQL query in the past, which compares the number of datasets Explorer knows about vs the number of datasets ODC knows about.
Which shows a bunch of mismatched Products:
In an attempt to update this, I thought I could run `cubedash-gen --force-refresh` on each of the out-of-sync products; however, when I try this, I get a crash.

This did seem to succeed in getting the dataset counts in sync, but I'm not sure what state it has left the database in.
Questions
Versions
Docker Image:
docker.io/opendatacube/explorer@sha256:026ac9f06ad41abdf62dd5c8dc3206f6595bc25564f029bbcec574e64806d317
Explorer version: 2.10.2.dev20+ge0aa69f3
Core version: 1.8.17
Python: 3.10.12
SQLAlchemy version: 1.4.52