opendatacube / datacube-explorer

Web-based exploration of Open Data Cube collections
Apache License 2.0

Out of Sync Explorer Tables + Crashes #581

Open omad opened 3 months ago

omad commented 3 months ago

While managing the DEA ODC Databases, we've had to do some manual deletions and updates to the agdc database tables, and I suspect this has made Explorer unhappy.

We've used a 'QA' SQL query in the past, which compares the number of datasets explorer knows about vs the number of datasets ODC knows about.

SELECT *, (OUTPUT_QUERY.agdc_ds_count - OUTPUT_QUERY.explorer_ds_count) as diff
FROM (
        select count(ds.id) as agdc_ds_count, p.dataset_count as explorer_ds_count, dt.id, p.name 
        from cubedash.product p, agdc.dataset_type dt, agdc.dataset ds 
        where 
            dt.name = p.name 
        and dt.id = ds.dataset_type_ref 
        and ds.archived is NULL 
        GROUP BY p.name, dt.id, p.dataset_count) AS OUTPUT_QUERY
WHERE NOT OUTPUT_QUERY.agdc_ds_count = OUTPUT_QUERY.explorer_ds_count;

This shows a bunch of mismatched products:

 agdc_ds_count | explorer_ds_count |  id  |            name            |  diff   
---------------+-------------------+------+----------------------------+---------
        419777 |            514922 |  479 | ga_ls_fc_3                 |  -95145
        415373 |            417815 |  478 | ga_ls_wo_3                 |   -2442
         39227 |             88696 | 1271 | ga_ls_wo_fq_cyear_3        |  -49469
        503454 |            503453 | 1798 | ga_s2am_ard_3              |       1
         65063 |            184167 | 1006 | ga_s2_ba_provisional_3     | -119104
          8639 |             10050 | 2953 | ga_s2ls_intertidal_cyear_3 |   -1411
         51381 |            204223 | 1600 | ga_s2_wo_provisional_3     | -152842

In an attempt to fix this, I thought I could run cubedash-gen --force-refresh on each of the out-of-sync products; however, when I try this, I get a crash:

cubedash-gen --force-refresh ga_s2ls_intertidal_cyear_3
Updating 1 products for LocalConfig<loaded_from=defaults, environment='default', config={'db_hostname': 'dea-dev-db', 'db_database': 'odc', 'db_username': 'explorer_writer', 'db_password': '***', 'index_driver': 'default', 'db_connection_timeout': '60'}>
Generating product summaries...
ga_s2ls_intertidal_cyear_3 refresh
2024-04-05T04:27:02.288888Z [warning  ] spatial_deletion_full_scan     after_date=None product_name=ga_s2ls_intertidal_cyear_3
2024-04-05T04:27:02.326885Z [warning  ] spatial_update.recreating_everything after_date=None product_name=ga_s2ls_intertidal_cyear_3
2024-04-05T04:33:35.828477Z [warning  ] forcing_refresh                product_name=ga_s2ls_intertidal_cyear_3
     ga_s2ls_intertidal_cyear_3 1900
2024-04-05T04:33:35.996976Z [error    ] product.error                  product=ga_s2ls_intertidal_cyear_3
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cubedash/generate.py", line 127, in generate_report
    result, updated_summary = store.refresh(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_stores.py", line 1528, in refresh
    self._recalculate_period(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_stores.py", line 1374, in _recalculate_period
    summary = self._summariser.calculate_summary(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_summarise.py", line 170, in calculate_summary
    dict(
TypeError: 'LegacyCursorResult' object is not subscriptable
ga_s2ls_intertidal_cyear_3 error
finished. 1 error

This did seem to succeed in getting the dataset counts in sync, but I'm not sure what state it has left the database in.

Questions

- What can we do to make Explorer resilient to database modifications?
- Does this SQL query make any sense? Should a 'QA' feature like this be included in Explorer?

Versions

Docker Image: docker.io/opendatacube/explorer@sha256:026ac9f06ad41abdf62dd5c8dc3206f6595bc25564f029bbcec574e64806d317
Explorer version: 2.10.2.dev20+ge0aa69f3
Core version: 1.8.17
Python: 3.10.12
SQLAlchemy Version: 1.4.52

jeremyh commented 3 months ago

Interesting that the error is coming from a "LegacyCursorResult" object; I wonder if SQLAlchemy has changed.

omad commented 3 months ago

That's what it sounds like to me. Or ODC has changed...

@Ariana-B or @SpacemanPaul: I know that there's been work on updating to SQLAlchemy 2 in ODC. Is that only in ODC 1.9+, or was it in 1.8 as well?

Ariana-B commented 3 months ago

I think this might even be a SQLAlchemy 1.4 thing (which I believe Explorer does use): https://stackoverflow.com/a/58660606

And from the docs:

Within the scope of the 1.x series of SQLAlchemy, Core SQL results in version 1.4 return an instance of LegacyCursorResult which takes the place of the CursorResult class used for the 1.3 series and previously. This object returns rows as LegacyRow objects, which maintains Python mapping (i.e. dictionary) like behaviors upon the object itself. Going forward, the Row._mapping attribute should be used for dictionary behaviors.
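
A minimal standalone sketch (in-memory SQLite, made-up column names) of both the failure and a 1.4-safe alternative:

    from sqlalchemy import create_engine, text

    engine = create_engine("sqlite:///:memory:")
    with engine.connect() as conn:
        result = conn.execute(text("SELECT 'r1' AS region_code, 3 AS total"))

        # dict() sees the .keys() method on LegacyCursorResult, takes the
        # mapping path, and tries result[key] -- which raises
        # TypeError: 'LegacyCursorResult' object is not subscriptable.
        # dict(result)

        # 1.4-safe: fetch the rows first (each Row still unpacks like a
        # tuple), or use row._mapping for dict-style access to a single row.
        print(dict(result.fetchall()))  # {'r1': 3}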

jeremyh commented 3 months ago

I agree that force-refresh is the correct move here — it should replace everything Explorer knows about the product. Will look into the crash, but to answer the other Qs:

What can we do to make Explorer resilient to database modifications?

Ideally we need to fix ODC to support mutating the database, but there are going to be a few assumptions about datasets not being allowed to be deleted that will have flow-on impacts.

The issue is that ODC's tables let us efficiently see recently added/updated/archived rows in the ODC schema, but not deleted rows (as they disappear completely).

To handle deletions efficiently (i.e. incrementally, without scanning the whole table), I believe we'd need to extend ODC's tables with a "deleted row log" table that's populated via a trigger. Explorer could do that if we really want it, along the lines of the sketch below.
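
A rough, untested sketch of that idea (all names are hypothetical; none of this is existing Explorer or ODC schema):

    from sqlalchemy import create_engine, text

    # Hypothetical log table, populated whenever an ODC dataset row is deleted.
    CREATE_LOG_TABLE = """
        CREATE TABLE IF NOT EXISTS cubedash.deleted_dataset_log (
            dataset_id       uuid        NOT NULL,
            dataset_type_ref smallint    NOT NULL,
            deleted_at       timestamptz NOT NULL DEFAULT now()
        )
    """

    CREATE_TRIGGER_FUNCTION = """
        CREATE OR REPLACE FUNCTION cubedash.log_dataset_deletion() RETURNS trigger AS $$
        BEGIN
            INSERT INTO cubedash.deleted_dataset_log (dataset_id, dataset_type_ref)
            VALUES (OLD.id, OLD.dataset_type_ref);
            RETURN OLD;
        END;
        $$ LANGUAGE plpgsql
    """

    # EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
    CREATE_TRIGGER = """
        CREATE TRIGGER dataset_deletion_log
            AFTER DELETE ON agdc.dataset
            FOR EACH ROW EXECUTE FUNCTION cubedash.log_dataset_deletion()
    """

    engine = create_engine("postgresql://explorer_writer@dea-dev-db/odc")
    with engine.begin() as conn:
        for ddl in (CREATE_LOG_TABLE, CREATE_TRIGGER_FUNCTION, CREATE_TRIGGER):
            conn.execute(text(ddl))

Explorer's incremental refresh could then scan deleted_dataset_log by deleted_at, the same way it already scans for recently added/archived rows.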

The "ideal" way to do it at the moment without code changes is to archive the datasets first, run cubedash-gen so it will see that those datasets as recently archived, and then we're free to delete them from ODC. This was never seen as a large priority as we used to say prod datasets would only ever be archived (so the provenance record remains), but that's idealistic :)

jeremyh commented 3 months ago

Does this SQL query make any sense? Should a 'QA' feature like this be included in Explorer?

It does, though it's a broad brush to use. The raw product count is one part of roughly three. A more thorough query could be put on the audit pages of Explorer, checking, for instance, the spatial table counts (which should have deletions when a dataset is removed/archived), the combined year counts, and how recent the summary is compared to the most-recently-changed datasets in the underlying ODC. A sketch of the first of those checks is below.
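
This assumes Explorer's cubedash.dataset_spatial table keys its rows by dataset_type_ref; double-check the schema before relying on it:

    from sqlalchemy import create_engine, text

    # Live (non-archived) ODC datasets per product vs Explorer's spatial rows.
    SPATIAL_VS_ODC = text("""
        SELECT dt.name,
               count(ds.id) AS agdc_live_count,
               (SELECT count(*)
                  FROM cubedash.dataset_spatial sp
                 WHERE sp.dataset_type_ref = dt.id) AS explorer_spatial_count
          FROM agdc.dataset_type dt
          LEFT JOIN agdc.dataset ds
            ON ds.dataset_type_ref = dt.id AND ds.archived IS NULL
         GROUP BY dt.id, dt.name
        HAVING count(ds.id) <> (SELECT count(*)
                                  FROM cubedash.dataset_spatial sp
                                 WHERE sp.dataset_type_ref = dt.id)
    """)

    engine = create_engine("postgresql://explorer_writer@dea-dev-db/odc")
    with engine.connect() as conn:
        for name, odc_count, spatial_count in conn.execute(SPATIAL_VS_ODC):
            print(f"{name}: odc={odc_count} explorer_spatial={spatial_count}")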

SpacemanPaul commented 3 months ago

SQLAlchemy 2 is only in datacube-1.9.

1.8 should still be on 1.4. But the later 1.4.x versions have a lot of changes to help users prepare for the migration to 2.0; this might be one of those.

pjonsson commented 3 months ago

I don't know if it is related to this, but after fixing the missing lxml_html_clean dependency (#584), many tests on current HEAD of the develop branch also crash with a TypeError on LegacyCursorResult:

...
            region_counts = Counter(
>               dict(
                    self._engine.execute(
                        select(
                            [
                                DATASET_SPATIAL.c.region_code.label("region_code"),
                                func.count(),
                            ]
                        )
                        .where(where_clause)
                        .group_by("region_code")
                    )
                )
            )
E           TypeError: 'LegacyCursorResult' object is not subscriptable
cubedash/summary/_summarise.py:170: TypeError

pjonsson commented 3 months ago

The LegacyCursorResult type errors were first seen on develop after @omad merged #580 on 18 March, and it looks like commit e0aa69f3d34903d7feb722a4e209525e5f975cad introduced the error (commit a6c10790966fe1271424124885a6aa6364c1384d just before was also red, but the integration tests passed for that).

pjonsson commented 2 months ago

Had a second look; https://github.com/opendatacube/datacube-explorer/pull/580#discussion_r1554719430 isn't obvious to me.

jeremyh commented 2 months ago

The LegacyCursorResult type errors were first seen on develop after @omad merged #580 on 18 March, and it looks like commit e0aa69f introduced the error (commit a6c1079 just before was also red, but the integration tests passed for that).

Good catch @pjonsson!

That auto-modernise does break it. Here's a simple reproduction that throws the same exception:

    from collections import Counter

    from sqlalchemy import create_engine, literal, select

    engine = create_engine('sqlite:///:memory:')

    # dict() takes the mapping path because the result object has a .keys()
    # method, then tries to subscript it -- raising the TypeError above.
    result2 = Counter(
        dict(
            engine.execute(
                select(
                    [
                        literal(1).label("item"),
                        literal(2).label("count"),
                    ]
                )
            )
        )
    )

But that's not the only bad part of the auto-modernise — it's also changing the == None lines in sqlalchemy queries to is None.

They have different semantics, as the is operator can never be overloaded to be part of sqlalchemy's API (but __eq__ can). This could drastically change some logic.
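
Concretely, with a made-up table:

    from sqlalchemy import Column, DateTime, MetaData, Table

    dataset = Table("dataset", MetaData(), Column("archived", DateTime))

    # __eq__ is overloaded by SQLAlchemy: this builds an SQL clause.
    print(dataset.c.archived == None)  # noqa: E711  ->  dataset.archived IS NULL

    # `is` cannot be overloaded: plain Python identity, always False here,
    # so the auto-modernised filter silently drops out of the query.
    print(dataset.c.archived is None)  # False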

jeremyh commented 2 months ago

This might be a good excuse to move from shed over to ruff, as we have with other repositories. Formatting is fine and we don't need the "modernisation" aspects.

jeremyh commented 2 months ago

Reverting those changes in #587

pjonsson commented 2 months ago

I also just switched to Ruff from black/isort/flake8 and have good experiences so far.

I don't know anything about shed, but I know Ruff's linter has a modernisation category that can upgrade various things; it was eager to switch from Optional[t] to t | None on Python 3.10.