gluent / goe

GOE: a simple and flexible way to copy data from an Oracle Database to Google BigQuery.
Apache License 2.0
8 stars 2 forks source link

Refactor offload metadata and interactions in orchestration #52

Closed abb9979 closed 8 months ago

abb9979 commented 9 months ago

The GOE metadata repository and supporting database objects were refactored by issue 18. Minimal work was done in the Python as an interim "shim" to enable orchestration to use the existing method args and metadata dict keys. The orchestration therefore needs to be refactored to complete the integration.

Things to note:

  1. There are a number of TODO comments in the code for immediate attention (search for TODO: Issue 18)
  2. The existing metadata dict has a number of attributes that have been remapped to the new Oracle type attributes when getting/saving. In some cases, the developer may wish to remove the existing metadata attributes in favour of the new ones, though in other cases the old key names may be so ubiquitous in the orchestration code that it's easier to retain the old name (for example, INCREMENTAL_% dict keys are now named OFFLOAD_% in the Oracle type)
  3. The existing metadata dict has a number of attributes that are no longer relevant and need to be removed
  4. The various metadata methods (get, save etc) are keyed on hybrid_[owner|view] args. The "shim" prevented this from being an issue with the first push, but the hybrid_[owner|view] concept needs to be removed and all metadata methods should be keyed on frontend owner/name
  5. The current command execution UUID is part of the metadata that needs to be saved. Historically, the execution_id value was added as an additional arg to the save_offload_metadata methods, but this "loose" arg needs to be cleaned up. The execution UUID should be in the metadata dict
abb9979 commented 9 months ago

Another item to fix as part of this refactor is to add a hard reset of metadata when using --reset-backend-table, using delete_offload_metadata.

Currently there is no delete of old metadata, which means that when the resetting offload runs a save_offload_metadata later in the process, it will find and update an existing metadata record rather than insert a new one. This leaves the original execution UUID and original offload version from the old offload in place.

abb9979 commented 9 months ago

Note as a result of Issue 60, the metadata_dict['OFFLOAD_SCN'] key is back, but mapped to/from an Oracle attribute named OFFLOAD_SNAPSHOT. Renaming the dict key to match the Oracle attribute could be another item for this issue.

nj1973 commented 9 months ago

Changes made:

  1. I've dealt with "TODO: Issue 18"
  2. Removed attributes that are no longer relevant
  3. Removed the hybrid owner shim
  4. Merged execution id into metadata rather than it being an independent parameter
  5. Renamed OFFLOAD_SCN to OFFLOAD_SNAPSHOT in Python code.

Still to do:

  1. Consider renaming dictionary keys to match new repo names
  2. Delete old metadata when using --reset-backend-table
  3. Investigate failing integration test tests/integration/persistence/test_orchestration_metadata.py, looks like GOE_VERSION is advancing when it shouldn't be
nj1973 commented 9 months ago

I'm going to merge these changes to main and re-open this issue to work on the outstanding items.

nj1973 commented 9 months ago

Reopening to work on outstanding items.

nj1973 commented 8 months ago

I merged to get an important bug fix onto main.

Completed tasks:

Still to do:

nj1973 commented 8 months ago

@abb9979, in Oracle procedure offload_repo.get_offload_metadata we return v.offload_version_current as the metadata version. This deviates from prior behaviour which returned the initial offload version.

As far as I can see we no longer use the version in metadata for anything, the original intention was in case we needed to vary behaviour depending on version, I think it was related to synthetic bucketing using a different algorithm after a specific version but continuing to use a deprecated algorithm for older offloaded tables.

Anyway, I have some failing tests that put and get hardcoded metadata expecting it to be unchanged but it now comes back with a different version. I either need to change the tests to match the new reality or change the code to match the old behaviour and return COMMAND_EXECUTION_INITIAL in place of COMMAND_EXECUTION_CURRENT. I wanted to get your thoughts.

abb9979 commented 8 months ago

@nj1973, as long as the COMMAND_EXECUTION_CURRENT is provided at the time of saving, then we can change the get() to give you the initial values for version and execution.