great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

GCS Data Docs lose previous test results; they are recreated whenever a test run starts. #3705

Closed gabwon9 closed 1 year ago

gabwon9 commented 3 years ago

Describe the bug: Data Docs are created in a Google Cloud Storage (GCS) bucket. They are overwritten each time a new test is executed via cron job, i.e. the output always contains only one result.

To Reproduce Steps to reproduce the behavior:

  1. In a Google Cloud environment, schedule a cron job that executes the Great Expectations test suite.
  2. In the Great Expectations Python script, the checkpoint is executed like result = context.run_checkpoint("checkpoint_feedback")
  3. The test result should automatically update the Data Docs in the GCS bucket.
  4. Review the Data Docs after the test run is done.

Expected behavior: The previous test results in the Data Docs remain.

Actual behavior: The newly executed test result overwrites the Data Docs. As a result, the previous test results are removed.

Environment (please complete the following information):

Additional context: The expected behavior is seen when the test is executed locally while pointing to the GCS resources, i.e. the Data Docs retain all executed test results.

Test Script Examples

  1. The test script can be executed with 'python start.py'.
  2. start.py executes 'context.run_checkpoint("checkpoint_feedback")', and UpdateDataDocsAction is defined in checkpoint_feedback.yml.

great_expectations.yml

config_version: 3.0

datasources:
  local_datasource:
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine
    class_name: Datasource
    module_name: great_expectations.datasource
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
        module_name: great_expectations.datasource.data_connector
        base_directory: ./data
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
  gcs_datasource:
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine
    class_name: Datasource
    module_name: great_expectations.datasource
    data_connectors:
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
      gcs_inferred_input_data_connector:
        prefix: aggregate/
        class_name: InferredAssetGCSDataConnector
        default_regex:
          pattern: (.*)
          group_names:
            - data_asset_name
        module_name: great_expectations.datasource.data_connector
        bucket_or_name: gcs-input_eedback
config_variables_file_path: config/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/en/latest/reference/core_concepts/evaluation_parameters.html
    class_name: EvaluationParameterStore

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  gs_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleGCSStoreBackend
      project: test-pjt
      bucket:  google-bucket-feedback
      prefix: implicit-feedback
      base_public_path: https://www.test.com/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: true
  data_context_id: a44380b8-c86d-4da8-9bf6-89989e8bc39b
concurrency:
  enabled: false
notebooks:

checkpoint_feedback.yml

name: checkpoint_feedback
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-feedback-${data_file_name}"
expectation_suite_name:
batch_request:
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
  - name: send_slack_notification_on_validation_result # name can be set to any value
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${validation_notification_slack_webhook}
      notify_on: failure # possible values: "all", "failure", "success"
      notify_with: # optional list containing the DataDocs sites to include in the notification. Defaults to including links to all configured sites.
        - gs_site
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: gcs_datasource
      data_connector_name: gcs_inferred_input_data_connector
      data_asset_name: ${data_file}
      data_connector_query:
        index: -1
    expectation_suite_name: expect_feedback_validation_suite
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:

test.py

import os

import great_expectations


def run_test():
    context = great_expectations.get_context()
    assets = context.get_available_data_asset_names()

    if assets is not None:
        gcs_datas = assets["gcs_datasource"]["gcs_inferred_input_data_connector"]

        for data_file in gcs_datas:
            # checkpoint_feedback.yml references these as ${data_file_name} and ${data_file}.
            os.environ["data_file_name"] = os.path.basename(data_file)
            os.environ["data_file"] = data_file

            try:
                result = context.run_checkpoint("checkpoint_feedback")
            except Exception as ex:
                print("Unexpected Exception : {}, {}".format(ex, type(ex)))

        # context.build_data_docs()


if __name__ == "__main__":
    run_test()

cdkini commented 3 years ago

Hey @GabwonPark! Thanks so much for opening up this issue.

In order to best diagnose the issue at hand, it would be really helpful to have the underlying configuration you're using. Would you mind providing your data context config, scripts, and any other pertinent information (please omit any sensitive details)?

We'll review that information and try to determine what's going on.

gabwon9 commented 3 years ago

@cdkini I added script examples in description above. Please check Test Script Examples. Thank you.

cdkini commented 3 years ago

@GabwonPark if you actually look in the bucket, are the doc sites being rewritten each time? How about the validation results?

Could you provide your configs for your stores (expectations, validations, and checkpoints)?

Just looking for some additional context so we can narrow down the issue!
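
One way to answer the question about the bucket contents is to list the objects under the docs prefix before and after a cron run and compare the two listings. The sketch below is only an illustration: the helper names are hypothetical, `client` is assumed to be a `google.cloud.storage.Client`, and the default bucket and prefix are copied from the `gs_site` config in the description.

```python
def list_docs_objects(client, bucket_name, prefix):
    """Return (name, updated) pairs for every object under the prefix, sorted by name.

    `client` is assumed to be a google.cloud.storage.Client, or any object
    exposing a compatible list_blobs(bucket_name, prefix=...) method.
    """
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    pairs = [(blob.name, blob.updated) for blob in blobs]
    return sorted(pairs, key=lambda pair: pair[0])


def print_docs_objects(client, bucket_name="google-bucket-feedback", prefix="implicit-feedback"):
    # Defaults mirror the gs_site store_backend in the posted config (assumption).
    for name, updated in list_docs_objects(client, bucket_name, prefix):
        print(f"{updated}  {name}")
```

If older validation pages disappear from the listing after a run, the site is being replaced rather than extended; if they remain but the index no longer links to them, only the index is being rebuilt.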

gabwon9 commented 3 years ago

@cdkini I updated great_expectations.yml under 'Test Script Examples' in the description. Is that enough? The doc sites are rewritten each time. Only the validation results from the current execution are kept; in other words, old validation results are removed. I followed the guide below for the V3 API. https://legacy.docs.greatexpectations.io/en/stable/guides/how_to_guides/configuring_data_docs/how_to_host_and_share_data_docs_on_gcs.html

gabwon9 commented 2 years ago

@cdkini Is there any update? Or do you need any other data?

cdkini commented 2 years ago

Hey @GabwonPark! Apologies for the delayed response.

Our team is still reviewing the details you've provided. I believe this should be sufficient but I'll let you know if we need anything else when debugging the issue. Please note that our team is out for the remainder of the week so we'll address this early next week.

Thanks!

gabwon9 commented 2 years ago

@cdkini Thank you for your reply!

cdkini commented 2 years ago

@GabwonPark have you run the CLI docs build or context.build_data_docs() at all before running your script?

My initial hunch here is that the UpdateDataDocsAction is running in an environment where the docs are not built. Do you actually run the context.build_data_docs() line in your script?

gabwon9 commented 2 years ago

@cdkini No, I did not do that. I don't use context.build_data_docs() to update the Data Docs for each test result. As you know, I use only UpdateDataDocsAction. Should I call context.build_data_docs() before each new test execution to protect the existing Data Docs? We could also use Slack for a faster response.

cdkini commented 2 years ago

@GabwonPark I would try initializing your docs sites with that build command and see if it makes a difference. I'll work to set up a GCS-based project so I can replicate your environment in the meanwhile. Let me know if things work out!
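
A minimal sketch of that suggestion, assuming `context` is the object returned by `great_expectations.get_context()` and that the site to initialize is the `gs_site` entry from the posted config. The helper name is hypothetical.

```python
def build_site_then_run(context, checkpoint_name="checkpoint_feedback", site_name="gs_site"):
    """Build the named Data Docs site once, then run the checkpoint.

    `context` is assumed to be a Great Expectations data context.
    """
    # Build the site from whatever the configured stores currently contain.
    context.build_data_docs(site_names=[site_name])
    # The checkpoint's UpdateDataDocsAction is then expected to add the new result.
    return context.run_checkpoint(checkpoint_name=checkpoint_name)
```

In the posted test.py this would replace the bare `context.run_checkpoint("checkpoint_feedback")` call, or the build step could be run once before the loop.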

gabwon9 commented 2 years ago

@cdkini Thank you for your support!

cdkini commented 2 years ago

@GabwonPark sure thing! Did that happen to work?

gabwon9 commented 2 years ago

@cdkini No effect. The docs are still removed when a test starts. I think the data under 'uncommitted' on the local machine replaces the Data Docs in GCP instead of being merged into them.

cdkini commented 2 years ago

@GabwonPark apologies for the delay. Could you please confirm that this is still an issue? Additionally, would you mind trying your script and overall approach on your local filesystem? I'm curious whether the same overwriting happens in a different environment.

gabwon9 commented 2 years ago

@cdkini This is still an issue, and it does not occur on the local filesystem. It is reproducible in the GCP environment. Did you try testing in a GCP environment?

cdkini commented 2 years ago

@GabwonPark I'm not able to reproduce your issue unfortunately.

Could you try adding your gcs site name to this part of the config:

  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []

You may also have better success using batch requests to pass in variables (as opposed to the env var assignment you're using in your script).

Finally, if you check uncommitted/validations, do you see anything? Is that empty, does it contain a single validation that's being overwritten, or something else?

I'm still looking into the matter but please ensure that the configuration you've provided me is still accurate. Thanks!
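
As a rough illustration of passing the per-file values as arguments instead of environment variables, the sketch below builds the batch request in Python and hands it to `run_checkpoint`. It rests on several assumptions: that the installed version's `run_checkpoint` accepts `run_name_template` and `validations` keyword overrides (worth checking against its signature), that the `${...}`-based `run_name_template` and `validations` entries are removed from checkpoint_feedback.yml so nothing runs twice, and that `context` comes from `great_expectations.get_context()`. The helper names are hypothetical.

```python
import os


def build_validations(data_file):
    """Build the runtime `validations` list for one GCS asset."""
    return [
        {
            "batch_request": {
                "datasource_name": "gcs_datasource",
                "data_connector_name": "gcs_inferred_input_data_connector",
                "data_asset_name": data_file,
                "data_connector_query": {"index": -1},
            },
            "expectation_suite_name": "expect_feedback_validation_suite",
        }
    ]


def run_for_asset(context, data_file):
    """Run checkpoint_feedback for one asset without touching os.environ."""
    # Same pattern as the YAML run_name_template, with the file name filled in.
    run_name_template = "%Y%m%d-%H%M%S-feedback-" + os.path.basename(data_file)
    return context.run_checkpoint(
        checkpoint_name="checkpoint_feedback",
        run_name_template=run_name_template,
        validations=build_validations(data_file),
    )
```

Inside the loop in test.py, `run_for_asset(context, data_file)` would take the place of the two `os.environ` assignments and the `run_checkpoint` call.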

gabwon9 commented 2 years ago

@cdkini Thank you for your reply. Did you test in a gcloud environment or on your local laptop when trying to reproduce this issue? I think the issue can be reproduced in a gcloud environment but not on a local laptop, since the local site works well.

Additionally,

  1. For update_data_docs, I already tested it. No effect.
  2. For batch requests, I will try this. If you have an example, please share it.
  3. For uncommitted/validations, I can't verify this clearly because the Google Cloud Kubernetes container is terminated after the Great Expectations test suite is executed. The execution process in Kubernetes is: schedule a cron job in the Google Cloud Kubernetes environment -> the cron job starts a new job -> a new container is created -> the Great Expectations test suite runs in the new container and the test result is saved to the GCS bucket -> the container is terminated.

And happy new year!
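
Since the container is gone once the job finishes, one option is to log the contents of the validations store from inside the script just before it exits, so the information ends up in the job logs. This is a sketch under assumptions: the helper names are hypothetical, results are assumed to be stored as `.json` files, and the default path is the `base_directory` of `validations_store` in the posted config, relative to the directory that holds great_expectations.yml.

```python
from pathlib import Path


def list_validation_results(base_dir="uncommitted/validations"):
    """Return the relative paths of all .json files under base_dir, sorted."""
    base = Path(base_dir)
    if not base.is_dir():
        return []
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*.json"))


def report_validation_results(base_dir="uncommitted/validations"):
    """Print a short summary that will show up in the container logs."""
    results = list_validation_results(base_dir)
    print(f"{len(results)} validation result file(s) under {base_dir}")
    for path in results:
        print("  " + path)
    return results
```

Calling `report_validation_results()` at the end of `run_test()` would show whether the store in each fresh container holds only the results of the current run.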

talagluck commented 2 years ago

Hi @GabwonPark - sorry for the delay on this!

We still haven't been able to reproduce this on our side. That said, another user recently reported a similar issue. In their case, one of their environments had a Docker volume mounted into the image that contained another GE config directory, which caused a conflict. Because two GE configs were in use in the local environment, one was treated as a new configuration and overwrote the other. They resolved this by removing the Docker volume that conflicted with the config.

Does this sound like a viable approach for you as well? It would be great to hear if this could work for you.
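
A quick way to check for the situation described above is to search the container's filesystem for more than one great_expectations.yml. The sketch below is a hypothetical helper, not part of Great Expectations; the root directory to scan is an assumption and should be whatever directory the image and its mounted volumes live under.

```python
from pathlib import Path


def find_ge_configs(root):
    """Return the path of every great_expectations.yml found under root, sorted."""
    return sorted(p.as_posix() for p in Path(root).rglob("great_expectations.yml"))


def has_conflicting_configs(root):
    """True when more than one GE config file exists under root."""
    return len(find_ge_configs(root)) > 1
```

Printing `find_ge_configs(...)` at container start would show whether a mounted volume brings in a second config directory.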

github-actions[bot] commented 2 years ago

Is this issue still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity.

It will be closed if no further activity occurs. Thank you for your contributions 🙇