sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.83k stars 194 forks source link

Yaml emitter error while executing scans concurrently #2064

Open Electrogenic opened 2 months ago

Electrogenic commented 2 months ago

Hi Soda team.

Background

We are using the soda-core-trino version 3.3.1 library inside of a Fast API server. The following code works when handling a single request at a time. However, when multiple requests come in at the same time (even just 2), we are running into the Yaml emitter error found in the stacktrace below. I'm wondering if there might be some thread safety issues going on within the ruaml yaml emitter, or in the way it is being used.

scan = Scan()
scan.set_data_source_name('default')
# Trino data source is configured
scan.add_configuration_yaml_str('<REDACTED>')
# I simplified the checks for this example, but I'm using about 240 in actuality.
checks = """
    checks for my_table:
    - schema:
        name: Check that field foo is present
        fail:
            when required column missing:
            - foo
    """
scan.add_sodacl_yaml_str(checks)
soda_scan_status_code = scan.execute()

Stacktrace

   File "/workspace/app/routes/v1/contracts/routes.py", line 419, in execute_soda_scann
     soda_scan_status_code = scan.execute()
   File "/workspace/.venv/lib/python3.10/site-packages/soda/scan.py", line 638, in execute
     self.scan_results = self.build_scan_results()
   File "/workspace/.venv/lib/python3.10/site-packages/soda/scan.py", line 72, in build_scan_results
     checks = [check.get_dict() for check in self._checks if check.outcome is not None and check.archetype is None]
   File "/workspace/.venv/lib/python3.10/site-packages/soda/scan.py", line 72, in <listcomp>
     checks = [check.get_dict() for check in self._checks if check.outcome is not None and check.archetype is None]
   File "/workspace/.venv/lib/python3.10/site-packages/soda/execution/check/check.py", line 363, in get_dict
     "identity": self.create_identity(with_datasource=True, with_filename=True),
   File "/workspace/.venv/lib/python3.10/site-packages/soda/execution/check/check.py", line 234, in create_identity
     identity_source_configurations_yaml = to_yaml_str(identity_source_configurations)
   File "/workspace/.venv/lib/python3.10/site-packages/soda/common/yaml_helper.py", line 13, in to_yaml_str
     return YamlHelper.to_yaml(yaml_object)
   File "/workspace/.venv/lib/python3.10/site-packages/soda/common/yaml_helper.py", line 24, in to_yaml
     cls.__yaml.dump(yaml_object, stream)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/main.py", line 563, in dump
     self._context_manager.dump(data)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/main.py", line 913, in dump
     self._yaml.representer.represent(data)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/representer.py", line 82, in represent
     self.serializer.serialize(node)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/serializer.py", line 109, in serialize
     self.serialize_node(node, None, None)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/serializer.py", line 188, in serialize_node
     self.emitter.emit(
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/emitter.py", line 257, in emit
     self.state()
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/emitter.py", line 643, in expect_first_block_sequence_item
     return self.expect_block_sequence_item(first=True)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/emitter.py", line 663, in expect_block_sequence_item
     self.expect_node(sequence=True)
   File "/workspace/.venv/lib/python3.10/site-packages/ruamel/yaml/emitter.py", line 467, in expect_node
     raise EmitterError('expected NodeEvent, but got {self.event!s}')
 ruamel.yaml.emitter.EmitterError: expected NodeEvent, but got {self.event!s}

Let me know your thoughts, or what else I could do to help. Thank you!

tools-soda commented 2 months ago

SAS-3303

bastienboutonnet commented 2 months ago

I think the best at this stage would be to follow your intuition and see if indeed you can make it work, then we'll gladly welcome a PR. At this point, using the package in a fastAPI app hasn't really been a use-case we've tested or given much thought about so if you think you're on to something we're open to support you through that PR.

Otherwise, I'll be more than happy to log-in a feature request for you and understand a bit more the use-case and so on.

Electrogenic commented 2 months ago

Gotchya! Good to know that about FastAPI as a use case. We've only just started the integration. So nothing in prod yet. But happy to say it's been pretty smooth sailing thus far. Save for this one thorny issue.

I'd be happy to take a crack at it though, and see if I can't figure out what's going on here.

m1n0 commented 2 months ago

Hmm I have seen this before with parallel scan executions - we were trying to parallelize partitioned scans, where we loaded config and checks from files and then passed those into multiple parallel scans. I did not find a quick root-cause fix but I found a way to make it work - I started adding config and checks as strings to scans (i.e. separate object instances) instead of one file stream object. This solved the issue for me, but it seems like you are already doing that. Did you by any chance get the error when you used files instead of yaml strings?

m1n0 commented 1 month ago

Any feedback or success on this issue @Electrogenic ?

Electrogenic commented 1 month ago

Any feedback or success on this issue @Electrogenic ?

@m1n0 Argh I'm terribly sorry. I haven't been look into this one since initially opening the issue. For our project we ended up writing some custom schema validators running in the context of Fast API.

I'm not too certain that I'll have the time to dig into this further. Though I'll do my best.