Closed matthiasgomolka closed 2 months ago
Hi there, I am not able to reproduce this issue.
Here is how I set mine up:
import great_expectations as gx
context = gx.get_context(mode="file")
# Define the Data Source's parameters (This path is relative to the `base_directory` of the Data Context):
source_folder = "../"
data_source_name = "filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)
# Define the Data Asset's parameters:
asset_name = "parquet_file"
# Add the Data Asset to the Data Source:
add_asset = data_source.add_parquet_asset(name=asset_name)
data_asset_name = "file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(asset_name)
batch_definition_name = "batch-def"
batch_definition_path = "gx-project-334/characters.parquet"
batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)
Can you try using an absolute path, just to confirm that the path is correct relative to the base directory? Or can you outline your file hierarchy and where your data is located?
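The path check suggested here can be done outside of GX by resolving the batch path against the base directory and testing whether the result exists. The following is a minimal sketch using only the standard library. The helper name `resolve_batch_path` and the `context_root` parameter are made up for illustration, and the sketch assumes (as the code comment above states) that a relative `base_directory` is interpreted relative to the Data Context directory:

```python
from pathlib import Path


def resolve_batch_path(context_root: str, base_directory: str, batch_path: str) -> Path:
    """Join context root, base directory, and batch path into one absolute path.

    If base_directory is already absolute, pathlib ignores context_root.
    """
    return (Path(context_root) / base_directory / batch_path).resolve()


if __name__ == "__main__":
    # Values taken from the snippet above; "." stands in for the context root
    # and is an assumption about where the script is run from.
    resolved = resolve_batch_path(".", "../", "gx-project-334/characters.parquet")
    print(resolved, "exists:", resolved.exists())
```

If the printed path is not the one you expect, the base directory is probably being resolved from a different starting point than you assumed.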
I already tried using absolute paths. But I think I know the problem by now. I managed to define a batch with `.add_batch_definition_yearly()` and a regex. When I did this, I figured out that path delimiters must be specified as on Windows: `\` instead of `/`.
I haven't tried it yet, but I would assume this also holds for `.add_batch_definition_path()`.
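The separator observation can be checked in isolation with the standard `re` module. The sketch below is an illustration under assumptions, not how GX matches paths internally: it builds a yearly pattern with a named `year` group and accepts either `/` or `\` between path components, so the same pattern matches on both platforms. The directory and file names are invented for the example.

```python
import re
from typing import Optional

# Character class that accepts either a backslash or a forward slash.
SEP = r"[\\/]"

# Hypothetical layout: data/characters_<year>.parquet
yearly_pattern = re.compile(
    r"data" + SEP + r"characters_(?P<year>\d{4})\.parquet"
)


def extract_year(path: str) -> Optional[str]:
    """Return the four-digit year captured from the path, or None if it does not match."""
    match = yearly_pattern.fullmatch(path)
    return match.group("year") if match else None
```

Calling `extract_year` with both a `/`-separated and a `\`-separated path shows whether the separator alone decides if a pattern matches.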
After further review, I am seeing something I believe is unexpected. Let me pass this along and follow up with you. I am not seeing the error you have, which indicates an incorrect path, but I am now seeing this:
No file at base_directory path "/Users/x/x/x/x/gx-project-334/characters.parquet" matched glob_directive "**/*" for DataAsset "parquet_file".
This happens even though the file certainly exists and matches the glob pattern. While I investigate further, can you confirm whether what you tried resulted in something different?
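To see what a `**/*` glob matches under a given directory, independent of GX, a small standard-library sketch can help. The helper name `list_glob_matches` is hypothetical. Note that the message above reports a path ending in `characters.parquet` as the base_directory path; if that path is a file rather than a directory, there is nothing inside it to glob, and the helper returns an empty list in that case.

```python
from pathlib import Path
from typing import List


def list_glob_matches(base_directory: str, glob_directive: str = "**/*") -> List[str]:
    """List files under base_directory that match glob_directive.

    Paths are returned relative to base_directory with forward slashes.
    Returns an empty list when base_directory is not a directory.
    """
    base = Path(base_directory)
    if not base.is_dir():
        return []
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(glob_directive)
        if p.is_file()
    )
```

Running the helper once on the intended data folder and once on the full file path makes it easy to tell a wrong base directory apart from a pattern that really matches nothing.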
I tried again yesterday and was then able to define a batch using `add_batch_definition_path()`. I think there are several intertwined issues:
[...] `re` library.
Okay, I am glad you were able to get unblocked. I will make a note of this so we can update our documentation. Thank you!
Describe the bug
I want to add batches to a pandas filesystem asset (a parquet file) as described here: https://docs.greatexpectations.io/docs/core/connect_to_data/filesystem_data/?batch_definition=path&partition_type=yearly
But I get an error which states that the
This seems like a bug, because the file definitely exists (verified with `pathlib.Path().exists()`).
To Reproduce
Please include your great_expectations.yml config, the code you’re executing that causes the issue, and the full stack trace of any error(s).
great_expectations.yml
```yaml
# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 4.0

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials

config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  validation_definition_store:
    class_name: ValidationDefinitionStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: validation_definitions/

expectations_store_name: expectations_store
validation_results_store_name: validation_results_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. The are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

analytics_enabled: true
fluent_datasources:
  my_ds:
    type: pandas_filesystem
    id: b4b3e487-73bb-4eb9-84f7-8f4003de3afe
    assets:
      my_asset:
        type: parquet
        id: f7b5139a-9b62-41f0-a899-d47f7cbc122c
    base_directory: my_path
data_context_id: 1b8ad082-88c4-49cb-86c9-fd9d2eaa7cd2
```
My Code:
Stacktrace:
Expected behavior
I should be able to add batches based on existing file paths.
Environment (please complete the following information):
Additional context
Add any other context about the problem here.