Open kisel4363 opened 2 years ago
Which version of pyarrow are you using?
Hi, I'm having the same problem when using the Reader class. I'm using petastorm version 0.11.5 and pyarrow version 9.0.0.
Please try using fs = FilesystemResolver(cache_dir_url).filesystem() to get a filesystem that should be compatible with the Reader class. A pyarrow.fs filesystem instance cannot currently be used with petastorm, since that would require supporting pyarrow's ParquetDataset v2, which is not supported just yet. See #613.
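For illustration, here is a minimal sketch of that workaround; the cache URL is a placeholder, and it uses make_batch_reader (whose filesystem keyword appears in the traceback further down) rather than constructing Reader directly. As later comments show, this may still fail on newer pyarrow versions unless pyarrow is pinned.

from petastorm import make_batch_reader
from petastorm.fs_utils import FilesystemResolver

# Placeholder URL; FilesystemResolver returns a legacy-style filesystem
# (e.g. s3fs) that petastorm passes through to pyarrow.
cache_dir_url = "s3://my-bucket/petastorm-cache"
fs = FilesystemResolver(cache_dir_url).filesystem()

with make_batch_reader(cache_dir_url, filesystem=fs) as reader:
    for batch in reader:
        # each batch is a namedtuple of numpy arrays, one entry per parquet column
        print(batch)
        break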
Hi @selitvin, I'm getting the same error when trying to write a petastorm dataset to cloud storage (ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API), and this workaround doesn't seem to work for me, except that I'm using resolver.filesystem_factory() instead of resolver.filesystem() to fit the materialize_dataset() API.
I am using:
from petastorm.fs_utils import FilesystemResolver
from petastorm.etl.dataset_metadata import materialize_dataset

dataset_url = "gs://bucket/dataset-name"
resolver = FilesystemResolver(dataset_url)
with materialize_dataset(spark, dataset_url, schema, filesystem_factory=resolver.filesystem_factory()):
    # spark logic here
    # essentially the same as the example in the docs (sketched below)
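For reference, the Spark logic inside the with block follows the hello-world pattern from the petastorm docs; a rough sketch, assuming schema is a petastorm Unischema and row_generator is a hypothetical function returning a dict keyed by the Unischema field names:

from petastorm.unischema import dict_to_spark_row

# Build spark rows from plain dicts and write them as parquet under the dataset URL.
rows_rdd = spark.sparkContext.parallelize(range(100)) \
    .map(row_generator) \
    .map(lambda row_dict: dict_to_spark_row(schema, row_dict))

spark.createDataFrame(rows_rdd, schema.as_spark_schema()) \
    .write.mode('overwrite') \
    .parquet(dataset_url)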
I'm not sure whether your suggestion isn't working for me because I'm writing rather than reading, because of the time that's passed since you originally posted, or because I'm doing something stupid! Do you know how I can resolve this?
I'm also getting the same error despite using the filesystem resolver (which yields an s3fs filesystem):
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
from petastorm.fs_utils import FilesystemResolver
s3_url = f"s3://{S3_BASE}"
fs = FilesystemResolver(f"s3://{S3_BASE}").filesystem()
print(type(fs))
with make_batch_reader(s3_url, filesystem=fs) as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
Output:
<class 's3fs.core.S3FileSystem'>
Trace:
File ~/pln_bin/quickstart/py_env/quickstart/lib/python3.11/site-packages/petastorm/reader.py:300, in make_batch_reader(dataset_url_or_urls, schema_fields, reader_pool_type, workers_count, results_queue_size, seed, shuffle_rows, shuffle_row_groups, shuffle_row_drop_partitions, predicate, rowgroup_selector, num_epochs, cur_shard, shard_count, shard_seed, cache_type, cache_location, cache_size_limit, cache_row_size_estimate, cache_extra_settings, hdfs_driver, transform_spec, filters, storage_options, zmq_copy_buffers, filesystem)
292 filesystem, dataset_path_or_paths = get_filesystem_and_path_or_paths(
293 dataset_url_or_urls,
294 hdfs_driver,
295 storage_options=storage_options,
296 filesystem=filesystem
297 )
299 try:
--> 300 dataset_metadata.get_schema_from_dataset_url(dataset_url_or_urls, hdfs_driver=hdfs_driver,
301 storage_options=storage_options, filesystem=filesystem)
302 warnings.warn('Please use make_reader (instead of \'make_batch_dataset\' function to read this dataset. '
303 'You may get unexpected results. '
304 'Currently make_batch_reader supports reading only Parquet stores that contain '
305 'standard Parquet data types and do not require petastorm decoding.')
306 except PetastormMetadataError:
File ~/pln_bin/quickstart/py_env/quickstart/lib/python3.11/site-packages/petastorm/etl/dataset_metadata.py:402, in get_schema_from_dataset_url(dataset_url_or_urls, hdfs_driver, storage_options, filesystem)
389 """Returns a :class:`petastorm.unischema.Unischema` object loaded from a dataset specified by a url.
390
391 :param dataset_url_or_urls: a url to a parquet directory or a url list (with the same scheme) to parquet files.
(...)
396 :return: A :class:`petastorm.unischema.Unischema` object
397 """
398 fs, path_or_paths = get_filesystem_and_path_or_paths(dataset_url_or_urls, hdfs_driver,
399 storage_options=storage_options,
400 filesystem=filesystem)
--> 402 dataset = pq.ParquetDataset(path_or_paths, filesystem=fs, validate_schema=False, metadata_nthreads=10)
404 # Get a unischema stored in the dataset metadata.
405 stored_schema = get_schema(dataset)
File ~/pln_bin/quickstart/py_env/quickstart/lib/python3.11/site-packages/pyarrow/parquet/core.py:1776, in ParquetDataset.__new__(cls, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit, thrift_string_size_limit, thrift_container_size_limit)
1773 use_legacy_dataset = False
1775 if not use_legacy_dataset:
-> 1776 return _ParquetDatasetV2(
1777 path_or_paths, filesystem=filesystem,
1778 filters=filters,
1779 partitioning=partitioning,
1780 read_dictionary=read_dictionary,
1781 memory_map=memory_map,
1782 buffer_size=buffer_size,
1783 pre_buffer=pre_buffer,
1784 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
1785 # unsupported keywords
1786 schema=schema, metadata=metadata,
1787 split_row_groups=split_row_groups,
1788 validate_schema=validate_schema,
1789 metadata_nthreads=metadata_nthreads,
1790 thrift_string_size_limit=thrift_string_size_limit,
1791 thrift_container_size_limit=thrift_container_size_limit,
1792 )
1793 warnings.warn(
1794 "Passing 'use_legacy_dataset=True' to get the legacy behaviour is "
1795 "deprecated as of pyarrow 11.0.0, and the legacy implementation "
1796 "will be removed in a future version." + extra_msg,
1797 FutureWarning, stacklevel=2)
1798 self = object.__new__(cls)
File ~/pln_bin/quickstart/py_env/quickstart/lib/python3.11/site-packages/pyarrow/parquet/core.py:2407, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
2403 for keyword, default in [
2404 ("metadata", None), ("split_row_groups", False),
2405 ("validate_schema", True), ("metadata_nthreads", None)]:
2406 if keyword in kwargs and kwargs[keyword] is not default:
-> 2407 raise ValueError(
2408 "Keyword '{0}' is not yet supported with the new "
2409 "Dataset API".format(keyword))
2411 # map format arguments
2412 read_options = {
2413 "pre_buffer": pre_buffer,
2414 "coerce_int96_timestamp_unit": coerce_int96_timestamp_unit,
2415 "thrift_string_size_limit": thrift_string_size_limit,
2416 "thrift_container_size_limit": thrift_container_size_limit,
2417 }
ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API
Here is how I fixed it. I am using S3 (MinIO) as the cache directory for petastorm; the objective was to create a TensorFlow dataset from the petastorm cache data in MinIO. Since petastorm uses pyarrow underneath, there have been dependency issues. Follow these steps:
Write your AWS credentials to the file ~/.aws/credentials (use the default profile).
pip install pyarrow==10.0.1. Don't use any version greater than this one: versions above 10.0.1 do not support the validate_schema keyword, and I do not know why!
pip install petastorm==0.12.1
from petastorm.fs_utils import FilesystemResolver
from petastorm.spark import SparkDatasetConverter, make_spark_converter
import pyarrow.parquet as pq

path_or_paths = 's3a://bucket/key'  # point at the dataset directory, don't mention the parquet file
pq.ParquetDataset(
    path_or_paths=path_or_paths,
    filesystem=FilesystemResolver(path_or_paths).filesystem(),
    validate_schema=False,
).read().to_pandas()
# do whatever you want to do after this
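Since the stated objective was a TensorFlow dataset built from the MinIO cache, here is a rough sketch of the converter path using the SparkDatasetConverter import above; the cache URL and the Spark DataFrame df are placeholders:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Point the converter's materialization cache at the MinIO bucket (placeholder URL).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 's3a://bucket/petastorm-cache')

converter = make_spark_converter(df)  # df is an existing Spark DataFrame
with converter.make_tf_dataset() as tf_dataset:
    for batch in tf_dataset.take(1):
        print(batch)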
Note: pyarrow==10.0.1 only supports builds with -D_GLIBCXX_USE_CXX11_ABI=0. When building with gcc version >= 6, the default ABI is -D_GLIBCXX_USE_CXX11_ABI=1, so importing pyarrow (loading its .so) will fail with an error.
Hi, I'm using petastorm to feed TensorFlow models launched with Spark on an EMR cluster. The code is the basic setup to read parquet files on S3:
It throws the following error:
How can this issue be resolved? Thanks