zwsjink opened 1 year ago
I tried setting ENDPOINT via os.environ, and the error becomes different:
```
23/07/15 05:03:41 INFO DAGScheduler: Job 2 finished: count at :0, took 0.282508 s
count: 5177571
Starting the downloading of this file
Sharding file number 1 of 1 called /mybucket/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet
Traceback (most recent call last):
  File "/opt/spark/work-dir/download.py", line 49, in <module>
    download(
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/main.py", line 250, in download
    distributor_fn(
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/distributor.py", line 64, in pyspark_distributor
    failed_shards = run(reader)
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/distributor.py", line 57, in run
    for batch in batcher(gen, subjob_size):
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/distributor.py", line 52, in batcher
    for first in iterator:
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/reader.py", line 183, in __iter__
    shards, number_shards = self._save_to_arrow(input_file, start_shard_id)
  File "/usr/local/lib/python3.10/dist-packages/img2dataset/reader.py", line 95, in _save_to_arrow
    with self.fs.open(input_file, mode="rb") as file:
  File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1199, in open
    f = self._open(
  File "/usr/local/lib/python3.10/dist-packages/ossfs/base.py", line 257, in _open
    return OSSFile(
  File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1555, in __init__
    self.size = self.details["size"]
  File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1568, in details
    self._details = self.fs.info(self.path)
  File "/usr/local/lib/python3.10/dist-packages/ossfs/utils.py", line 68, in wrapper
    result = func(ossfs, path, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ossfs/core.py", line 559, in info
    result = super().info(path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 648, in info
    raise FileNotFoundError(path)
FileNotFoundError: /mybucket/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
```
It automatically removes the 'oss://' prefix; maybe something is wrong with the ossfs logic.
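For context, the scheme stripping itself is normal fsspec behavior: the protocol is removed before the path reaches the backend, which is why the traceback reports `/mybucket/...` instead of `oss://mybucket/...`. A minimal sketch of this (bucket name and credentials below are placeholders, not the reporter's values):

```python
import fsspec

# fsspec resolves the URL to a filesystem instance plus a scheme-less path;
# any error raised by the backend will therefore show the stripped path.
fs, path = fsspec.core.url_to_fs(
    "oss://mybucket/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet",
    endpoint="https://oss-cn-hangzhou.aliyuncs.com",  # placeholder endpoint
    key="xxxxxx",     # placeholder accessKeyId
    secret="xxxxx",   # placeholder accessKeySecret
)
print(type(fs).__name__)  # OSSFileSystem (from the ossfs package)
print(path)               # scheme is gone, matching the FileNotFoundError path
```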
Did you fix this? It seems like an issue with fsspec.
Nope. Later on, I worked around this issue by downloading to NAS first and then starting from there.
Currently, I'm trying to use this piece of code to download images:
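(The snippet itself isn't quoted above; the following is a rough, hypothetical sketch of a job with this shape. The bucket path, URL column, and output folder are placeholders, not the author's actual code.)

```python
from pyspark.sql import SparkSession
from img2dataset import download

spark = SparkSession.builder.appName("img2dataset-oss").getOrCreate()

# Sanity check: read the parquet through Spark's JVM-side OSS support.
df = spark.read.parquet("oss://mybucket/input.snappy.parquet")  # placeholder path
print("count:", df.count())  # this step succeeds

# The download step: img2dataset re-opens the same input itself, via fsspec/ossfs.
download(
    url_list="oss://mybucket/input.snappy.parquet",  # placeholder path
    input_format="parquet",
    url_col="url",                    # placeholder column name
    output_folder="/mnt/nas/images",  # placeholder output location
    distributor="pyspark",
)
```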
The print logic on the third line works well (meaning reading a file from OSS is fine and my Spark conf is valid), but when it comes to the download part, it throws an error like the one above.
I've already set the following Spark conf:

```
"spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
"spark.hadoop.fs.oss.endpoint": "xxxxxxx.aliyuncs.com"
"spark.hadoop.fs.oss.accessKeyId": "xxxxxx"
"spark.hadoop.fs.oss.accessKeySecret": "xxxxx"
```
It looks like the img2dataset reader cannot use the same configuration.
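That would match the two code paths involved: `spark.hadoop.fs.oss.*` only configures the JVM-side `AliyunOSSFileSystem`, while img2dataset's reader opens the input through fsspec/ossfs in Python, which never sees the Hadoop conf. One possible way to feed ossfs its own credentials is fsspec's environment-variable configuration (`FSSPEC_<PROTOCOL>_<OPTION>`); a minimal sketch, assuming the ossfs backend accepts `endpoint`/`key`/`secret` keyword arguments as its README shows (all values are placeholders):

```python
import os

# fsspec reads these at import time and passes them as default kwargs when it
# constructs the "oss" filesystem, so set them before importing fsspec or
# img2dataset (e.g. in the environment that launches the Spark job), and make
# sure the workers see them too, not just the driver.
os.environ["FSSPEC_OSS_ENDPOINT"] = "https://oss-cn-hangzhou.aliyuncs.com"  # placeholder
os.environ["FSSPEC_OSS_KEY"] = "xxxxxx"    # accessKeyId (placeholder)
os.environ["FSSPEC_OSS_SECRET"] = "xxxxx"  # accessKeySecret (placeholder)
```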