Closed: kutnyakhov closed this issue 2 years ago.
From the description of the problem, I cannot understand what is going on. Could you copy-paste the full error message?
It does not create a parquet file for those runs and raises a FileNotFoundError. Here is the complete error message:
```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [19], in <module>

File ~/hextof-processor/src/processor/LabDataframeCreator.py:203, in LabDataframeCreator.readRuns(self, run_numbers, path, parquet_path)
    198 avaliable_runs = self.getAvailableRuns(path=path)
    199 files_to_read = [
    200     filepath for run_number, filepath in avaliable_runs.items()
    201     if run_number in run_numbers
    202 ]
--> 203 return self.readData(files_to_read, parquet_path=parquet_path)

File ~/hextof-processor/src/processor/LabDataframeCreator.py:312, in LabDataframeCreator.readData(self, files_to_read, parquet_path)
    310 else:
    311     print(f'Loading {len(self.prq_names)} dataframes. Failed reading {len(self.filenames)-len(self.prq_names)} files.')
--> 312 self.dfs = [dd.read_parquet(fn) for fn in self.prq_names]  # todo skip pandas, as dask only should work
    313 self.fillNA()
    315 self.dd = dd.concat(self.dfs).repartition(npartitions=len(self.prq_names))

File ~/hextof-processor/src/processor/LabDataframeCreator.py:312, in <listcomp>(.0)
--> 312 self.dfs = [dd.read_parquet(fn) for fn in self.prq_names]  # todo skip pandas, as dask only should work
    313 self.fillNA()
    315 self.dd = dd.concat(self.dfs).repartition(npartitions=len(self.prq_names))

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py:400, in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, **kwargs)
    397         raise ValueError("read_parquet options require gather_statistics=True")
    398     gather_statistics = True
--> 400 read_metadata_result = engine.read_metadata(
    401     fs,
    402     paths,
    403     categories=categories,
    404     index=index,
    405     gather_statistics=gather_statistics,
    406     filters=filters,
    407     split_row_groups=split_row_groups,
    408     chunksize=chunksize,
    409     aggregate_files=aggregate_files,
    410     ignore_metadata_file=ignore_metadata_file,
    411     metadata_task_size=metadata_task_size,
    412     **kwargs,
    413 )
    415 # In the future, we may want to give the engine the
    416 # option to return a dedicated element for common_kwargs.
    417 # However, to avoid breaking the API, we just embed this
    418 # data in the first element of parts for now.
    419 # The logic below is intended to handle backward and forward
    420 # compatibility with a user-defined engine.
    421 meta, statistics, parts, index = read_metadata_result[:4]

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py:862, in FastParquetEngine.read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, chunksize, aggregate_files, ignore_metadata_file, metadata_task_size, **kwargs)
    844 @classmethod
    845 def read_metadata(
    846     cls,
   (...)
    860
    861     # Stage 1: Collect general dataset information
--> 862     dataset_info = cls._collect_dataset_info(
    863         paths,
    864         fs,
    865         categories,
    866         index,
    867         gather_statistics,
    868         filters,
    869         split_row_groups,
    870         chunksize,
    871         aggregate_files,
    872         ignore_metadata_file,
    873         metadata_task_size,
    874         **kwargs,
    875     )
    877     # Stage 2: Generate output meta
    878     meta = cls._create_dd_meta(dataset_info)

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py:473, in FastParquetEngine._collect_dataset_info(cls, paths, fs, categories, index, gather_statistics, filters, split_row_groups, chunksize, aggregate_files, ignore_metadata_file, metadata_task_size, **kwargs)
    469 else:
    470     # Rely on metadata for 0th file.
    471     # Will need to pass a list of paths to read_partition
    472     scheme = get_file_scheme(fns)
--> 473 pf = ParquetFile(
    474     paths[:1], open_with=fs.open, root=base, **dataset_kwargs
    475 )
    476 pf.file_scheme = scheme
    477 pf.cats = paths_to_cats(fns, scheme)

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/fastparquet/api.py:113, in ParquetFile.__init__(self, fn, verify, open_with, root, sep, fs, pandas_nulls)
    111 fs = getattr(open_with, "self", None)
    112 if isinstance(fn, (tuple, list)):
--> 113     basepath, fmd = metadata_from_many(fn, verify_schema=verify,
    114                                        open_with=open_with, root=root,
    115                                        fs=fs)
    116     self.fn = join_path(basepath, '_metadata') if basepath \
    117         else '_metadata'
    118     self.fmd = fmd

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/fastparquet/util.py:179, in metadata_from_many(file_list, verify_schema, open_with, root, fs)
    176 elif all(not isinstance(pf, api.ParquetFile) for pf in file_list):
    178     if verify_schema or fs is None or len(file_list) < 3:
--> 179         pfs = [api.ParquetFile(fn, open_with=open_with) for fn in file_list]
    180     else:
    181         # activate new code path here
    182         f0 = file_list[0]

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/fastparquet/util.py:179, in <listcomp>(.0)
--> 179         pfs = [api.ParquetFile(fn, open_with=open_with) for fn in file_list]

File ~/.conda/envs/hextof-lab/lib/python3.8/site-packages/fastparquet/api.py:165, in ParquetFile.__init__(self, fn, verify, open_with, root, sep, fs, pandas_nulls)
    163     self.fs = fs
    164 else:
--> 165     raise FileNotFoundError
    166 self.open = open_with
    167 self._statistics = None

FileNotFoundError:
```
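For what it's worth, the empty message matches the bare `raise FileNotFoundError` at `fastparquet/api.py:165` in the last frame: fastparquet raises the exception class without a message when a path it is handed does not exist on disk. A minimal sketch reproducing just that frame (the filename is a placeholder):

```python
# Handing fastparquet a non-existent path hits the bare
# `raise FileNotFoundError` shown at api.py:165 above,
# so the resulting exception carries no message.
from fastparquet import ParquetFile

try:
    ParquetFile("no_such_file.parquet")  # placeholder path
except FileNotFoundError as err:
    print(repr(err))  # -> FileNotFoundError()
```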
I was trying to figure this out. I think it is probably related to fastparquet; switching to pyarrow might fix the problem. What do you think, Steinn? A problem like this has happened at least once before.
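If we want to test that, the engine switch in dask is a one-line change. A minimal sketch, assuming the same list of paths that `readData` iterates over (here a placeholder standing in for `self.prq_names`):

```python
import dask.dataframe as dd

# Placeholder list standing in for self.prq_names in LabDataframeCreator.readData
prq_names = ["run65.parquet", "run66.parquet"]

# Force the pyarrow engine instead of letting dask fall back to fastparquet
dfs = [dd.read_parquet(fn, engine="pyarrow") for fn in prq_names]
```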
I am not sure what's happening here, but it seems like the parquet file was not generated. At least, I cannot see it on asap3 in the parquet folder (no parquet for run 65).
If you have seen this before and blame fastparquet, I have nothing against using pyarrow. Can you try and see if that fixes it?
As I wrote at the beginning: I managed to read and create a parquet for one of those files once, but then got an error while reading the next file, and I deleted the parquet from the already-analysed run. Since then I have not been able to read any of those runs, and no new parquets are created. I know it sounds strange, but I can show the notebook later today with the analysed data and the error message that appeared afterwards :)
Sounds like a bug on our side, probably in file naming and file detection. Let's have a look at it later then!
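One way naming can go wrong (an illustration of the suspicion, not confirmed from the code): if a parquet filename is derived from a string that already contains a '.', suffix-based path helpers silently truncate it, so the file that gets written and the file later looked up no longer match:

```python
from pathlib import Path

# A channel name containing a '.': with_suffix treats '.x' as an
# existing extension and replaces it, mangling the name.
print(Path("dldPos.x").with_suffix(".parquet"))  # -> dldPos.parquet
```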
I tried to reproduce this but couldn't, as I have collaborator status on the beamtime, which only grants read rights.
I've changed you to "participant" :)
The error comes from a channel name containing a '.' character. We should add a proper error message for this case.
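A sketch of such a guard, assuming the channel names are known before the parquet filenames are derived from them (the function name and message are illustrative, not the actual hextof-processor API):

```python
def validate_channel_names(channel_names):
    """Fail early with a readable message instead of a bare FileNotFoundError."""
    for name in channel_names:
        if "." in name:
            raise ValueError(
                f"Channel name {name!r} contains a '.', which corrupts the "
                "derived parquet file name. Please rename the channel."
            )

validate_channel_names(["dldPosX", "dldPos.x"])  # raises ValueError for 'dldPos.x'
```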
Somehow, it is impossible to read any of the files above scan65 from the same lab beamtime; the error raised during parquet creation is FileNotFoundError. Interestingly, I managed to read one of those files (scan69) and started analysing it, but I got an error while reading the next, similar file. For testing purposes, I deleted the just-created parquet file and reran the reading, but got an identical error. Restarting the kernel and the server didn't help.
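As a quick sanity check before calling `readRuns`, one can list which runs actually have a parquet file on disk (a sketch; the directory is a placeholder for the asap3 parquet folder):

```python
from pathlib import Path

parquet_dir = Path("/path/to/parquet")  # placeholder path

for run in range(65, 70):
    matches = sorted(parquet_dir.glob(f"*{run}*.parquet"))
    print(f"scan{run}:", [m.name for m in matches] or "no parquet found")
```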