microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: CSV loading not working on blob storage #497

Closed AlonsoGuevara closed 4 months ago

AlonsoGuevara commented 4 months ago

Describe the bug

I can't load CSV files from blob storage. The run fails with an error saying the data is empty.

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

input:
  type: blob
  file_type: csv
  source_column: ${GRAPHRAG_INPUT_SOURCE_COLUMN}
  timestamp_column: ${GRAPHRAG_INPUT_TIMESTAMP_COLUMN}
  timestamp_format: ${GRAPHRAG_INPUT_TIMESTAMP_FORMAT}
  storage_type: blob
  storage_account_blob_url: ${GRAPHRAG_STORAGE_ACCOUNT_NAME}
  container_name: ${GRAPHRAG_CONTAINER_NAME}
  document_attribute_columns: ${GRAPHRAG_INPUT_TEXT_ATTRIBUTE_COLUMNS}

Logs and screenshots


EmptyDataError                            Traceback (most recent call last)
File , line 51
     48 start_time = time.time()
     49 logger_util.print_and_log("Starting pipeline run")
---> 51 async for result in run_pipeline_with_config(pipelineConfig, debug=True):
     52     print(result)
     54 end_time = time.time()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/run.py:144, in run_pipeline_with_config(config_or_path, workflows, dataset, storage, cache, callbacks, progress_reporter, input_post_process_steps, additional_verbs, additional_workflows, emit, memory_profile, run_id, is_resume_run, **_kwargs)
    142 cache = cache or _create_cache(config.cache)
    143 callbacks = callbacks or _create_reporter(config.reporting)
--> 144 dataset = dataset if dataset is not None else await _create_input(config.input)
    145 post_process_steps = input_post_process_steps or _create_postprocess_steps(
    146     config.input
    147 )
    148 workflows = workflows or config.workflows

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/run.py:133, in run_pipeline_with_config.<locals>._create_input(config)
    130 if config is None:
    131     return None
--> 133 return await load_input(config, progress_reporter, root_dir)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/input/load_input.py:81, in load_input(config, progress_reporter, root_dir)
     77 progress = progress_reporter.child(
     78     f"Loading Input ({config.file_type})", transient=False
     79 )
     80 loader = loaders[config.file_type]
---> 81 results = await loader(config, progress, storage)
     82 return cast(pd.DataFrame, results)
     84 msg = f"Unknown input type {config.file_type}"

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/input/csv.py:126, in load(config, progress, storage)
    123     msg = f"No CSV files found in {config.base_dir}"
    124     raise ValueError(msg)
--> 126 files = [await load_file(file, group) for file, group in files]
    127 log.info("loading %d csv files", len(files))
    128 result = pd.concat(files)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/input/csv.py:126, in <listcomp>(.0)
    123     msg = f"No CSV files found in {config.base_dir}"
    124     raise ValueError(msg)
--> 126 files = [await load_file(file, group) for file, group in files]
    127 log.info("loading %d csv files", len(files))
    128 result = pd.concat(files)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphrag/index/input/csv.py:38, in load.<locals>.load_file(path, group)
     36     group = {}
     37 buffer = BytesIO(await storage.get(path, as_bytes=True))
---> 38 data = pd.read_csv(buffer, encoding=config.encoding or "latin-1")
     39 additional_keys = group.keys()
     40 if len(additional_keys) > 0:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1898, in TextFileReader._make_engine(self, f, engine)
   1895     raise ValueError(msg)
   1897 try:
-> 1898     return mapping[engine](f, **self.options)
   1899 except Exception:
   1900     if self.handles is not None:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:93, in CParserWrapper.__init__(self, src, **kwds)
     90 if kwds["dtype_backend"] == "pyarrow":
     91     # Fail here loudly instead of in cython after reading
     92     import_optional_dependency("pyarrow")
---> 93 self._reader = parsers.TextReader(src, **kwds)
     95 self.unnamed_cols = self._reader.unnamed_cols
     97 # error: Cannot determine type of 'names'

File parsers.pyx:581, in pandas._libs.parsers.TextReader.__cinit__()

EmptyDataError: No columns to parse from file
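
For anyone who lands here with the same traceback: the final error comes from pandas, not from GraphRAG itself. `pd.read_csv` raises `EmptyDataError` whenever the buffer it receives contains no bytes to parse. The snippet below is a minimal sketch that reproduces this with pandas alone. The helper name `try_read_csv` is hypothetical and is not part of GraphRAG. It only mirrors the `BytesIO` plus `read_csv` call shown in the `load_file` frame, and it says nothing about why the blob came back empty in this particular run.

```python
from io import BytesIO

import pandas as pd
from pandas.errors import EmptyDataError


def try_read_csv(raw: bytes, encoding: str = "latin-1"):
    """Parse raw CSV bytes; return None when pandas finds nothing to parse.

    Hypothetical helper for illustration only, not a GraphRAG function.
    """
    try:
        return pd.read_csv(BytesIO(raw), encoding=encoding)
    except EmptyDataError:
        # Raised with "No columns to parse from file" for a zero-byte buffer.
        return None


empty = try_read_csv(b"")                     # zero-byte payload
ok = try_read_csv(b"id,text\n1,hello\n")      # well-formed CSV payload
```

Here `empty` is `None` and `ok` is a one-row DataFrame, so checking the length of the bytes returned from storage before parsing is a quick way to tell an empty payload apart from a malformed one.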

Additional Information

AlonsoGuevara commented 4 months ago

After testing and finding a way to reproduce this, I can confirm it is not an issue. Resolving.