langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

CSVLoader does not raise `ValueError` for missing metadata columns #26434

Open m3et opened 6 days ago

m3et commented 6 days ago

Example Code

Steps to Reproduce

  1. Create a CSV file (e.g., demo_bug.csv):

    HEADER1,HEADER2,HEADER3
    data1,data2,data3
    data4,data5,data6
  2. Use the following Python code to load the CSV:

    from langchain_community.document_loaders import CSVLoader
    
    loader = CSVLoader(
        file_path="./demo_bug.csv",
        metadata_columns=("MISSING_HEADER", "HEADER1", "HEADER2", "HEADER3"),
    )
    
    loader.load()
  3. You will get the following traceback:

    Traceback (most recent call last):
      File "bugCSV.py", line 8, in <module>
        loader.load()
      File "base.py", line 30, in load
        return list(self.lazy_load())
               ^^^^^^^^^^^^^^^^^^^^^^
      File "csv_loader.py", line 147, in lazy_load
        raise RuntimeError(f"Error loading {self.file_path}") from e
    RuntimeError: Error loading ./demo_bug.csv

Expected Behavior

When a metadata column specified in metadata_columns does not exist in the CSV file, I expected the loader to raise a ValueError with a message like:

ValueError: Metadata column 'MISSING_HEADER' not found in CSV file.

Instead, the current implementation raises a generic RuntimeError, making it harder to debug the specific cause of the issue.

Error Message and Stack Trace (if applicable)

No response

Description

In the current implementation of CSVLoader within langchain_community.document_loaders.csv_loader, a generic RuntimeError is raised when an error occurs while loading the CSV file, even when the underlying issue is due to missing metadata columns. This masks the actual problem, making debugging more difficult for users.

Specifically, when a column specified in the metadata_columns parameter is not present in the CSV file, a more appropriate ValueError should be raised, indicating the missing column. However, due to broad exception handling in the lazy_load() method, this specific error is hidden behind a RuntimeError.
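The masking effect can be reproduced without LangChain. The sketch below is a simplified stand-in, not the library's code: `read_rows` and `lazy_load_sketch` are hypothetical names, and it assumes the row reader raises `ValueError` for a missing column while the outer generator wraps every `Exception`.

```python
import csv
import io
from typing import Dict, Iterator, Sequence

CSV_TEXT = "HEADER1,HEADER2,HEADER3\ndata1,data2,data3\n"


def read_rows(csvfile: io.StringIO, metadata_columns: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical row reader: raises ValueError for an unknown metadata column."""
    for row in csv.DictReader(csvfile):
        metadata = {}
        for column in metadata_columns:
            try:
                metadata[column] = row[column]
            except KeyError:
                raise ValueError(f"Metadata column '{column}' not found in CSV file.")
        yield metadata


def lazy_load_sketch(metadata_columns: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical loader: the broad handler wraps every error, ValueError included."""
    try:
        yield from read_rows(io.StringIO(CSV_TEXT), metadata_columns)
    except Exception as e:
        raise RuntimeError("Error loading <in-memory csv>") from e


# A caller that only handles ValueError never sees it.
try:
    list(lazy_load_sketch(("MISSING_HEADER", "HEADER1")))
except ValueError:
    outcome = "caught ValueError"
except RuntimeError as err:
    outcome = f"caught RuntimeError: {err}"

print(outcome)  # caught RuntimeError: Error loading <in-memory csv>
```

The `except ValueError` branch is never taken, which is why callers cannot react to the missing column specifically.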

Expected Behavior

When a metadata column specified by the user is missing from the CSV file, the loader should raise a ValueError, providing a clear message about the missing column, instead of the generic RuntimeError.
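As an illustration of the message described above, here is a minimal sketch of a header check a caller could run on their own. `check_metadata_columns` is a hypothetical helper, not part of `CSVLoader`, and it assumes the first line of the file is the header row.

```python
import csv
import io
from typing import Sequence, TextIO


def check_metadata_columns(csvfile: TextIO, metadata_columns: Sequence[str]) -> None:
    """Hypothetical helper: raise ValueError for the first column missing from the header."""
    fieldnames = csv.DictReader(csvfile).fieldnames or []
    for column in metadata_columns:
        if column not in fieldnames:
            raise ValueError(f"Metadata column '{column}' not found in CSV file.")


sample = io.StringIO("HEADER1,HEADER2,HEADER3\ndata1,data2,data3\n")
try:
    check_metadata_columns(sample, ("MISSING_HEADER", "HEADER1"))
except ValueError as err:
    message = str(err)

print(message)  # Metadata column 'MISSING_HEADER' not found in CSV file.
```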

Actual Behavior

A generic RuntimeError is raised, which does not specify that the issue stems from a missing column in the CSV file. This makes it difficult for users to identify the root cause of the problem.
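Because the `RuntimeError` is raised with `from e`, the original exception is still reachable through `__cause__`. The sketch below shows one way to dig it out as a workaround. `root_cause` is a hypothetical helper, and the exception chain is simulated rather than produced by `CSVLoader`.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow explicit `raise ... from ...` links to the innermost exception."""
    current = exc
    while current.__cause__ is not None:
        current = current.__cause__
    return current


# Simulated chain mirroring the traceback in this report.
try:
    try:
        raise ValueError("Metadata column 'MISSING_HEADER' not found in CSV file.")
    except ValueError as inner:
        raise RuntimeError("Error loading ./demo_bug.csv") from inner
except RuntimeError as outer:
    cause = root_cause(outer)

print(f"{type(cause).__name__}: {cause}")
# ValueError: Metadata column 'MISSING_HEADER' not found in CSV file.
```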

Proposed Solution

The error handling in the lazy_load() method should be adjusted to allow more specific exceptions, such as ValueError, to propagate. This will ensure that the appropriate error is raised and presented to the user when metadata columns are missing.


def lazy_load(self) -> Iterator[Document]:
    try:
        with open(self.file_path, newline="", encoding=self.encoding) as csvfile:
            yield from self.__read_file(csvfile)
    except UnicodeDecodeError as e:
        if self.autodetect_encoding:
            detected_encodings = detect_file_encodings(self.file_path)
            for encoding in detected_encodings:
                try:
                    with open(
                        self.file_path, newline="", encoding=encoding.encoding
                    ) as csvfile:
                        yield from self.__read_file(csvfile)
                        break
                except UnicodeDecodeError:
                    continue
        else:
            raise RuntimeError(f"Error loading {self.file_path}") from e
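    except ValueError:
        # Let ValueError (e.g. a missing metadata column) propagate unchanged.
        # A bare `raise` re-raises the active exception with its original traceback.
        raise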
    except Exception as e:
        raise RuntimeError(f"Error loading {self.file_path}") from e
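One detail worth checking in this handler order: `UnicodeDecodeError` is a subclass of `ValueError`, so the `except UnicodeDecodeError` clause has to stay above `except ValueError` for the encoding fallback to keep working. The sketch below mimics only the clause order. `classify` is a hypothetical helper, and the returned labels are illustrative.

```python
def classify(exc: Exception) -> str:
    """Mimic the handler order of the proposed lazy_load() (labels are illustrative)."""
    try:
        raise exc
    except UnicodeDecodeError:
        return "encoding fallback"
    except ValueError:
        return "propagated as ValueError"
    except Exception:
        return "wrapped in RuntimeError"


decode_error = UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")
missing_column = ValueError("Metadata column 'MISSING_HEADER' not found in CSV file.")

print(issubclass(UnicodeDecodeError, ValueError))  # True
print(classify(decode_error))                      # encoding fallback
print(classify(missing_column))                    # propagated as ValueError
print(classify(OSError("disk error")))             # wrapped in RuntimeError
```

If the two clauses were swapped, decoding errors would hit the `ValueError` branch first and the autodetect path would never run.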


m3et commented 6 days ago

FYI, I would like to submit a fix for this.