langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

DirectoryLoader converting characters randomly into new line characters? #23849

Open XariZaru opened 3 months ago

XariZaru commented 3 months ago


Example Code

I use the following code to load my documents.

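# Import path assumed; the original report does not show its imports.
from langchain_community.document_loaders import DirectoryLoader


def load_documents(directory):
    SOURCE_DOCUMENTS_DIR = directory
    SOURCE_DOCUMENTS_FILTER = "**/*.txt"  # every .txt file, recursively

    # No loader_cls is passed, so DirectoryLoader uses its default loader class.
    loader = DirectoryLoader(
        SOURCE_DOCUMENTS_DIR,
        glob=SOURCE_DOCUMENTS_FILTER,
        show_progress=True,
        use_multithreading=True,
    )
    print(f"Loading {SOURCE_DOCUMENTS_DIR} directory: ", end="")
    data = loader.load()
    print(f"Splitting {len(data)} documents")
    return data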

Error Message and Stack Trace (if applicable)

No response

Description

The following is a line from a text document I am loading, as it appears in Notepad:

Document Name: https://www.kinecta.org//about-us/executive-staff

When I load the document with DirectoryLoader (together with a number of other documents) and print doc.page_content, I get the following:

page_content='Document Name: https://www.kinecta.org//about\n\nus/executive\n\nstaff\n\n'

As you can see, each hyphen in the URL has been replaced by a pair of newline characters (\n\n). Any idea what causes this?

The code I use to load my documents is shown under Example Code above.

System Info

Python 3.11, LangChain 0.1.12
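One way to narrow this down is to read the same files verbatim with the standard library and compare hyphen counts against what the loader returns. The sketch below is a diagnostic aid under stated assumptions, not a fix: `read_raw_documents` and `count_lost_hyphens` are hypothetical helper names, and the files are assumed to be UTF-8. If the raw text still contains the hyphens, the change is happening during loading rather than in the file itself. DirectoryLoader parses files with UnstructuredFileLoader unless a different `loader_cls` (for example TextLoader) is passed, so that parsing step is a plausible place to look.

```python
from pathlib import Path


def read_raw_documents(directory, pattern="**/*.txt", encoding="utf-8"):
    """Return {relative path: file text}, read verbatim with no parsing step.

    Hypothetical helper; UTF-8 is an assumption about the source files.
    """
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding=encoding)
        for path in sorted(root.glob(pattern))
        if path.is_file()
    }


def count_lost_hyphens(raw_text, loaded_text):
    """Return how many more '-' characters the raw text has than the loaded text."""
    return raw_text.count("-") - loaded_text.count("-")
```

Applied to the line from the report and the page_content shown above, `count_lost_hyphens` gives 2, one for each hyphen in the URL.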

williambohrmann3 commented 3 weeks ago

We are hitting this exact bug as well: hyphens in links are being converted into newlines. Thank you for opening up this bug!

VaibhavLakshmiS commented 4 days ago

Hi there! We are a group of 3 students from the University of Toronto, and we are very interested in fixing this issue and adding some tests. We will submit a PR for this issue by the end of November. Thank you!