langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

[BUG] Inconsistent results with `RecursiveCharacterTextSplitter`'s `add_start_index=True` #16579

Closed. antoniolanza1996 closed this issue 7 months ago.

antoniolanza1996 commented 7 months ago


Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text1 = """Outokumpu Annual report 2019 | Sustainability review 23 / 24 • For business travel: by estimated driven kilometers with emissions factors for the car, and for flights by CO2 eq. reports of the flight companies. Rental car emissions are included by the rental car company report. Upstream transport was assessed on data of environmental product declaration of 2019 but excluded from scope 3 emissions. The recycled content is calculated as the sum of pre and post consumer scrap related to crude steel production. Additionally, we report on the recycled content including all recycled metals from own treated waste streams entering the melt shop. Energy efficiency is defined as the sum of specific fuel and electricity energy of all processes calculated as energy consumption compared to the product output of that process. It covers all company productions: ferrochrome, melt shop, hot rolling and cold rolling processes. Used heat values and the consumption of energy are taken from supplier's invoices. Water withdrawal is measured for surface water, taken from municipal suppliers and estimated for rainwater amount. Waste is separately reported for mining and stainless production. In mining, amount of non-hazardous tailing sands is reported. For stainless production hazardous and non-hazardous wastes are reported as recycled, recovered and landfilled. Waste treated is counted as landfilled waste. Social responsibility Health and safety figures Health and safety figures reflect the scope of Outokumpu’s operations as they were in 2019. Safety indicators (accidents and preventive safety actions) are expressed per million hours worked (frequency). Safety indicators include Outokumpu employees, persons employed by a third party (contractor) or visitor accidents and preventive safety actions. A workplace accident is the direct result of a work-related activity and it has taken place during working hours at the workplace. 
Accident types • Lost time injury (LTI) is an accident that caused at least one day of sick leave (excluding the day of the injury or accident), as the World Steel Association defines it. One day of sick leave means that the injured person has not been able to return to work on their next scheduled period of working or any future working day if caused by an outcome of the original accident. Lost-day rate is defined as more than one calendar day absence from the day after the accident per million working hours. • Restricted work injury (RWI) does not cause the individual to be absent, but results in that person being restricted in their capabilities so that they are unable to undertake their normal duties. • Medically treated injury (MTI) has to be treated by a medical professional (doctor or nurse). • First aid treated injury (FTI), where the injury did not require medical care and was treated by a person himself/herself or by first aid trained colleague. • Total recordable injury (TRI) includes fatalities, LTIs, RWIs and MTIs, but FTIs are excluded. • All workplace accidents include total recordable injuries (TRI) and first aid treated injuries (FTI) Proactive safety actions Hazards refer to events, situations or actions that could have led to an accident, but where no injury occurred. Safety behavior observations (SBOs) are safety-based discussions between an observer and the person being observed. Other preventive safety action includes proactive measures. Sick-leave hours and absentee rate Sick-leave hours reported are total sick leave hours during a reporting period. Reporting units provide data on absence due to illness, injury and occupational diseases on a monthly basis. The absentee rate (%) includes the actual absentee hours lost expressed as a percentage of total hours scheduled. 
Total personnel costs This figure includes wages, salaries, bonuses, social costs or other personnel expenses, as well as fringe benefits paid and/ or accrued during the reporting period. Training costs Training costs include external training-related expenses such as participation fees. Wages, salaries and daily allowances for participants in training activities are not included, but the salaries of internal trainers are included. Training days per employee The number of days spent by an employee in training when each training day is counted as lasting eight hours. Bonuses A bonus is an additional payment for good performance. These figures are reported without social costs or fringe benefits. Personnel figures Rates are calculated using the total employee numbers at the end of the reporting period. The calculations follow the requirements of GRI Standards. The following calculation has been applied e.g. Hiring rate = New Hires / total number of permanent employees by year-end Average turnover rate = (Turnover + New Hires) / (total number of permanent employees by year-end × 2) Days lost due to strikes The number of days lost due to strikes is calculated by multiplying the number of Outokumpu employees who have been on strike by the number of scheduled working days lost. The day on which a strike starts is included. n Scope of the report"""
text2 = text1 + "a"

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
    add_start_index=True,
)
new_passages = text_splitter.split_documents([Document(page_content=text1)])
for passage in new_passages:
    passage.metadata['end_index'] = passage.metadata['start_index'] + len(passage.page_content)
print([(p.metadata['start_index'], p.metadata['end_index']) for p in new_passages])
>>> [(0, 1022), (1023, 2044), (2045, 3068), (3069, 4087), (4088, 5111), (4412, 4418)]

new_passages = text_splitter.split_documents([Document(page_content=text2)])
for passage in new_passages:
    passage.metadata['end_index'] = passage.metadata['start_index'] + len(passage.page_content)
print([(p.metadata['start_index'], p.metadata['end_index']) for p in new_passages])
>>> [(0, 1022), (1023, 2044), (2045, 3068), (3069, 4087), (4088, 5111), (5112, 5119)]
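The overlap in the text1 output can be spotted mechanically. Here is a small helper (not part of LangChain; the name `find_overlaps` is made up for this sketch) that flags any chunk whose span starts before the previous chunk's span ends:

```python
def find_overlaps(spans):
    """Return pairs of consecutive (start, end) spans where a span
    starts before the previous one ends."""
    return [
        (prev, cur)
        for prev, cur in zip(spans, spans[1:])
        if cur[0] < prev[1]
    ]

# The (start_index, end_index) pairs printed for text1 above:
spans = [(0, 1022), (1023, 2044), (2045, 3068),
         (3069, 4087), (4088, 5111), (4412, 4418)]
print(find_overlaps(spans))  # [((4088, 5111), (4412, 4418))]
```

Running this on the text1 output flags exactly one violation: the last chunk claims to start at 4412, inside the previous chunk's range, while the text2 output (where a single appended character changes the final chunk) is clean.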

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use RecursiveCharacterTextSplitter with add_start_index=True, but I've found some texts for which the start_index is wrong. In the example above, splitting text1 yields a final chunk with start_index 4412 even though the previous chunk already extends to index 5111, so the reported spans overlap; appending a single character (text2) shifts the final chunk to the expected 5112.

System Info

langchain 0.0.334 with python 3.8

dosubot[bot] commented 7 months ago

🤖

Hello @antoniolanza1996! Good to see you again. Thanks for diving deep into LangChain and coming up with such detailed observations. Let's get to the bottom of this!

Thank you for bringing this issue to our attention. The problem you're experiencing with the RecursiveCharacterTextSplitter in LangChain v0.0.334 is due to the way the start_index is calculated in the create_documents method.

In the current implementation, the start_index is calculated using the find method on the original text with the chunk as the parameter. This method returns the first occurrence of the chunk in the text, which can lead to incorrect start_index values if the chunk appears more than once in the text.

Here is the problematic code:

def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = -1
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    index = text.find(chunk, index + 1)
                    metadata["start_index"] = index
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents
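The failure mode of `text.find(chunk, index + 1)` can be reproduced in isolation. Because `index` is the *start* of the previous chunk, the search begins just past that start rather than past the previous chunk's *end*, so a short chunk whose text also occurs inside the previous chunk is matched too early. A minimal sketch with a made-up text:

```python
text = "say hi, say bye. hi"
chunks = ["say hi, say bye.", "hi"]  # what a splitter might produce

index = -1
starts = []
for chunk in chunks:
    # The search resumes from the PREVIOUS chunk's start + 1, so "hi"
    # is matched inside the first chunk (position 4) instead of at its
    # true position, 17.
    index = text.find(chunk, index + 1)
    starts.append(index)

print(starts)  # [0, 4] -- the second chunk actually starts at 17
```

This is exactly the shape of the text1 failure above: the sixth chunk's text also occurs inside the fifth chunk, so it is reported at 4412 instead of 5112.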

To resolve this issue, you can modify the create_documents method to calculate the start_index based on the length of the chunks instead of using the find method. This will ensure that the start_index is correctly calculated for each chunk, even if the chunk appears more than once in the text.

Here is the modified code:

def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = 0
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    metadata["start_index"] = index
                    index += len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

This modification ensures that the start_index is correctly calculated for each chunk, preventing overlaps in the text passages.

Please note that this is a workaround and not an official fix. The LangChain team will need to address this issue in a future release.

I hope this helps! If you have any other questions, feel free to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

antoniolanza1996 commented 7 months ago

Nice catch on the bug: it was indeed the wrong use of find...

However, the provided solution is also wrong: note that the strip_whitespace parameter (enabled by default) strips whitespace from each chunk, so summing chunk lengths no longer gives the correct offsets in the original text.
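Why length-summing breaks under whitespace stripping can be shown with a small sketch (made-up text, plain `str.strip` standing in for the splitter's strip_whitespace behavior): the separator characters between chunks are dropped from the chunks themselves, so the running sum undercounts every subsequent offset.

```python
text = "first chunk.  second chunk."
# Stand-in for strip_whitespace=True: each raw chunk gets stripped.
chunks = [c.strip() for c in ["first chunk. ", " second chunk."]]
# chunks == ["first chunk.", "second chunk."]

index = 0
starts = []
for chunk in chunks:
    # The suggested fix: accumulate chunk lengths as offsets.
    starts.append(index)
    index += len(chunk)

print(starts)                       # [0, 12]
print(text.find("second chunk."))  # 14 -- the true start
```

The stripped separator characters (two spaces here) are invisible to the running sum, so the second chunk is placed two characters too early; with many chunks the error compounds.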

A possible solution has been pushed in PR #16583

antoniolanza1996 commented 7 months ago

PR #16583 has been merged. This issue is fixed now.
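For readers landing here, the shape of a fix that handles both failure modes can be sketched as follows. This is the general idea, not necessarily the exact code merged in PR #16583: keep `str.find` (which is robust to stripped whitespace, since it locates the chunk's actual text), but start each search past the previous chunk's end minus the configured overlap, so an earlier duplicate of the chunk can never be matched.

```python
def start_indices(text, chunks, chunk_overlap=0):
    """Sketch of an overlap-aware start_index computation
    (illustrative helper, not LangChain API)."""
    index = 0
    previous_len = 0
    starts = []
    for chunk in chunks:
        # Search from just past the previous chunk, backing up by the
        # overlap; clamp at 0 for the first chunk.
        offset = index + previous_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        starts.append(index)
        previous_len = len(chunk)
    return starts

# Repeated-substring case from this thread: "hi" also occurs inside
# the first chunk, but is now found at its true position, 17.
print(start_indices("say hi, say bye. hi",
                    ["say hi, say bye.", "hi"]))  # [0, 17]

# Stripped-whitespace case: find still locates the real offset, 14.
print(start_indices("first chunk.  second chunk.",
                    ["first chunk.", "second chunk."]))  # [0, 14]
```

Unlike the length-summing workaround, this stays correct when strip_whitespace removes separator characters, and unlike the original `find(chunk, index + 1)` it cannot match a short chunk inside the body of the previous one.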