Feature Request - API Call Parameters to set chunk minimum and maximum length.

leohpark commented 8 months ago

Hello, looks great so far. Would appreciate the ability to include parameters in the API call that specify both a minimum and maximum character length for the resulting chunks.

Minimum chunk size would walk the resulting chunk objects and concatenate adjacent ones until the combined text length exceeds a given value.
Maximum chunk size would split chunk text into multiple segments while preserving the title/section smart labeling you already do.
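
To make the two behaviours concrete, here is a minimal sketch in plain Python. It does not use the llmsherpa API: the `Chunk` record and the `merge_small` and `split_large` helpers are hypothetical names invented for illustration, and the label handling is only an assumption about what "preserving the title/section labeling" could mean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a parsed chunk: a section label plus its text."""
    label: str
    text: str


def merge_small(chunks: List[Chunk], min_len: int) -> List[Chunk]:
    """Concatenate consecutive chunks until each merged chunk reaches min_len."""
    out, labels, text = [], [], ""
    for c in chunks:
        text = f"{text}\n{c.text}" if text else c.text
        if c.label not in labels:
            labels.append(c.label)
        if len(text) >= min_len:
            out.append(Chunk(" / ".join(labels), text))
            labels, text = [], ""
    if text:
        # A trailing remainder shorter than min_len is kept rather than dropped.
        out.append(Chunk(" / ".join(labels), text))
    return out


def split_large(chunks: List[Chunk], max_len: int) -> List[Chunk]:
    """Split chunks longer than max_len at whitespace, repeating the label on each piece."""
    out = []
    for c in chunks:
        piece = ""
        for word in c.text.split():
            candidate = f"{piece} {word}" if piece else word
            if piece and len(candidate) > max_len:
                out.append(Chunk(c.label, piece))
                piece = word
            else:
                piece = candidate
        if piece:
            out.append(Chunk(c.label, piece))
    return out
```

Two caveats with this sketch: `split_large` normalizes whitespace and leaves a single word longer than `max_len` intact, and running `split_large` after `merge_small` can produce pieces shorter than the minimum again, so the order of the two passes is a design choice.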

ansukla commented 8 months ago

Agreed! Will add it to the list of things and happy to take PRs if anyone wants to pick this up.

amn-max commented 5 months ago

Hi there! I thought you might find the following code snippet helpful. Please note that this is a basic example and may need further testing for edge cases. It aims to demonstrate how you can set both a minimum and a maximum character length for the resulting chunks.

# Set minimum chunk size
min_chunk_size = 200
# Set maximum chunk size
chunk_size = 350

# Initialize empty lists to store the final chunks, accumulated text, and bounding box coordinates
final_chunks = []
texts = ""
cords_dict = []

# Iterate over the document chunks
# (doc is assumed to be the document object returned by llmsherpa's LayoutPDFReader.read_pdf)
for chunk in doc.chunks():
    # Extract the text and bounding box coordinates for the current chunk
    chunk_text = chunk.to_context_text()
    chunk_cords = chunk.block_json

    # Check whether adding the current chunk keeps the accumulated text within the maximum chunk size
    if len(texts) + len(chunk_text) <= chunk_size:
        # If within the limit, append the text and coordinates to the current chunk
        texts += chunk_text
        cords_dict.append(chunk_cords)
    # Otherwise, check if the accumulated text meets or exceeds the minimum chunk size
    elif len(texts) >= min_chunk_size:
        # If yes, create a new chunk with the accumulated text and coordinates
        final_chunks.append({
            'context_text': texts,
            'cords': cords_dict
        })
        # Reset the accumulated text and coordinates for the new chunk
        texts = chunk_text
        cords_dict = [chunk_cords]
    else:
        # The accumulated text is still below min_chunk_size, so keep merging
        # instead of discarding it; the result may exceed chunk_size
        texts += chunk_text
        cords_dict.append(chunk_cords)

# Add whatever remains as the last chunk, even if it is shorter than
# min_chunk_size, so that no text is lost
if texts:
    final_chunks.append({
        'context_text': texts,
        'cords': cords_dict
    })

@ansukla What do you think, will this work, or will it degrade the chunk quality?

nlmatics / llmsherpa

Feature Request - API Call Parameters to set chunk minimum and maximum length. #16