nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Feature Request - API Call Parameters to set chunk minimum and maximum length. #16

Open leohpark opened 8 months ago

leohpark commented 8 months ago

Hello, looks great so far. Would appreciate the ability to include parameters in the API call that specify both a minimum and maximum character length for the resulting chunks.

ansukla commented 8 months ago

Agreed! Will add it to the list of things and happy to take PRs if anyone wants to pick this up.

amn-max commented 5 months ago

Hi there! I thought you might find the following code snippet helpful. Please note that this is a basic example and may need further testing for edge cases. It demonstrates one way to enforce both a minimum and a maximum character length on the resulting chunks.

# Set minimum chunk size
min_chunk_size = 200
# Set maximum chunk size
chunk_size = 350

# Initialize the list of final chunks, the accumulated text (a string), and the list of bounding box coordinates
final_chunks = []
texts = ""
cords_dict = []

# Iterate over the document chunks (doc is assumed to be the Document returned by llmsherpa's LayoutPDFReader.read_pdf)
for chunk in doc.chunks():
    # Extract the text and bounding box coordinates for the current chunk
    chunk_text = chunk.to_context_text()
    chunk_cords = chunk.block_json

    # Check whether the current chunk still fits within the maximum chunk size when added to the accumulated text
    if len(texts) + len(chunk_text) <= chunk_size:
        # If within the limit, append the text and coordinates to the current chunk
        texts += chunk_text
        cords_dict.append(chunk_cords)
    # Check if the accumulated text meets or exceeds the minimum chunk size
    elif len(texts) >= min_chunk_size:
        # If yes, create a new chunk with the accumulated text and coordinates
        final_chunks.append({
            'context_text': texts,
            'cords': cords_dict
        })
        # Reset the accumulated text and coordinates for the new chunk
        texts = chunk_text
        cords_dict = [chunk_cords]
    else:
        # The accumulated text is shorter than min_chunk_size and the current chunk does not fit,
        # so the accumulated text is discarded and accumulation restarts from the current chunk
        texts = chunk_text
        cords_dict = [chunk_cords]

# Add the last chunk if it has content and meets the minimum chunk size
if len(texts) >= min_chunk_size:
    final_chunks.append({
        'context_text': texts,
        'cords': cords_dict
    })

@ansukla What do you think, will this work, or will it degrade the chunk quality?
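
One caveat with the snippet above: its `else` branch throws away accumulated text that is still shorter than the minimum, and a short trailing remainder is dropped as well. Below is a minimal sketch of a variant that never discards text. It is only an illustration, not part of llmsherpa: `merge_chunks` and its parameters are hypothetical names, it takes plain `(text, coords)` pairs so it can run without the library, and it treats the maximum as a soft limit (a chunk may exceed it when that is the only way to reach the minimum).

```python
from typing import Any, Dict, Iterable, List, Tuple


def merge_chunks(
    pieces: Iterable[Tuple[str, Any]],
    min_size: int = 200,
    max_size: int = 350,
    sep: str = "\n",
) -> List[Dict[str, Any]]:
    """Greedily merge (text, coords) pairs without discarding any text.

    Hypothetical helper: max_size is a soft limit and is exceeded whenever
    the buffer has not yet reached min_size.
    """
    merged: List[Dict[str, Any]] = []
    texts: List[str] = []
    cords: List[Any] = []
    length = 0  # length of sep.join(texts)

    def flush() -> None:
        # Emit the buffered pieces as one chunk and reset the buffer.
        nonlocal texts, cords, length
        if texts:
            merged.append({"context_text": sep.join(texts), "cords": cords})
        texts, cords, length = [], [], 0

    for text, coord in pieces:
        added = len(text) + (len(sep) if texts else 0)
        # Start a new chunk only when the buffer is already long enough
        # and the next piece would push it past max_size.
        if texts and length >= min_size and length + added > max_size:
            flush()
            added = len(text)
        texts.append(text)
        cords.append(coord)
        length += added

    # Fold a short remainder into the previous chunk instead of dropping it.
    if texts and length < min_size and merged:
        merged[-1]["context_text"] += sep + sep.join(texts)
        merged[-1]["cords"].extend(cords)
    else:
        flush()
    return merged
```

With llmsherpa this could presumably be called as `merge_chunks((c.to_context_text(), c.block_json) for c in doc.chunks())`, reusing the same two accessors as the snippet above. Whether a soft maximum is acceptable depends on the downstream embedding or context limits; a hard maximum would need to split individual chunks, which this sketch does not attempt.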