Open leohpark opened 8 months ago
Agreed! Will add it to the list of things and happy to take PRs if anyone wants to pick this up.
Hi there! I thought you might find the following code snippet helpful. Please note that this is a basic example and may need further testing for edge cases. It shows how you can set both a minimum and a maximum character length for the resulting chunks.
# Minimum and maximum chunk sizes, in characters
min_chunk_size = 200
chunk_size = 350

# Final chunks, plus the text and bounding-box data accumulated for the chunk in progress
final_chunks = []
texts = ""
cords_dict = []

# Iterate over the document chunks (`doc` is the already-parsed document)
for chunk in doc.chunks():
    # Extract the text and bounding-box data for the current chunk
    chunk_text = chunk.to_context_text()
    chunk_cords = chunk.block_json

    if len(texts) + len(chunk_text) <= chunk_size:
        # Still within the maximum: keep accumulating
        texts += chunk_text
        cords_dict.append(chunk_cords)
    elif len(texts) >= min_chunk_size:
        # Adding this chunk would exceed the maximum and the accumulated text
        # is long enough: emit it, then start a new chunk from the current one
        final_chunks.append({
            'context_text': texts,
            'cords': cords_dict
        })
        texts = chunk_text
        cords_dict = [chunk_cords]
    else:
        # The accumulated text is still below min_chunk_size, so it cannot be
        # emitted yet. Merge the current chunk anyway (going over chunk_size)
        # rather than overwriting the accumulated text, which would drop it.
        texts += chunk_text
        cords_dict.append(chunk_cords)

# Add the last chunk if it meets the minimum chunk size
# (a shorter trailing remainder is dropped)
if len(texts) >= min_chunk_size:
    final_chunks.append({
        'context_text': texts,
        'cords': cords_dict
    })
@ansukla What do you think, will this work, or will it degrade the chunk quality?
Hello, looks great so far. Would appreciate the ability to include parameters in the API call that specify both a minimum and maximum character length for the resulting chunks.
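To make the request concrete, here is a minimal sketch of what a minimum and maximum character length could mean when merging consecutive chunks. It works on plain strings only. The function name `merge_chunks` and its parameters `min_chars` and `max_chars` are hypothetical and are not part of the library's API. It assumes the maximum is a soft limit: a piece is allowed to go over `max_chars` when the alternative would be a piece shorter than `min_chars`.

```python
from typing import List


def merge_chunks(chunks: List[str], min_chars: int, max_chars: int) -> List[str]:
    """Greedily merge consecutive text chunks (hypothetical helper).

    A piece is emitted only once it has at least min_chars characters and
    adding the next chunk would push it past max_chars. No text is dropped.
    """
    merged: List[str] = []
    current = ""
    for text in chunks:
        long_enough = len(current) >= min_chars
        would_overflow = len(current) + len(text) > max_chars
        if current and long_enough and would_overflow:
            merged.append(current)
            current = ""
        current += text
    if current:
        if merged and len(current) < min_chars:
            # Short trailing remainder: fold it into the previous piece
            merged[-1] += current
        else:
            merged.append(current)
    return merged


pieces = merge_chunks(["a" * 150, "b" * 100, "c" * 150, "d" * 50],
                      min_chars=200, max_chars=350)
print([len(p) for p in pieces])  # [250, 200]
```

Folding a short remainder into the previous piece keeps all of the text at the cost of occasionally exceeding the maximum. Whether that trade-off is acceptable depends on how strict the downstream size limit is.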