spring-projects / spring-ai

An Application Framework for AI Engineering
https://docs.spring.io/spring-ai/reference/index.html
Apache License 2.0

The document segmentation vectorization failed. #1499

Open oasis-zhou opened 1 month ago

oasis-zhou commented 1 month ago

The cause is a problem in the batching logic of TokenCountBatchingStrategy, present in version 1.0.0-M2:

public List<List<Document>> batch(List<Document> documents) {
    List<List<Document>> batches = new ArrayList<>();
    int currentSize = 0;
    List<Document> currentBatch = new ArrayList<>();

    int tokenCount;
    for (Iterator var5 = documents.iterator(); var5.hasNext(); currentSize += tokenCount) {
        Document document = (Document) var5.next();
        tokenCount = this.tokenCountEstimator.estimate(
                document.getFormattedContent(this.contentFormater, this.metadataMode));
        if (tokenCount > this.maxInputTokenCount) {
            throw new IllegalArgumentException(
                    "Tokens in a single document exceeds the maximum number of allowed input tokens");
        }

        if (currentSize + tokenCount > this.maxInputTokenCount) {
            batches.add(currentBatch);
            currentBatch.clear();
            currentSize = 0;
        }

        currentBatch.add(document);
    }

    if (!currentBatch.isEmpty()) {
        batches.add(currentBatch);
    }

    return batches;
}

The following block is the incorrect part. Please compare it with how version 1.0.0-SNAPSHOT writes it:

    if (currentSize + tokenCount > this.maxInputTokenCount) {
        batches.add(currentBatch);
        currentBatch.clear();
        currentSize = 0;
    }

markpollack commented 1 month ago

It isn't clear what the error is that you are trying to report. Can you show a code sample and a stack trace please?

oasis-zhou commented 1 month ago

currentBatch.clear() does not start a new batch. Because batches.add(currentBatch) stores a reference to the same list, calling clear() also empties the batch that was just added. Every earlier batch is therefore wiped out, and 'batches' ends up containing only the contents of the last batch.
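To illustrate the aliasing problem outside of Spring AI, here is a minimal, self-contained sketch (the class and helper names are hypothetical, and batching is simplified to a fixed item count rather than token estimation). batchBuggy mirrors the M2 logic with clear(); batchFixed reassigns a fresh list, which is the behavior the reporter points to in 1.0.0-SNAPSHOT:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchAliasingDemo {

    // Buggy variant, mirroring the M2 logic: clear() reuses the same list object.
    static List<List<String>> batchBuggy(List<String> items, int maxPerBatch) {
        List<List<String>> batches = new ArrayList<>();
        List<String> currentBatch = new ArrayList<>();
        for (String item : items) {
            if (currentBatch.size() == maxPerBatch) {
                batches.add(currentBatch);
                currentBatch.clear(); // BUG: also empties the list just added to batches
            }
            currentBatch.add(item);
        }
        if (!currentBatch.isEmpty()) {
            batches.add(currentBatch);
        }
        return batches;
    }

    // Fixed variant: start a fresh list instead of clearing the shared one.
    static List<List<String>> batchFixed(List<String> items, int maxPerBatch) {
        List<List<String>> batches = new ArrayList<>();
        List<String> currentBatch = new ArrayList<>();
        for (String item : items) {
            if (currentBatch.size() == maxPerBatch) {
                batches.add(currentBatch);
                currentBatch = new ArrayList<>(); // earlier batch is left untouched
            }
            currentBatch.add(item);
        }
        if (!currentBatch.isEmpty()) {
            batches.add(currentBatch);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "b", "c", "d", "e");
        System.out.println(batchBuggy(docs, 2)); // prints [[e], [e], [e]]
        System.out.println(batchFixed(docs, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Running the buggy variant shows exactly the reported symptom: all entries in 'batches' alias one list, so only the final batch's contents survive.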