raznem / parsera

Lightweight library for scraping websites with LLMs
https://parsera.org
GNU General Public License v2.0

Context window exceeded #11

Open spinagon opened 2 months ago

spinagon commented 2 months ago

When running with a llama.cpp model (from langchain_community.llms import LlamaCpp) I get:

ValueError: Requested tokens (113654) exceed context window of 4096

I'm not sure what happens with other backends; they probably just trim everything that doesn't fit into the context window.
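A hypothetical pre-flight check (not part of parsera or langchain) illustrates the failure: estimate the prompt's token count before calling a small-context model, so an oversized page can be rejected or chunked up front instead of raising ValueError inside llama.cpp. The 4-characters-per-token ratio is a rough heuristic, not the model's real tokenizer.

```python
# Hypothetical pre-flight check (not parsera's actual code): estimate the
# prompt's token count before calling a small-context model.

CONTEXT_WINDOW = 4096  # the llama.cpp context window from the traceback

def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # A real check should use the model's own tokenizer.
    return len(text) // 4

def fits_in_context(prompt: str, window: int = CONTEXT_WINDOW) -> bool:
    return rough_token_count(prompt) <= window

# A scraped page of ~450k characters (~113k tokens) will not fit:
print(fits_in_context("x" * 454_616))     # False
print(fits_in_context("a short prompt"))  # True
```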

raznem commented 2 months ago

There is no handling of long inputs yet, so small-context models will fail. Long-context models (128k) work well, but chunking and aggregation for small-context models are yet to be implemented.
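The chunking-and-aggregation idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not parsera's implementation: split an oversized page into pieces that fit a small context window, run extraction on each piece, then merge the per-chunk results. The extract function here is a trivial stand-in for the LLM call.

```python
# Hypothetical sketch of chunking and aggregation for small-context models.

def chunk_words(words: list[str], max_words: int) -> list[list[str]]:
    # Split the word list into consecutive chunks of at most max_words words.
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

def extract(chunk: list[str]) -> list[str]:
    # Stand-in for an LLM extraction call; here we just keep words that
    # start with "item" so the sketch stays self-contained.
    return [w for w in chunk if w.startswith("item")]

def extract_with_chunking(page: str, max_words: int) -> list[str]:
    results: list[str] = []
    for chunk in chunk_words(page.split(), max_words):
        results.extend(extract(chunk))  # aggregate per-chunk results
    return results

page = "header " + " ".join(f"item{i}" for i in range(10))
print(extract_with_chunking(page, max_words=3))
# ['item0', 'item1', 'item2', 'item3', 'item4', 'item5', 'item6', 'item7', 'item8', 'item9']
```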

raznem commented 1 week ago

Today v0.1.12 was released with ChunksTabularExtractor used by default. It allows processing pages with smaller-context models by setting the chunk_size and token_counter arguments of Parsera. For details, check the documentation: https://docs.parsera.org/features/extractors/#chunks-tabular-extractor
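A minimal usage sketch of the options above. The chunk_size and token_counter argument names come from this release note; the exact Parsera signature, defaults, and the shape of the elements dict should be checked against the linked documentation, and the whitespace counter is a stand-in for the model's real tokenizer.

```python
# Sketch of configuring chunked extraction in parsera v0.1.12+ (assumed API;
# verify against https://docs.parsera.org/features/extractors/).

def whitespace_token_counter(text: str) -> int:
    # Simple stand-in counter; in practice use your model's own tokenizer
    # so chunk sizes match what llama.cpp actually counts.
    return len(text.split())

try:
    from parsera import Parsera  # requires `pip install parsera`

    scraper = Parsera(
        chunk_size=3000,  # keep each chunk under the model's 4096-token window
        token_counter=whitespace_token_counter,
    )
    result = scraper.run(
        url="https://parsera.org",
        elements={"title": "Title of the page"},
    )
except Exception:
    # parsera not installed, or no model/API key configured in this environment
    result = None
```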