Open · iAMSagar44 opened 4 months ago
You can already load web pages into a vector database using the Tika DocumentReader, but it would be great to have dedicated support for the web scraping use case. For example, it would be useful to be able to customise the loading and transformation/splitting of web pages in an HTML-aware way (similar to what LangChain and LlamaIndex support).
Dependency:
dependencies {
    ...
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}
Example:
public void run() throws MalformedURLException {
    List<Document> documents = new ArrayList<>();

    logger.info("Loading .html files as Documents");
    var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
    var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
    documents.addAll(htmlReader.get());

    logger.info("Creating and storing Embeddings from Documents");
    var textSplitter = new TokenTextSplitter();
    vectorStore.add(textSplitter.split(documents));

    var similarDocuments = vectorStore.similaritySearch(SearchRequest
            .query("Retrieval Augmented Generation")
            .withTopK(3)
            .withSimilarityThreshold(0.75));
    similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
}
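To illustrate what "HTML-aware" loading could mean beyond Tika's generic text extraction, here is a minimal, self-contained sketch: it drops script/style blocks, strips the remaining tags, and chunks the resulting text. The class and method names are invented for illustration, and the regex-based stripping only stands in for a real HTML parser such as jsoup; a token-based splitter would be used in practice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of HTML-aware extraction and splitting. Not Spring AI API.
public class HtmlTextSplitterSketch {

    // Drop <script>/<style> blocks and all remaining tags, then collapse whitespace.
    public static String htmlToText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Naive fixed-size character chunking; a token-aware splitter would be used in practice.
    public static List<String> split(String text, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            chunks.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Concepts</h1><p>Retrieval Augmented Generation.</p></body></html>";
        System.out.println(htmlToText(html)); // Concepts Retrieval Augmented Generation.
        System.out.println(split(htmlToText(html), 10));
    }
}
```

A dedicated reader could go further than this sketch, e.g. keeping heading boundaries as chunk boundaries instead of splitting at fixed offsets.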
There are some commons-compress version incompatibilities. I had to exclude it and pin a newer version as follows:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.26.1</version>
</dependency>
We could include these changes in the pom.
How much more dedicated support over Tika is expected? The sample code reads well to me.
Is there a feature in the pipeline to support web scraping functionality, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)? The idea is basically to load HTML pages from a web URL and transform them to text, before chunking and indexing them into a Vector Store.
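The load → transform → chunk → index pipeline described above can be sketched end to end in plain Java. Everything here is hypothetical: `WebScrapingPipelineSketch`, `PageDocument`, and the keyword-overlap "index" are invented stand-ins (fetching is stubbed with a supplied HTML string, and word overlap stands in for embedding similarity), not Spring AI or LangChain API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical end-to-end sketch of a web scraping pipeline: HTML -> text -> chunks -> index.
public class WebScrapingPipelineSketch {

    record PageDocument(String url, String text) {}

    // Transform raw HTML to text (tag stripping stands in for a real HTML parser).
    static PageDocument read(String url, String html) {
        String text = html.replaceAll("(?s)<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        return new PageDocument(url, text);
    }

    // Chunk the page text before indexing.
    static List<String> chunk(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(text.length(), i + size)));
        }
        return chunks;
    }

    // Toy "similarity search": rank chunks by shared lowercase words with the query.
    static String bestMatch(List<String> chunks, String query) {
        Set<String> queryWords = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        return chunks.stream()
                .max(Comparator.comparingLong((String c) ->
                        Arrays.stream(c.toLowerCase().split("\\W+"))
                                .filter(queryWords::contains).count()))
                .orElse("");
    }

    public static void main(String[] args) {
        String html = "<body><p>Models map text to embeddings.</p>"
                + "<p>Retrieval Augmented Generation grounds answers in documents.</p></body>";
        PageDocument doc = read("https://example.org/concepts.html", html);
        List<String> chunks = chunk(doc.text(), 40);
        System.out.println(bestMatch(chunks, "Retrieval Augmented Generation"));
    }
}
```

In a real implementation the read step would fetch the URL and parse the DOM, and `bestMatch` would be a vector store similarity search over embeddings.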