spring-projects / spring-ai

An Application Framework for AI Engineering
https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/index.html
Apache License 2.0

Web scraping ETL #752

Open iAMSagar44 opened 4 months ago

iAMSagar44 commented 4 months ago

Is there a feature in the pipeline to support web scraping, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)?

The idea is to load HTML pages from a web URL and transform them to text, before chunking them and indexing them in a Vector Store.
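
For illustration, the "HTML to text, then chunk" steps described above can be sketched with the Java standard library alone. This is a minimal sketch under stated assumptions, not Spring AI or LangChain code: the class name `HtmlToChunksSketch` and both helper methods are hypothetical, the regex-based tag stripping is deliberately naive, and fetching the page and writing to a vector store are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Spring AI.
class HtmlToChunksSketch {

    // Naive HTML-to-text: drop <script>/<style> blocks, strip the remaining
    // tags, then collapse runs of whitespace into single spaces.
    static String htmlToText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Split text into chunks of at most maxWords whitespace-separated words.
    static List<String> chunk(String text, int maxWords) {
        List<String> chunks = new ArrayList<>();
        if (text.isEmpty()) {
            return chunks;
        }
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i += maxWords) {
            int end = Math.min(i + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Models</h1><p>AI models process and generate information.</p></body></html>";
        // Each printed line is one chunk that could then be embedded and indexed.
        chunk(htmlToText(html), 4).forEach(System.out::println);
    }
}
```

A real pipeline would likely use a proper HTML parser and a token-based splitter rather than regexes and word counts; the sketch only shows the shape of the two steps.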

ThomasVitale commented 4 months ago

You can already load web pages into a vector database using the Tika DocumentReader, but it would be great to have dedicated support for the web scraping use case. For example, it would be useful to be able to customise the loading and transformation/splitting of web pages in an HTML-aware way (similar to what LangChain and LlamaIndex support).

Dependency:

dependencies {
    ...
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example:

public void run() throws MalformedURLException {
    List<Document> documents = new ArrayList<>();

    logger.info("Loading .html files as Documents");
    var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
    var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
    documents.addAll(htmlReader.get());

    logger.info("Creating and storing Embeddings from Documents");
    var textSplitter = new TokenTextSplitter();
    vectorStore.add(textSplitter.split(documents));

    var similarDocuments = vectorStore.similaritySearch(SearchRequest
            .query("Retrieval Augmented Generation")
            .withTopK(3)
            .withSimilarityThreshold(0.75));
    similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
}

sivaprasadreddy commented 3 months ago

There are some commons-compress version incompatibilities. I had to exclude and configure it as follows:

 <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-tika-document-reader</artifactId>
      <exclusions>
          <exclusion>
              <groupId>org.apache.commons</groupId>
              <artifactId>commons-compress</artifactId>
          </exclusion>
      </exclusions>
  </dependency>
  <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-compress</artifactId>
      <version>1.26.1</version>
  </dependency>

markpollack commented 2 months ago

We can include these changes to the pom.

How much more dedicated support over Tika is expected? The sample code reads well to me.
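
As one possible reading of the "HTML-aware" splitting mentioned earlier in the thread, a splitter could cut a page at heading boundaries and keep each heading alongside its text, instead of splitting the flattened text purely by token count. The following is a rough sketch of that idea and not an existing Spring AI API: `HeadingAwareSplitterSketch`, `Section`, and `split` are hypothetical names, the regex handling of HTML is simplistic, and Java 17 is assumed for the `record`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of heading-based splitting; not part of Spring AI.
class HeadingAwareSplitterSketch {

    // One chunk: the heading it appeared under plus the plain text below it.
    record Section(String heading, String text) {}

    // Matches <h1>..</h1> through <h6>..</h6>, with optional attributes.
    private static final Pattern HEADING =
            Pattern.compile("(?is)<h([1-6])[^>]*>(.*?)</h\\1>");

    // Strip tags and collapse whitespace (naive, for illustration).
    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    // Walk the headings in document order; the markup between one heading
    // and the next becomes the text of the section named by the first.
    static List<Section> split(String html) {
        List<Section> sections = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        String currentHeading = "";
        int bodyStart = 0;
        while (m.find()) {
            addIfNotEmpty(sections, currentHeading, html.substring(bodyStart, m.start()));
            currentHeading = stripTags(m.group(2));
            bodyStart = m.end();
        }
        addIfNotEmpty(sections, currentHeading, html.substring(bodyStart));
        return sections;
    }

    private static void addIfNotEmpty(List<Section> out, String heading, String bodyHtml) {
        String text = stripTags(bodyHtml);
        if (!text.isEmpty()) {
            out.add(new Section(heading, text));
        }
    }

    public static void main(String[] args) {
        String html = "<h1>Intro</h1><p>Hello</p><h2 id=\"x\">Models</h2><p>AI models</p>";
        split(html).forEach(s -> System.out.println(s.heading() + " -> " + s.text()));
    }
}
```

Keeping the heading with each section means it could later be stored as document metadata, which plain text extraction followed by token splitting does not preserve.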