platonai / PulsarRPAPro

PulsarRPA Pro Edition: Empower Your Workflows with AI-Driven Web Data Extraction.

PulsarRPAPro README

Auto Extraction Result Snapshot

PulsarRPAPro is the professional version of PulsarRPA, featuring an upgraded server, a collection of top e-commerce site scraping examples, and an advanced AI-powered applet for automatic data extraction.

Never write another web scraper. PulsarRPAPro learns from the website and delivers web data completely and accurately at scale.

The project already includes dozens of scraping examples for the most popular websites, and more are added regularly.

Videos

YouTube: Watch the video

Bilibili: https://www.bilibili.com/video/BV1kM2rYrEFC

Features

System Requirements

Download & Run

Download the latest executable jar:

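# download the latest executable jar
wget http://static.platonic.fun/repo/ai/platon/exotic/PulsarRPAPro.jar
# start MongoDB in the background (-d) so the following commands can run in the same shell
docker-compose -f docker/docker-compose.yaml up -d
# print the help message and usage examples
java -jar PulsarRPAPro.jar
# run auto extraction on a portal page
java -jar PulsarRPAPro.jar harvest "https://www.amazon.com/b?node=1292115011" -diagnose -refresh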

Build from Source

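Maven 3.8.1 and later block plain-HTTP repositories by default. To override that blocker, add the following lines inside the <settings> element of your ~/.m2/settings.xml: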

<mirrors>
    <mirror>
        <id>maven-default-http-blocker</id>
        <mirrorOf>dummy</mirrorOf>
        <name>Dummy mirror to override default blocking mirror that blocks http</name>
        <url>http://0.0.0.0/</url>
    </mirror>
</mirrors>

Then clone the repository and build it:

git clone https://github.com/platonai/PulsarRPAPro.git
cd PulsarRPAPro
./mvnw clean && ./mvnw
cd PulsarRPAPro/target/

# Don't forget to start MongoDB
docker-compose -f docker/docker-compose.yaml up

Developers in China are strongly advised to follow this guide to speed up the build process.

Run the Standalone Server and Open Web Console

java -jar PulsarRPAPro.jar serve

If PulsarRPAPro is running in GUI mode, the web console should open within a few seconds, or you can open it manually at:

http://localhost:2718/exotic/crawl/
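
When the server is started from a script rather than by hand, it can help to wait until the web console actually responds before opening it or sending requests. The following is a minimal sketch, not part of PulsarRPAPro: `wait_until` is a hypothetical helper name, and the retry count and delay are arbitrary choices. It only assumes a POSIX shell, plus `curl` for the commented usage example.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
# (hypothetical helper, not shipped with PulsarRPAPro)
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Success: stop retrying immediately.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep only if another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (commented out so that defining the helper has no side effects):
# start the server in the background, then wait up to about 60 seconds
# for the console URL given above to respond.
#   java -jar PulsarRPAPro.jar serve &
#   wait_until 60 1 curl -fsS -o /dev/null http://localhost:2718/exotic/crawl/ \
#       && echo "web console is up"
```

The helper takes the probe as an ordinary command, so the same function can wait for MongoDB or any other dependency by swapping in a different check.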

Run Auto Extraction

You can use the harvest command to learn from a set of item pages using unsupervised machine learning.

java -jar PulsarRPAPro.jar harvest "https://www.amazon.com/b?node=1292115011" -diagnose -refresh

The URL in the command should be a portal URL, such as a product listing page URL.

PulsarRPAPro will visit the portal, identify the optimal set of links for item pages, retrieve those pages, and analyze them.
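
To run auto extraction against several portals in one go, the harvest command can be wrapped in a small loop. This is only a sketch under stated assumptions: `harvest_cmd`, `harvest_all`, `PULSAR_JAR`, `DRY_RUN`, and the one-URL-per-line file format are all hypothetical conveniences introduced here, not features of PulsarRPAPro. The only PulsarRPAPro-specific part is the `harvest` invocation, copied from the example above with the same flags.

```shell
# Jar location; override with PULSAR_JAR if the jar lives elsewhere
# (hypothetical variable, not read by PulsarRPAPro itself).
PULSAR_JAR="${PULSAR_JAR:-PulsarRPAPro.jar}"

# Print the harvest command for one portal URL,
# using the same flags as the example in this README.
harvest_cmd() {
    printf 'java -jar %s harvest "%s" -diagnose -refresh\n' "$PULSAR_JAR" "$1"
}

# Harvest every portal URL listed in a file, one URL per line.
# Blank lines and lines starting with # are skipped.
# With DRY_RUN=1 the commands are printed instead of executed.
harvest_all() {
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            harvest_cmd "$url"
        else
            # Redirect stdin so java cannot consume the rest of the URL list.
            java -jar "$PULSAR_JAR" harvest "$url" -diagnose -refresh < /dev/null
        fi
    done < "$1"
}
```

With a file `portals.txt` containing one portal URL per line, `DRY_RUN=1 harvest_all portals.txt` previews the commands and `harvest_all portals.txt` runs them one after another.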

Here is the full page of the auto extraction result in HTML format:

Auto Extraction Result of Amazon

Explore the PulsarRPAPro Executable Jar

Run the executable jar directly for help and to explore more features:

java -jar PulsarRPAPro.jar

This command will print the help message and some of the most useful examples.

Q&A

Q: How do I use proxies?

A: Follow this guide for proxy rotation.