platonai / PulsarRPAPro

PulsarRPA Pro Edition: Empower Your Workflows with AI-Driven Web Data Extraction.

Collecting to Output of Crawling by GUI or CLI #8

Closed crawlersgonnacrawl closed 2 years ago

crawlersgonnacrawl commented 2 years ago

The project is really promising, thanks for the hard work! I have finally managed to run the app.

My main interest is just testing your auto-parse feature, as shown in the demo on your website: http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuYW1hem9uLmNvbS9CZXN0LVNlbGxlcnMtQXV0b21vdGl2ZS96Z2JzL2F1dG9tb3RpdmUvcmVmPXpnX2JzX25hdl8w

I tried to create something for a demo site, but the GUI asks me to provide SQL parsing rules that include a selector, whereas I just need harvest mode:

[Screenshot: CleanShot 2022-08-09 at 16 41 00@2x]

I have tried to use this rule:

select * from harvest('https://ifconfig.me');

The project is created, but it is stuck on the status screen as running:

[Screenshot: CleanShot 2022-08-09 at 16 42 46@2x]

Then I have tried to run from CLI:

java -jar exotic-standalone.jar harvest https://ifconfig.me

The program completes successfully, but I never get any output from the CLI, and I can't see any data in the GUI.

How can I create a report like the one in the demo on your website? I can't code in Kotlin yet; I am just using bash and the GUI to run harvest mode, but could not get any results.

platonai commented 2 years ago

Try some e-commerce site and run:

java -jar exotic-standalone.jar harvest a-product-list-url-of-your-e-comm-website

The URL in the command above should be a portal URL, for example, the URL of a product list page. Exotic visits the portal URL, finds the best set of out-links to item pages, fetches the item pages, and then learns from them.

crawlersgonnacrawl commented 2 years ago

I tried running this; it ran for 30-40 seconds and then the program exited. During that time htop was full of processes, so I could see that it was working.

root@exotic-test:~# java -jar exotic-standalone.jar harvest https://www.trendyol.com/apple-cep-telefonu-x-b101470-c103498

How can I see the result? Where is it stored?

Here is the public link to my GUI: http://5.161.58.104:2718/exotic/crawl/ (I'll delete this later)

platonai commented 2 years ago

> How can I see the result? Where is it stored?

Once the system successfully completes the task, a web page opens automatically to show the harvest result.

crawlersgonnacrawl commented 2 years ago

Unfortunately it does not, as this is a remote machine. Is there any chance the CLI could return a link to the result instead? Otherwise, I can't run it from a remote machine.

platonai commented 2 years ago

There are three ways to run harvest and check the results:

  1. Run the command in the CLI; the results are written to files in three different formats:

    java -jar exotic-standalone.jar harvest https://www.trendyol.com/apple-cep-telefonu-x-b101470-c103498
    less "/tmp/pulsar-$USER/report/harvest/corpus/last-page-tables.json"
  2. Run X-SQL in the CLI; the results are returned in tabular form:

    java -jar exotic-standalone.jar sql "select * from harvest('https://www.trendyol.com/apple-cep-telefonu-x-b101470-c103498')"
  3. Access the REST API with X-SQL; the results are returned in JSON form:

    curl -X POST --location "http://5.161.58.104:2718/exotic/x/e" -H "Content-Type: text/plain" -d "
    select * from harvest('https://www.trendyol.com/apple-cep-telefonu-x-b101470-c103498')
    "
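
For a remote machine, the third option can be wrapped in a small script so the portal URL is the only thing that changes between runs. This is a minimal sketch, not part of Exotic: the function names `harvest_sql` and `harvest_remote` and the `EXOTIC_BASE` variable are hypothetical, and `localhost` is an assumed default; only the port, the `/exotic/x/e` path, and the `select * from harvest(...)` query come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and defaults are illustrative only.

# Base URL of a running Exotic server. The port and path are taken from the
# thread; localhost is an assumption for a server on the same machine.
EXOTIC_BASE="${EXOTIC_BASE:-http://localhost:2718/exotic}"

# Build the X-SQL harvest query for a portal URL.
# Note: the URL is inserted as-is, so it must not contain a single quote.
harvest_sql() {
  printf "select * from harvest('%s')" "$1"
}

# POST the query to the X-SQL endpoint and print the response body.
harvest_remote() {
  curl -sS -X POST --location "$EXOTIC_BASE/x/e" \
    -H "Content-Type: text/plain" \
    -d "$(harvest_sql "$1")"
}
```

After sourcing the script, something like `harvest_remote 'https://www.trendyol.com/apple-cep-telefonu-x-b101470-c103498' > result.json` should save the response to a file that can be inspected over SSH, without needing a browser on the remote machine.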