platonai / PulsarRPAPro

PulsarRPA Pro Edition: Empower Your Workflows with AI-Driven Web Data Extraction.

Running Auto Extract for Bunch of URLs in List #9

Closed crawlersgonnacrawl closed 2 years ago

crawlersgonnacrawl commented 2 years ago

We need to run Exotic on HTML files that we have in our database. They are search results pages from a popular search engine. We want to use Exotic's auto-parse capability, but the only way to trigger a harvest seems to be supplying a portal/source link and discovering the items through a selector.

Is there any programmatic way to run Exotic's Auto Extract/Parse feature on our own HTML files instead of supplying a source?

platonai commented 2 years ago

Not in this version. Currently, Exotic's machine learning algorithm relies on features collected by rendering the whole page in a real browser, so it can learn only from web pages that Exotic itself has collected.

Depending on your requirements, we recommend the following paper:

  1. WebFormer for Web data extraction: https://dl.acm.org/doi/pdf/10.1145/3485447.3512032

platonai commented 2 years ago

> discovering the items through selector.

No, Exotic's auto-extract algorithm does not use any kind of selector. For a corpus of web pages, Exotic encodes every DOM node into a feature vector and then runs a semi-k-means clustering algorithm on the data.
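
To make the idea above concrete, here is a minimal Python sketch of "one feature vector per DOM node, then cluster". It is not Exotic's implementation: the three features (depth, child count, text length), the helper names, and the sample HTML are all hypothetical, and plain k-means stands in for the "semi-kmean" step, whose details are not given in this thread. It also works on static HTML only, whereas Exotic's real features come from a rendered page.

```python
from html.parser import HTMLParser
import random

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeFeatureParser(HTMLParser):
    """Records toy features for every element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # indices into self.nodes for currently open elements
        self.nodes = []   # one dict per element: tag, depth, children, text_len

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.nodes[self.stack[-1]]["children"] += 1
        self.nodes.append({"tag": tag, "depth": len(self.stack),
                           "children": 0, "text_len": 0})
        if tag not in VOID_TAGS:
            self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop up to and including the match.
        if any(self.nodes[i]["tag"] == tag for i in self.stack):
            while self.nodes[self.stack.pop()]["tag"] != tag:
                pass

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self.stack:
            self.nodes[self.stack[-1]]["text_len"] += len(data.strip())


def extract_nodes(html):
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def to_vector(node):
    return [float(node["depth"]), float(node["children"]), float(node["text_len"])]


def nearest(point, centroids):
    # Index of the centroid with the smallest squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (a stand-in, not Exotic's algorithm); returns one label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points]


SAMPLE_HTML = (
    "<html><body>\n<ul>\n"
    "<li><a href='/a'>Alpha product</a><span>10</span></li>\n"
    "<li><a href='/b'>Bravo product</a><span>20</span></li>\n"
    "<li><a href='/c'>Gamma product</a><span>30</span></li>\n"
    "</ul>\n</body></html>"
)

nodes = extract_nodes(SAMPLE_HTML)
vectors = [to_vector(n) for n in nodes]
labels = kmeans(vectors, k=3)
for node, label in zip(nodes, labels):
    print(label, node["tag"], to_vector(node))
```

Nodes that play the same structural role (here, the three `li` items, or the three `a` links) produce identical vectors and therefore always land in the same cluster, with no selector involved. A realistic setup would need far richer features and feature scaling before clustering.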