Closed crawlersgonnacrawl closed 2 years ago
Not in this version. Currently, Exotic's machine learning algorithm relies on features collected by rendering the whole page in a real browser, so it can only learn from web pages that Exotic itself has collected.
Depending on your requirements, we recommend the following paper:
discovering the items through selector.
No, Exotic's auto extract algorithm does not use any kind of selector. For a corpus of web pages, Exotic encodes every DOM node into a feature vector and then runs a semi-kmean algorithm on the resulting data.
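To illustrate the general idea of "encode every DOM node as a feature vector, then cluster", here is a minimal Python sketch. It is not Exotic's implementation: the three features (depth, child count, text length) are made up for illustration and come from static HTML rather than a rendered page, plain k-means stands in for the semi-kmean step, and the names `NodeFeatureCollector`, `encode_nodes` and `kmeans` are hypothetical.

```python
import random
from html.parser import HTMLParser


class NodeFeatureCollector(HTMLParser):
    """Records depth, child count and text length for every element."""

    # Elements that never have a closing tag (assumed subset, for illustration).
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []  # indices (into self.nodes) of currently open elements
        self.nodes = []  # one record per element, in document order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.nodes[self.stack[-1]]["children"] += 1
        self.nodes.append(
            {"tag": tag, "depth": len(self.stack), "children": 0, "text_len": 0}
        )
        if tag not in self.VOID:
            self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise close up to the matching element.
        if tag not in [self.nodes[i]["tag"] for i in self.stack]:
            return
        while self.stack:
            if self.nodes[self.stack.pop()]["tag"] == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.nodes[self.stack[-1]]["text_len"] += len(data.strip())


def encode_nodes(html):
    """Return a list of (tag, feature_vector) pairs, one per element."""
    collector = NodeFeatureCollector()
    collector.feed(html)
    collector.close()
    return [
        (n["tag"], [float(n["depth"]), float(n["children"]), float(n["text_len"])])
        for n in collector.nodes
    ]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (a stand-in, not Exotic's semi-kmean); returns one label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


page = "<ul><li>first result</li><li>other result</li></ul>"
encoded = encode_nodes(page)
labels = kmeans([vec for _, vec in encoded], k=2)
```

Nodes with the same structure, such as the repeated `li` items of a result list, get identical vectors and therefore land in the same cluster, which is why no selector is needed to find them. A real system would use far richer features; per the reply above, Exotic's come from rendering the page in a browser.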
We need to run Exotic on HTML files that we already have in our database. They are search result pages from a popular search engine. We want to use Exotic's auto-parse capability, but the only way to trigger harvest seems to be supplying a portal/source link and discovering the items through a selector. Is there any programmatic way to run Exotic's Auto Extract/Parse feature on our own HTML files instead of supplying a source?