suaviloquence / scrapelect

Declarative web scraping DSL with CSS-inspired syntax
https://suaviloquence.github.io/scrapelect/
Apache License 2.0

scrapelect

scrapelect is a web scraping language inspired by CSS that turns a web page into structured JSON data. Select elements with CSS selectors, apply filters to extract and modify the data you want from a web page, and get the output in a structured, machine-readable, interoperable format.

installation

Install the Rust toolchain, then use cargo to install the scrapelect interpreter:

$ cargo install scrapelect

usage

Write a scrapelect program in a .scrp file. The language is documented in the scrapelect book.

A quick example, title.scrp, retrieves the title of a Wikipedia article:

title: .mw-page-title-main {
  content: $element | text();
};

Run the program with the URL of the web page to scrape:

$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"

It will output:

{
  "title": {
    "content": "Cat"
  }
}
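
Because the output is plain JSON, it can be consumed from any language. The following is a minimal Python sketch, not part of scrapelect itself: the helper names run_scrapelect and extract_title are hypothetical, and it assumes that the scrapelect binary is on your PATH and that it writes the JSON result to standard output.

```python
import json
import subprocess


def run_scrapelect(program_path: str, url: str) -> str:
    """Run the scrapelect CLI and return its standard output as text.

    Assumption: the JSON result is printed to stdout, and the command
    exits with a non-zero status on failure.
    """
    result = subprocess.run(
        ["scrapelect", program_path, url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def extract_title(output: str) -> str:
    """Parse the JSON produced by title.scrp and return the title text."""
    data = json.loads(output)
    return data["title"]["content"]


if __name__ == "__main__":
    raw = run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat")
    print(extract_title(raw))
```

Keeping the parsing step separate from the subprocess call makes it possible to test the JSON handling on saved output without running the interpreter or hitting the network.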

documentation

community

license

scrapelect is available under the MIT or Apache 2.0 license, at your option. Copies of both licenses are included as LICENSE-MIT and LICENSE-APACHE in the root directory of the repository.

scrapelect: scrape + select, also -lect