utkarshkukreti / select.rs

A Rust library to extract useful data from HTML documents, suitable for web scraping.
MIT License

Use actual css selectors as strings? #45

Open keatinge opened 6 years ago

keatinge commented 6 years ago

In BeautifulSoup, an HTML parsing library written in Python, there's a method called .select(css_selector_str). It's incredibly useful for HTML parsing if you already know CSS selectors. For example, to print the question titles on Stack Overflow:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://stackoverflow.com/questions/tagged/rust?sort=votes&pageSize=50").text
soup = BeautifulSoup(html, "html.parser")

titles_elements = soup.select("div#questions div.summary > h3 > a")
title_text = [el.text for el in titles_elements]
print(title_text)

This prints:

["What are the differences between Rust's String and str?", 'Why are explicit lifetimes needed in Rust?', "Why doesn't println! work in Rust unit tests?", 'How to access command line parameters?', 'How do I print the type of a variable in Rust?' ... (and many more)

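The equivalent with select.rs right now would be something like:

// The Predicate trait must be in scope for .descendant() and .child().
use select::predicate::{And, Attr, Class, Name, Predicate};

let iterator = doc.find(
    And(Name("div"), Attr("id", "questions"))
        .descendant(And(Name("div"), Class("summary")))
        .child(Name("h3"))
        .child(Name("a")),
);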

Would you be open to accepting CSS selectors as strings, or is that outside the scope of this library?
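To give a rough idea of what string support would involve, here is a minimal, std-only sketch that splits a selector such as "div#questions div.summary > h3 > a" into the same chain of steps the predicate version spells out by hand. Everything in it (Combinator, Compound, Step, parse_compound, parse_selector) is a hypothetical name made up for illustration, not part of select.rs. It only assumes tag names, #id, .class, and whitespace-separated descendant / ">" child combinators; attribute selectors, pseudo-classes, and ">" written without surrounding spaces are not handled.

```rust
/// How a compound selector relates to the one before it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Combinator {
    /// First compound in the chain; matched anywhere in the document.
    Root,
    /// Whitespace: any descendant of the previous match.
    Descendant,
    /// `>`: a direct child of the previous match.
    Child,
}

/// One compound selector such as `div#questions` or `div.summary`.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct Compound {
    name: Option<String>,
    id: Option<String>,
    classes: Vec<String>,
}

/// A compound selector together with the combinator that precedes it.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Step {
    combinator: Combinator,
    compound: Compound,
}

/// Parses a single token like `div#questions.foo` (hypothetical helper).
fn parse_compound(token: &str) -> Option<Compound> {
    // Simplification: `>` must be separated by spaces, so reject it here.
    if token.is_empty() || token.contains('>') {
        return None;
    }
    let is_marker = |c: char| c == '#' || c == '.';
    let mut compound = Compound::default();

    // An optional tag name runs up to the first `#` or `.`.
    let end = token.find(is_marker).unwrap_or(token.len());
    if end > 0 {
        compound.name = Some(token[..end].to_string());
    }

    // The remainder is a sequence of `#id` and `.class` pieces.
    let mut rest = &token[end..];
    while !rest.is_empty() {
        let marker = rest.chars().next()?;
        let body = &rest[1..]; // `#` and `.` are one byte each
        let end = body.find(is_marker).unwrap_or(body.len());
        if end == 0 {
            return None; // a bare `#` or `.` with no name after it
        }
        let value = body[..end].to_string();
        match marker {
            '#' => compound.id = Some(value),
            '.' => compound.classes.push(value),
            _ => return None,
        }
        rest = &body[end..];
    }
    Some(compound)
}

/// Parses a whitespace-separated selector into a chain of steps.
fn parse_selector(selector: &str) -> Option<Vec<Step>> {
    let mut steps = Vec::new();
    let mut pending = Combinator::Root;
    for token in selector.split_whitespace() {
        if token == ">" {
            // `>` needs a compound on its left and cannot be doubled.
            if steps.is_empty() || pending == Combinator::Child {
                return None;
            }
            pending = Combinator::Child;
            continue;
        }
        let compound = parse_compound(token)?;
        steps.push(Step { combinator: pending, compound });
        // Plain whitespace before the next compound means "descendant".
        pending = Combinator::Descendant;
    }
    // Reject empty input and a trailing `>`.
    if steps.is_empty() || pending == Combinator::Child {
        return None;
    }
    Some(steps)
}

fn main() {
    let steps = parse_selector("div#questions div.summary > h3 > a")
        .expect("selector uses syntax this sketch does not support");
    for step in &steps {
        println!("{:?} {:?}", step.combinator, step.compound);
    }
}
```

Each Step lines up with one link in the chain above: a Root or Descendant step corresponds to find / .descendant(...), and a Child step to .child(...). Turning these steps into actual predicates would still need some form of dynamic dispatch, since the concrete predicate type depends on the selector string.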

blakehawkins commented 6 years ago

My understanding is that you should use the scraper crate for this; see https://medium.com/@kadek/web-scraping-in-rust-881b534a60f7