pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add read_html and some related methods #13063

Open abstractqqq opened 10 months ago

abstractqqq commented 10 months ago

Description

It is a common scenario to want to read a table directly from a webpage, e.g. from a Wikipedia article. Pandas already supports this; see https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
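To make the request concrete, here is a minimal sketch of the core step, written with only the Python standard library. It is not how pandas or Polars implement anything; `TableParser` and `read_html_tables` are hypothetical names, the sample HTML is made up, and nested tables, `colspan`/`rowspan`, and fetching the page are all ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every <table> in a document, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def read_html_tables(html):
    """Hypothetical helper: return all tables found in `html` as lists of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample input.
HTML = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

tables = read_html_tables(HTML)
header, *rows = tables[0]
# Treat the first row as the header and pivot the rest into columns.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}
print(columns)
```

Every cell comes back as a string, so a real `read_html` would still need type inference on top of this. A dict of columns like this one is a shape that `pl.DataFrame` accepts directly.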

Since this is a Polars feature request, I think a lazy version would be desired too.

And since we are parsing tables from the internet, I think we might as well add Unicode normalization support via the unicode_normalization crate. This article explains what it is and why we need it: https://pbpython.com/pandas-html-table.html
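A small illustration of the problem, using Python's standard `unicodedata` module rather than the Rust crate: a header scraped from a page can contain a non-breaking space (U+00A0) that looks identical to a regular space but does not compare equal to one. `normalize_cell` is a hypothetical helper name, the choice of NFKC is an assumption for this sketch, and the header text is made up.

```python
import unicodedata


def normalize_cell(text, form="NFKC"):
    """Hypothetical helper: apply a Unicode normalization form, then trim."""
    return unicodedata.normalize(form, text).strip()


# Made-up header containing a non-breaking space (U+00A0).
raw_header = "Total\xa0sales"
clean_header = normalize_cell(raw_header)

print(raw_header == "Total sales")    # False: U+00A0 is not U+0020
print(clean_header == "Total sales")  # True: NFKC maps U+00A0 to a plain space
```

Compatibility forms such as NFKC and NFKD also rewrite other characters (for example superscript digits), so the form should probably be a user-facing option rather than always applied.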

As a side note, I realized we might also add support for parsing data in XML format. I know that some big legacy data providers are indeed still using XML to send tabular data. Here is a Rust crate that might be useful: https://docs.rs/xmlparser/latest/xmlparser/
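For the XML case, a rough sketch of what "tabular XML" usually means, again using only the Python standard library instead of the xmlparser crate. It assumes one repeated element per record with one child element per field; `xml_to_columns` and the sample document are hypothetical, and attributes, namespaces, and nested records are not handled.

```python
import xml.etree.ElementTree as ET

# Made-up sample input: one <record> per row, one child element per field.
XML = """<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>"""


def xml_to_columns(xml_text, row_tag):
    """Hypothetical helper: turn repeated `row_tag` elements into columns."""
    root = ET.fromstring(xml_text)
    rows = [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]
    # Keep field names in first-seen order across all rows.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # Fields missing from a row become None.
    return {name: [row.get(name) for row in rows] for name in names}


xml_columns = xml_to_columns(XML, "record")
print(xml_columns)
```

The caller has to say which element is a row (`row_tag` here), which is the main API difference from reading HTML, where `<tr>` is fixed.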

deanm0000 commented 10 months ago

A quick Google search shows a few Rust-based HTML table parsers, so that could be good.