tremor-rs / tremor-runtime

Main Tremor Project Rust Codebase
https://www.tremor.rs
Apache License 2.0

Html extractor #2072

Open happysalada opened 1 year ago

happysalada commented 1 year ago

Describe the problem you are trying to solve

Extract data from an HTML page. Lots of older sites with valuable data don't have an API. Extracting data from HTML with a regex is possible but very inconvenient.

Describe the solution you'd like

An HTML extractor with an API similar to CSS selectors.

Notes

If this is an implementation of an RFC provide a URL to the RFC this enhancement implements.

If this is a major enhancement or contribution an RFC may be required. It is ok to submit an enhancement first and our core team will assist with major contributions. In general, major contributions should be discussed with the community before submission.

Licenser commented 1 year ago

This is quite an interesting idea, I like it! It goes a bit further and might be worth an RFC, as there are some extra things to consider. Once we have an HTML extractor, we will need a structural representation of the data it extracts. That leads to an HTML codec that both decodes HTML into this structure and encodes this structure back into an HTML page (which could be super cool, to be honest).
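To make the idea concrete, here is a rough sketch of what such a structural representation could look like, written in Python with only the standard library. It is not Tremor code and not a proposed API: the names `decode`, `encode`, `select` and `text`, the dict-based node shape, and the tiny selector grammar (`tag`, `.class`, `tag.class`) are all assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def element(tag, attrs=None, children=None):
    """Hypothetical node shape: a plain dict, so it maps easily onto JSON-like values."""
    return {"tag": tag, "attrs": dict(attrs or {}), "children": list(children or [])}


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = element("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Boolean attributes arrive with a value of None; store them as "".
        node = element(tag, {k: v if v is not None else "" for k, v in attrs})
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Whitespace-only text between tags is dropped in this sketch.
        if data.strip():
            self.stack[-1]["children"].append(data)


def decode(html):
    """Decode an HTML string into a list of nodes (dicts for elements, str for text)."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root["children"]


def encode(nodes):
    """Encode a list of nodes back into an HTML string."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(escape(node, quote=False))
            continue
        attrs = "".join(f' {k}="{escape(v)}"' for k, v in node["attrs"].items())
        if node["tag"] in VOID_TAGS:
            out.append(f"<{node['tag']}{attrs}>")
        else:
            out.append(f"<{node['tag']}{attrs}>{encode(node['children'])}</{node['tag']}>")
    return "".join(out)


def select(nodes, selector):
    """Return all elements matching 'tag', '.class' or 'tag.class', in document order."""
    tag, _, cls = selector.partition(".")
    found = []
    for node in nodes:
        if isinstance(node, str):
            continue
        classes = node["attrs"].get("class", "").split()
        if (not tag or node["tag"] == tag) and (not cls or cls in classes):
            found.append(node)
        found.extend(select(node["children"], selector))
    return found


def text(node):
    """Concatenate all text contained in a node."""
    if isinstance(node, str):
        return node
    return "".join(text(child) for child in node["children"])


page = '<table><tr><td class="price">42</td><td class="name">Widget</td></tr></table>'
tree = decode(page)
print([text(n) for n in select(tree, "td.price")])  # ['42']
print(encode(tree) == page)  # True for this input
```

The round trip is only exact for simple input like the one above. A real codec would have to decide how to treat whitespace, comments, doctype declarations and malformed markup, which is the kind of question an RFC could settle.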

@happysalada how do you feel about throwing an RFC up on the topic?

happysalada commented 1 year ago

Let me try to carve out some time for this.

Licenser commented 1 year ago

Awesome, thanks!