oxinabox / DataDepsGenerators.jl

Utility for developers to help define DataDeps registration blocks, for reusing existing Data with DataDeps.jl

Generic Generator based on scraping and filtering all URLs from a page #3

Open oxinabox opened 6 years ago

oxinabox commented 6 years ago

Here is my idea for how to make a generic DataDeps generator for any website.

Have a function that is something like:

generate(website, href_filter, content_filter)

website is a URL; the scraper visits that page and finds all the links on it.

The links are then filtered by href_filter, a function/Regex/Glob that is tested against the href attribute of each link,

and through content_filter, a function/Regex that is tested against the plain-text content of the link (with the tags stripped).

Finally, the results are presented in a multi-select terminal menu (https://github.com/nick-paul/TerminalMenus.jl).

So for example

generate("www.example.com", href=glob"*.zip", content=r".*download.*")

The idea is that most of the time, for pages with only a few links, one can just call it without any filters and then select the wanted links using the terminal menu. But if there are too many links, you can press Ctrl+C and add a filter.
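
The scrape-then-filter steps described above could be sketched roughly as follows. This is only an illustration under some assumptions: Link, extract_links, matches and filter_links are hypothetical names, the HTML is passed in as a string rather than downloaded, links are pulled out with a crude regex instead of a real HTML parser, and Glob patterns are left out (only functions and Regex are handled). The keyword names href and content follow the example call above.

```julia
# Hypothetical sketch; uses only Base Julia (no HTTP or HTML-parsing package).

# A link is its href attribute plus its plain-text content (tags stripped).
struct Link
    href::String
    text::String
end

# Crude regex-based extraction of <a href="...">...</a> elements.
# A real scraper would use a proper HTML parser instead.
function extract_links(html::AbstractString)
    pattern = r"<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>"is
    links = Link[]
    for m in eachmatch(pattern, html)
        # Strip any nested tags to get the plain-text content.
        text = strip(replace(m.captures[2], r"<[^>]*>" => ""))
        push!(links, Link(m.captures[1], text))
    end
    return links
end

# A filter is a predicate function or a Regex; `nothing` keeps everything.
matches(::Nothing, s) = true
matches(f::Function, s) = f(s)
matches(r::Regex, s) = occursin(r, s)

# Keep only the links whose href and text both pass their filters.
function filter_links(links; href=nothing, content=nothing)
    return filter(l -> matches(href, l.href) && matches(content, l.text), links)
end

html = """
<ul>
  <li><a href="data/train.zip">Download <b>training</b> set</a></li>
  <li><a href="about.html">About</a></li>
  <li><a href="data/test.zip">Download test set</a></li>
</ul>
"""

selected = filter_links(extract_links(html); href=r"\.zip$", content=r"download"i)
for l in selected
    println(l.href, "  (", l.text, ")")
end
```

Calling filter_links without keywords returns every link, which matches the "no filters, pick from the menu" workflow; the surviving links would then be handed to the multi-select menu.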

The generated dataset can include the website URL in the message. There may also be some kind of Twitter Card or other header metadata that can be used.
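
As a rough illustration of putting the website URL into the message, the generator's output step might look something like this. registration_block is a hypothetical helper that only renders text; the emitted code uses the DataDep(name, message, remote_path) form from DataDeps.jl, and the exact message layout is an assumption.

```julia
# Hypothetical helper: renders the text of a DataDeps.jl registration block.
# It only builds a string, so DataDeps.jl itself is not needed to run it.
function registration_block(name::AbstractString, website::AbstractString,
                            urls::Vector{String})
    # The message records where the links were scraped from (assumed layout).
    message = "Dataset: $name\nWebsite: $website"
    io = IOBuffer()
    println(io, "register(DataDep(")
    println(io, "    ", repr(name), ",")
    println(io, "    ", repr(message), ",")
    println(io, "    ", repr(urls), ",")
    println(io, "))")
    return String(take!(io))
end

block = registration_block(
    "Example",
    "https://www.example.com",
    ["https://www.example.com/a.zip"],
)
print(block)
```

Using repr for each argument keeps quotes, newlines and dollar signs escaped, so the printed block can be pasted into a package as-is.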

We probably also want a version that takes a list of sites to scrape and combines all their hyperlinks into a Set.
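
A minimal sketch of that multi-site version, assuming some per-site scraping function already exists. collect_links and the scrape argument are hypothetical names, and the scraper here is a stand-in backed by a fixed table so the example runs offline.

```julia
# `scrape` is a hypothetical function mapping one site URL to a collection of hrefs.
function collect_links(scrape::Function, sites)
    all_links = Set{String}()
    for site in sites
        # A Set drops hrefs that appear on more than one site.
        union!(all_links, scrape(site))
    end
    return all_links
end

# Stand-in for real scraping: a fixed table of site => hrefs.
fake_pages = Dict(
    "https://a.example.com" => ["https://a.example.com/x.zip",
                                "https://cdn.example.com/shared.zip"],
    "https://b.example.com" => ["https://b.example.com/y.zip",
                                "https://cdn.example.com/shared.zip"],
)

links = collect_links(site -> fake_pages[site], keys(fake_pages))
println(length(links), " unique links")
```

Because the result is a Set, a file linked from several pages shows up only once in the menu.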

oxinabox commented 6 years ago

Exposing the generic Apache index page scraper as a DataRepo might be a good start for this.