Here is my idea for how to make a generic DataDeps generator for any website.
Have a function that is something like:

`generate(website, href_filter, content_filter)`
`website` is a URL; the scraper visits that website and finds all the links. They are then filtered based on `href_filter`, which is a function/Regex/Glob saying whether the `href` attribute of the link matches, and on `content_filter`, which is a function/Regex matched against the plain-text content (tags stripped).
Then, finally, the results are presented in a multi-select terminal menu (https://github.com/nick-paul/TerminalMenus.jl).
So, for example:

`generate("www.example.com", href=glob"*.zip", content=r".*download.*")`
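The filter arguments could be normalized into predicates via dispatch. This is a hypothetical sketch, not the actual DataDeps API: `to_predicate` and `filter_links` are illustrative names, the links are stand-in named tuples, and Glob support is omitted for brevity.

```julia
# Hypothetical sketch: turn a filter argument (function, Regex, exact
# String, or nothing) into a predicate, then apply both filters to the
# scraped links. Names here are illustrative, not an existing API.
to_predicate(f::Function) = f
to_predicate(r::Regex) = s -> occursin(r, s)
to_predicate(s::AbstractString) = x -> x == s
to_predicate(::Nothing) = _ -> true  # no filter given: keep everything

function filter_links(links; href=nothing, content=nothing)
    href_ok = to_predicate(href)
    content_ok = to_predicate(content)
    [l for l in links if href_ok(l.href) && content_ok(l.text)]
end

# Toy usage with stand-in scraped links:
links = [(href="data.zip", text="Download data"),
         (href="about.html", text="About us")]
filter_links(links; href=r"\.zip$")  # keeps only the zip link
```

Dispatching on the argument type keeps the call site flexible, so users can pass a plain function, a `Regex`, or an exact string without separate keyword names.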
The idea being that most of the time, for pages with only a few links, one can just call it without any filters and then select the links you want using the terminal menu. But if there are too many links you can `ctrl+C` and add a filter. The generated dataset can include the website URL in its message. Maybe there is also some kind of Twitter-card or other header info that can be used.
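The selection step above could lean on the multi-select menu now in Julia's `REPL.TerminalMenus` stdlib (TerminalMenus.jl was later merged into the REPL standard library). A minimal sketch, with `choose_links` as an illustrative name:

```julia
# Sketch of the final interactive step: show the matched links in a
# multi-select menu and return the chosen ones in page order.
using REPL.TerminalMenus

function choose_links(links::Vector{String})
    menu = MultiSelectMenu(links)
    # request returns a Set{Int} of the selected indices
    chosen = request("Select links to turn into DataDeps:", menu)
    links[sort!(collect(chosen))]
end
```

Returning the links in page order (rather than selection order) keeps the generated registration deterministic for a given page.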
We probably also want a version that takes a list of sites to scrape and combines all their hyperlinks into a `Set`.
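The multi-site variant could be a thin wrapper that unions each site's hrefs into a `Set` to deduplicate. Sketch only: `combined_hrefs` is an illustrative name, and `scrape_hrefs` stands in for the real scraping step, which is not implemented here.

```julia
# Hypothetical sketch: scrape each site with a caller-supplied function
# and union the resulting hrefs into a Set, deduplicating across sites.
function combined_hrefs(sites, scrape_hrefs)
    all_links = Set{String}()
    for site in sites
        union!(all_links, scrape_hrefs(site))
    end
    all_links
end

# Toy usage with a fake scraper that returns overlapping links:
fake_scrape(site) = ["https://cdn.example/data.zip", site * "/readme.txt"]
combined_hrefs(["a.com", "b.com"], fake_scrape)
```

Using a `Set` means a file linked from several of the sites only shows up once in the terminal menu.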