Open splendidbug opened 3 months ago
Great start! It's very clear.
I just had a couple of stylistic comments - let me know what you think.
It would also be good to add some minimal tests in the test/ folder
-> the entry point is usually the file test/runtests.jl,
and you'll need to add an [extras] section to your Project.toml - see the parent project for an example.
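For reference, a minimal sketch of what the Project.toml additions could look like (this assumes only the standard Test stdlib is needed; the parent project may use more):

```toml
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
```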
It would be good to cover at least the main pathways of the core functions
@splendidbug How are you thinking about the scope of this PR before we merge it?
I think we can have it open for a while, but I'd like to see the following prior to merge:
* functionality:
  * download web page (can be in memory, or saved to a local file for future reference/re-parsing; it should reflect the webpage structure. HTML file only)
  * parsing the HTML page
  * extracting links from the HTML page
  * filtering pages to only the same domain (?, should be an argument)
  * top-level functionality to run the scrape -> links -> parse loop & repeat on other links in scope
* tests for the main behaviours
It doesn't have to cover all edge cases, but it would be good to cover all these steps, because it will force us to design a good interface/API from the start (and not change it soon).
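The scrape -> links -> parse loop above could be sketched roughly as below. This is only an illustration of the control flow, not the actual DocsScraper API: `extract_links`, `same_domain`, and `crawl` are made-up names, pages are faked as an in-memory `Dict` so the sketch runs without network access, and a real implementation would use Gumbo rather than a regex.

```julia
# Hypothetical helper: extract href values from raw HTML.
# (A real implementation would walk the Gumbo parse tree instead.)
extract_links(html::AbstractString) =
    [m.captures[1] for m in eachmatch(r"href=\"([^\"]+)\"", html)]

# Simple same-domain check; could be replaced by a user-provided filter.
same_domain(url::AbstractString, domain::AbstractString) = startswith(url, domain)

function crawl(pages::Dict{String,String}, start::String; domain::String = start)
    queue = [start]                        # URLs still to visit
    visited = Set{String}()
    parsed = String[]                      # stand-in for parsed blocks
    while !isempty(queue)
        url = popfirst!(queue)
        (url in visited || !haskey(pages, url)) && continue
        push!(visited, url)
        html = pages[url]                  # "download" step (stubbed)
        push!(parsed, html)                # "parse" step (stubbed)
        for link in extract_links(html)    # "extract links" step
            same_domain(link, domain) && !(link in visited) && push!(queue, link)
        end
    end
    return visited, parsed
end

# Tiny in-memory "site": one internal link, one external link, one back-link.
pages = Dict(
    "https://a.com/"   => "<a href=\"https://a.com/p1\">p1</a> <a href=\"https://b.com/x\">ext</a>",
    "https://a.com/p1" => "<a href=\"https://a.com/\">home</a>")
visited, parsed = crawl(pages, "https://a.com/"; domain = "https://a.com")
```

The external `https://b.com/x` link is dropped by the domain filter, which is exactly the argument for making that filter configurable.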
Eg, in separate PRs we should consider checking anti-scraping measures, eg, robots.txt, and make sure we adhere to it. But that's extra and not the core scope of this PR.
WDYT?
Sounds good, and I agree that keeping the PR open to add other functionality is a good idea. Regarding filtering pages to only the same domain - isn't that necessary to avoid virtually infinite URLs? Is there a use-case where it's beneficial to go out of domain?
Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?
> Regarding filtering pages to only the same domain, isn't that necessary to avoid virtually infinite urls? Is there a use-case where it's beneficial to go out of domain?
Yes, we will always need to have some filter. The value in having it user-provided when required is that you will be able to capture multidoc sites, ie, sites that cover multiple packages (so you would want to filter against a list of domains). An example is sciml.ai, I think.
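A user-provided filter for the multidoc case could be as simple as a predicate over a list of allowed domains. A sketch (`keep_url` and the example URLs are made-up for illustration, not the actual DocsScraper API):

```julia
# Keep a URL if it falls under any of the user-supplied domain prefixes,
# e.g. to capture multidoc sites like the SciML ecosystem.
keep_url(url::AbstractString, allowed::Vector{String}) =
    any(d -> startswith(url, d), allowed)

allowed = ["https://docs.sciml.ai", "https://sciml.ai"]
urls = ["https://sciml.ai/news",
        "https://docs.sciml.ai/Overview",
        "https://example.com/x"]
kept = filter(u -> keep_url(u, allowed), urls)   # drops the example.com link
```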
> Also when we start implementing the crawler, we'll have to take care of memory overflow right?

Interesting. Why is that a concern for you? Assuming we scrape one site at a time, we shouldn't get anywhere near RAM limits on a standard laptop (I'd expect MBs of data at most). We can add a serialization step when we run the scraper over 10000s of websites in a loop, but that won't happen, and the user can do it themselves if they have such a huge task.
Or did you have a different concern?
Separately, could you please go through the review and click resolve on the feedback you tackled already? Also, there were some suggestions - in the future, you can just accept those and it will make the changes for you :)
That was my concern. Thanks!
Btw as mentioned on Slack, an easy way to accumulate strings is to pass around an `io = IOBuffer()` (typed as `io::IO`) in the `process_node` function, that the child nodes can add into (eg, `print(io, ...)` or `write(io, ...)`). I prefer `print` somehow - it avoids having to instantiate all the intermediate strings.
You extract them from `io` via `str = String(take!(io))` (which resets the position in `io` / removes all its content, ie, you can do it only once).
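A minimal sketch of the pattern (the `Node` type and `process_node` signature here are illustrative stand-ins, not the actual DocsScraper types):

```julia
# Accumulate text into one shared IOBuffer while walking a tree of nodes,
# instead of concatenating intermediate Strings at every level.
struct Node
    text::String
    children::Vector{Node}
end

function process_node(io::IO, node::Node)
    print(io, node.text)           # writes into the shared buffer, no copies
    for child in node.children
        process_node(io, child)    # children append into the same buffer
    end
end

io = IOBuffer()
tree = Node("Hello", [Node(", ", Node[]), Node("world!", Node[])])
process_node(io, tree)
str = String(take!(io))            # take! empties the buffer: one-shot read
```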
As discussed, I haven't looked at the parser yet.
One small observation - DocsScraper should be its own module and only importing/exporting things. No code definitions in there. Ie,
```julia
module DocsScraper

using AbstractTrees
using Gumbo
using HTTP
using URIs

export x, y, z
include("parser.jl")

export x, y, z
include("...")

end
```
On a separate note, I took it for a spin and parsed docs across several packages -- I haven't verified all in detail, but at least it runs across several doc site types (it required small tweaks).
It integrated nicely into the PromptingTools RAGTools:
```julia
## Load up all Makie docs
dirs = ["makie/Makie.jl-gh-pages/dev",
    "makie/AlgebraOfGraphics.jl-gh-pages/dev",
    "makie/GeoMakie.jl-gh-pages/dev",
    "makie/GraphMakie.jl-gh-pages/dev",
    "makie/MakieThemes.jl-gh-pages/dev",
    "makie/TopoPlots.jl-gh-pages/dev",
    "makie/Tyler.jl-gh-pages/dev"
]
output_chunks = Vector{SubString{String}}()
output_sources = Vector{String}()
for dir in dirs
    @info ">> Directory: $dir"
    # Collect all files under `dir`, then keep only the HTML pages
    files = mapreduce(x -> joinpath.(Ref(x[1]), x[3]), vcat, walkdir(dir))
    files = filter(x -> endswith(x, ".html"), files)
    chunks, sources = RT.get_chunks(DocParserChunker(), files)
    append!(output_chunks, chunks)
    append!(output_sources, sources)
end
length(output_chunks), length(output_sources)
```
I can share the full script of methods added, but what's relevant is probably this. You probably recognize your HTML parser. I had to add support for Documenter, Franklin, VitePress (each catching slightly different HTML object as "content" node). Documented in the comments.
```julia
## HTML parser from txt -> vector of dicts (ie, skips the download)
"Parses an HTML string into a vector of Dicts with text and metadata. Returns: `parsed_blocks` and `title` of the document."
function parse_html_to_blocks(txt::String)
    parsed_blocks = Vector{Dict{String,Any}}()
    heading_hierarchy = Dict{Symbol,Any}()
    r_parsed = parsehtml(txt)
    # Getting the title of the document
    title = [el
             for el in AbstractTrees.PreOrderDFS(r_parsed.root)
             if el isa HTMLElement && tag(el) == :title] .|> text |> Base.Fix2(join, " / ")
    # Content markers:
    # Documenter: div:docs-main, article: content (within div:#documenter)
    # Franklin: div:main -> div:franklin-content (within div:#main)
    # Vitepress: div:#VPContent
    ## Look for the element ID (Vitepress only)
    content_ = [el
                for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                if el isa HTMLElement && getattr(el, "id", nothing) in ["VPContent"]]
    if length(content_) == 0
        ## Fallback: look for a class (Documenter / Franklin)
        content_ = [el
                    for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                    if el isa HTMLElement && getattr(el, "class", nothing) in ["content", "franklin-content"]]
    end
    if length(content_) > 0
        process_node!(only(content_), heading_hierarchy, parsed_blocks)
    end
    return parsed_blocks, title
end
```
This is not a reference implementation, just a quick hack to test it out. I'm saving it here for future reference.
Gotcha, will make changes
This is a new subrepo dedicated to crawling, scraping, parsing and chunking of Julia's documentation (GSOC project).
DocsScraper contains the DocsScraper.jl file, where the parser code is implemented.
Usage:

```julia
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
```

Returns: a Vector of Dicts containing Heading/Text/Code, along with a Dict of the respective metadata.
Requirements: HTTP, Gumbo, AbstractTrees, URIs