Open splendidbug opened 3 months ago
Great start! It's very clear.
I just had a couple of stylistic comments - let me know what you think.
It would also be good to add some minimal tests in the test/ folder
-> the entry point is usually the file test/runtests.jl,
and you'll need to add an [extras] section to your Project.toml - see the parent project for an example.
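For reference, a minimal sketch of what the Project.toml additions could look like (this assumes only the standard Test stdlib is needed; the parent project may use more):

```toml
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
```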
It would be good to cover at least the main pathways of the core functions
@splendidbug How are you thinking about the scope of this PR before we merge it?
I think we can have it open for a while, but I'd like to see the following prior to merge:
* functionality:
  * download web page (can be in memory, or saved to a local file for future reference/re-parsing; it should reflect the webpage structure. HTML file only)
  * parsing the HTML page
  * extracting links from the HTML page
  * filtering pages to only the same domain (?, should be an argument)
  * top-level functionality to run the scrape -> links -> parse loop & repeat on other links in scope
* tests for the main behaviours
It doesn't have to cover all edge cases, but it would be good to cover all these steps, because it will force us to design a good interface/API from the start (and not change it soon).
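The scrape -> links -> parse loop above could be sketched roughly as below. This is only an illustration of the control flow, not the actual DocsScraper API: `extract_links`, `same_domain`, and `crawl` are made-up names, pages are faked as an in-memory `Dict` so the sketch runs without network access, and a real implementation would use Gumbo rather than a regex.

```julia
# Hypothetical helper: extract href values from raw HTML.
# (A real implementation would walk the Gumbo parse tree instead.)
extract_links(html::AbstractString) =
    [m.captures[1] for m in eachmatch(r"href=\"([^\"]+)\"", html)]

# Simple same-domain check; could be replaced by a user-provided filter.
same_domain(url::AbstractString, domain::AbstractString) = startswith(url, domain)

function crawl(pages::Dict{String,String}, start::String; domain::String = start)
    queue = [start]                        # URLs still to visit
    visited = Set{String}()
    parsed = String[]                      # stand-in for parsed blocks
    while !isempty(queue)
        url = popfirst!(queue)
        (url in visited || !haskey(pages, url)) && continue
        push!(visited, url)
        html = pages[url]                  # "download" step (stubbed)
        push!(parsed, html)                # "parse" step (stubbed)
        for link in extract_links(html)    # "extract links" step
            same_domain(link, domain) && !(link in visited) && push!(queue, link)
        end
    end
    return visited, parsed
end

# Tiny in-memory "site": one internal link, one external link, one back-link.
pages = Dict(
    "https://a.com/"   => "<a href=\"https://a.com/p1\">p1</a> <a href=\"https://b.com/x\">ext</a>",
    "https://a.com/p1" => "<a href=\"https://a.com/\">home</a>")
visited, parsed = crawl(pages, "https://a.com/"; domain = "https://a.com")
```

The external `https://b.com/x` link is dropped by the domain filter, which is exactly the argument for making that filter configurable.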
Eg, in separate PRs we should consider checking anti-scraping measures, eg, robots.txt, and make sure we adhere to it. But that's extra and not the core scope of this PR.
WDYT?
Sounds good, and I agree that keeping the PR open to add other functionality is a good idea. Regarding filtering pages to only the same domain - isn't that necessary to avoid virtually infinite URLs? Is there a use-case where it's beneficial to go out of domain?
Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?
> Regarding filtering pages to only the same domain, isn't that necessary to avoid virtually infinite urls? Is there a use-case where it's beneficial to go out of domain?
Yes, we will always need to have some filter. The value in having it user-provided when required is that you will be able to capture multidoc sites, ie, sites that cover multiple packages (so you would want to filter against a list of domains). An example is sciml.ai, I think.
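A user-provided filter for the multidoc case could be as simple as a predicate over a list of allowed domains. A sketch (`keep_url` and the example URLs are made-up for illustration, not the actual DocsScraper API):

```julia
# Keep a URL if it falls under any of the user-supplied domain prefixes,
# e.g. to capture multidoc sites like the SciML ecosystem.
keep_url(url::AbstractString, allowed::Vector{String}) =
    any(d -> startswith(url, d), allowed)

allowed = ["https://docs.sciml.ai", "https://sciml.ai"]
urls = ["https://sciml.ai/news",
        "https://docs.sciml.ai/Overview",
        "https://example.com/x"]
kept = filter(u -> keep_url(u, allowed), urls)   # drops the example.com link
```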
> Also when we start implementing the crawler, we'll have to take care of memory overflow right?

Interesting. Why is that a concern for you? Assuming we scrape one site at a time, we shouldn't get anywhere near RAM limits on a standard laptop (I'd expect MBs of data at most). We can add a serialization step when we run the scraper over 10000s of websites in a loop, but that won't happen, and the user can do it themselves if they have such a huge task.
Or did you have a different concern?
Separately, could you please go through the review and click resolve on the feedback you tackled already? Also, there were some suggestions - in the future, you can just accept those and it will make the changes for you :)
That was my concern. Thanks!
Btw as mentioned on Slack, an easy way to accumulate strings is to pass around an `io = IOBuffer()` (typed as `io::IO`) in the `process_node` function, that the child nodes can add into (eg, `print(io, ...)` or `write(io, ...)`). I prefer `print` somehow - it avoids having to instantiate all the intermediate strings.
You extract them from `io` via `str = String(take!(io))` (which resets the position in `io` / removes all its content, ie, you can do it only once).
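A minimal sketch of the pattern (the `Node` type and `process_node` signature here are illustrative stand-ins, not the actual DocsScraper types):

```julia
# Accumulate text into one shared IOBuffer while walking a tree of nodes,
# instead of concatenating intermediate Strings at every level.
struct Node
    text::String
    children::Vector{Node}
end

function process_node(io::IO, node::Node)
    print(io, node.text)           # writes into the shared buffer, no copies
    for child in node.children
        process_node(io, child)    # children append into the same buffer
    end
end

io = IOBuffer()
tree = Node("Hello", [Node(", ", Node[]), Node("world!", Node[])])
process_node(io, tree)
str = String(take!(io))            # take! empties the buffer: one-shot read
```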
As discussed, I haven't looked at the parser yet.
One small observation - DocsScraper should be its own module and only importing/exporting things. No code definitions in there. Ie,
```julia
module DocsScraper

using AbstractTrees
using Gumbo
using HTTP
using URIs

export x, y, z
include("parser.jl")

export x, y, z
include("...")

end
```
On a separate note, I took it for a spin and parsed docs across several packages -- I haven't verified all in detail, but at least it runs across several doc site types (it required small tweaks).
It integrated nicely into the PromptingTools RAGTools:
```julia
## Load up all Makie docs
dirs = ["makie/Makie.jl-gh-pages/dev",
    "makie/AlgebraOfGraphics.jl-gh-pages/dev",
    "makie/GeoMakie.jl-gh-pages/dev",
    "makie/GraphMakie.jl-gh-pages/dev",
    "makie/MakieThemes.jl-gh-pages/dev",
    "makie/TopoPlots.jl-gh-pages/dev",
    "makie/Tyler.jl-gh-pages/dev"
]
output_chunks = Vector{SubString{String}}()
output_sources = Vector{String}()
for dir in dirs
    @info ">> Directory: $dir"
    # Collect all files under `dir`, then keep only the HTML pages
    files = mapreduce(x -> joinpath.(Ref(x[1]), x[3]), vcat, walkdir(dir))
    files = filter(x -> endswith(x, ".html"), files)
    chunks, sources = RT.get_chunks(DocParserChunker(), files)
    append!(output_chunks, chunks)
    append!(output_sources, sources)
end
length(output_chunks), length(output_sources)
```
I can share the full script of methods added, but what's relevant is probably this. You probably recognize your HTML parser. I had to add support for Documenter, Franklin, VitePress (each catching slightly different HTML object as "content" node). Documented in the comments.
```julia
## HTML parser from txt -> vector of dicts (ie, skips the download)
"Parses an HTML string into a vector of Dicts with text and metadata. Returns: `parsed_blocks` and `title` of the document."
function parse_html_to_blocks(txt::String)
    parsed_blocks = Vector{Dict{String,Any}}()
    heading_hierarchy = Dict{Symbol,Any}()
    r_parsed = parsehtml(txt)
    # Getting the title of the document
    title = [el
             for el in AbstractTrees.PreOrderDFS(r_parsed.root)
             if el isa HTMLElement && tag(el) == :title] .|> text |> Base.Fix2(join, " / ")
    # Content markers:
    # Documenter: div:docs-main, article: content (within div:#documenter)
    # Franklin: div:main -> div:franklin-content (within div:#main)
    # Vitepress: div:#VPContent
    ## Look for the element ID (Vitepress only)
    content_ = [el
                for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                if el isa HTMLElement && getattr(el, "id", nothing) in ["VPContent"]]
    if length(content_) == 0
        ## Fallback: look for a class (Documenter / Franklin)
        content_ = [el
                    for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                    if el isa HTMLElement && getattr(el, "class", nothing) in ["content", "franklin-content"]]
    end
    if length(content_) > 0
        process_node!(only(content_), heading_hierarchy, parsed_blocks)
    end
    return parsed_blocks, title
end
```
This is not a reference implementation, just a quick hack to test it out. I'm saving it here for future reference.
Gotcha, will make changes
This is a new subrepo dedicated to crawling, scraping, parsing and chunking of Julia's documentation (GSOC project).
DocsScraper contains the DocsScraper.jl file, where the parser code is implemented.
Usage:

```julia
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
```

Returns: a Vector of Dicts containing Heading/Text/Code, along with a Dict of the respective metadata.
Requirements: HTTP, Gumbo, AbstractTrees, URIs