(demo) scrape arxiv for list of publications using Gumbo + HTTP

tlienart commented 3 years ago

Here's a solution based on a discussion on Slack:

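using Gumbo, HTTP

# Fetch and parse the arXiv author page
r = HTTP.get("https://arxiv.org/a/soejima_t_1.html")
r_parsed = parsehtml(String(r.body))
body = r_parsed.root[2]  # root[1] is <head>, root[2] is <body>

# Position-based path to the article list; tied to the page layout at the time of writing
listing = body[1][3][2][1][2]
n_articles = length(listing.children)
article(k) = listing[k][2]
title(k) = strip(article(k)[1][1][2].text)

# Author names are the text of the <a> elements in the authors block
raw_authors(k) = [e for e in article(k)[1][2].children if e isa HTMLElement{:a}]
name(auth) = strip(auth[1].text)
authors(k) = name.(raw_authors(k))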

This gives:

julia> title(1)
"Efficient simulation of moire materials using the density matrix  renormalization group"
julia> authors(1)
5-element Array{SubString{String},1}:
 "Tomohiro Soejima"
 "Daniel E. Parker"
 "Nick Bultinck"
 "Johannes Hauschild"
 "Michael P. Zaletel"
julia> foreach(x -> println("- ", title(x)), 1:n_articles)
- Efficient simulation of moire materials using the density matrix  renormalization group
- Isometric Tensor Network representation of string-net liquids
- First-Principles Design of a Half-Filled Flat Band of the Kagome Lattice  in Two-Dimensional Metal-Organic Frameworks
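
Note that `strip` only trims the ends of a string, so the double spaces inside the titles above (presumably inherited from the page source) survive. A minimal sketch of one way to collapse them, using only Base; `normalize_title` is a hypothetical helper name, not part of the snippet above:

```julia
# Hypothetical helper: trim the ends and collapse any run of whitespace to one space.
normalize_title(s) = replace(strip(s), r"\s+" => " ")

raw = "Efficient simulation of moire materials using the density matrix  renormalization group"
clean = normalize_title(raw)
println(clean)
```

It could be applied as `normalize_title(title(k))` if the titles are going to be printed on a page.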

The links could easily be extracted as well, which would give a way to build a list of publications automatically.
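
A rough sketch of what collecting the links might look like with Gumbo alone. The markup below is a made-up stand-in, not the real arXiv page, and both the helper name `collect_links!` and the assumption that abstract links contain `/abs/` are illustrative only:

```julia
using Gumbo

# Hypothetical markup standing in for an arXiv listing; the real page layout may differ.
html = """
<html><body>
  <a href="/abs/0000.00001" title="Abstract">arXiv:0000.00001</a>
  <a href="/pdf/0000.00001">pdf</a>
  <a href="/abs/0000.00002" title="Abstract">arXiv:0000.00002</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

# Recursively collect the href of every <a> element at or below `node`.
function collect_links!(links, node)
    node isa HTMLElement || return links  # text nodes carry no attributes or children
    if node isa HTMLElement{:a} && haskey(attrs(node), "href")
        push!(links, String(attrs(node)["href"]))
    end
    foreach(child -> collect_links!(links, child), node.children)
    return links
end

all_links = collect_links!(String[], parsehtml(html).root)

# Assumption: abstract pages are the links whose href contains "/abs/".
abstract_links = filter(href -> occursin("/abs/", href), all_links)
foreach(println, abstract_links)
```

Walking the whole tree avoids hard-coding child positions, at the cost of having to filter out unrelated links afterwards.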

tlienart commented 3 years ago

Avik recommends Cascadia.jl for descending the tree a bit more cleanly.

tomohiro-soejima commented 3 years ago

Here's a version using Cascadia.jl:

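using Cascadia, Gumbo, HTTP

# Fetch and parse the arXiv author page
r = HTTP.get("https://arxiv.org/a/warner_s_1.html")
h = parsehtml(String(r.body))

# Each element with class "meta" holds one article's metadata
sm = Selector(".meta")
articles = eachmatch(sm, h.root)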

function getauthors(article)
    sm = Selector(".list-authors")
    # `only` throws unless there is exactly one authors block in the article
    raw_authors = eachmatch(sm, article) |> only
    # Each <a> child wraps one author name
    authors = [children(author)[1] |> text |> strip
               for author in children(raw_authors) if author isa HTMLElement{:a}]
    return authors
end

function gettitle(article)
    sm = Selector(".list-title")
    raw_title = eachmatch(sm, article) |> only
    # The second child of the title block holds the title text
    title = raw_title[2] |> text |> strip
    return title
end

authors = getauthors.(articles)
titles = gettitle.(articles)

# Build one \publishedarticle{authors}{title} line per article
function article_list(authors, titles)
    s = ""
    for (article_authors, title) in zip(authors, titles)
        author_list = join(article_authors, ", ")
        s *= "\\publishedarticle{$author_list}{$title}\n"
    end
    return s
end

s = article_list(authors, titles)
println(s)

which gives:

\publishedarticle{Bernhard Haslhofer, Simeon Warner, Carl Lagoze, Martin Klein, Robert Sanderson, Michael L. Nelson, Herbert van de Sompel}{ResourceSync: Leveraging Sitemaps for Resource Synchronization}
\publishedarticle{Simeon Warner}{Author Identifiers in Scholarly Repositories}
\publishedarticle{Carl Lagoze, Herbert Van de Sompel, Michael Nelson, Simeon Warner, Robert Sanderson, Pete Johnston}{A Web-Based Resource Model for eScience: Object Reuse & Exchange}
\publishedarticle{Carl Lagoze, Herbert Van de Sompel, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston}{Object Re-Use & Exchange: A Resource-Centric Approach}
\publishedarticle{Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg}{Plagiarism Detection in arXiv}
......

tlienart commented 3 years ago

Awesome!! I'll add that soon

tlienart / Franklin.jl, issue #711