Open tlienart opened 3 years ago
Avik recommends Cascadia.jl to descend the tree a bit more cleanly
Here's a version using Cascadia.jl:
using Cascadia, Gumbo, HTTP

# Fetch and parse the arXiv author page
r = HTTP.get("https://arxiv.org/a/warner_s_1.html")
h = parsehtml(String(r.body))

# Each article's metadata sits in an element with class "meta"
sm = Selector(".meta")
articles = eachmatch(sm, h.root)

function getauthors(article)
    sm = Selector(".list-authors")
    raw_authors = eachmatch(sm, article) |> only
    # Keep only the <a> children; each one wraps a single author name
    authors = [children(author)[1] |> text |> strip for author in children(raw_authors) if author isa HTMLElement{:a}]
    return authors
end

function gettitle(article)
    sm = Selector(".list-title")
    raw_title = eachmatch(sm, article) |> only
    # Child 2 holds the title text (child 1 is presumably the "Title:" label)
    title = raw_title[2] |> text |> strip
    return title
end

authors = getauthors.(articles)
titles = gettitle.(articles)

# Build one \publishedarticle{authors}{title} line per article
function article_list(authors, titles)
    s = ""
    for (author, title) in zip(authors, titles)
        author_array = join(author, ", ")
        s *= "\\publishedarticle{$author_array}{$title}\n"
    end
    return s
end

s = article_list(authors, titles)
println(s)
which gives
\publishedarticle{Bernhard Haslhofer, Simeon Warner, Carl Lagoze, Martin Klein, Robert Sanderson, Michael L. Nelson, Herbert van de Sompel}{ResourceSync: Leveraging Sitemaps for Resource Synchronization}
\publishedarticle{Simeon Warner}{Author Identifiers in Scholarly Repositories}
\publishedarticle{Carl Lagoze, Herbert Van de Sompel, Michael Nelson, Simeon Warner, Robert Sanderson, Pete Johnston}{A Web-Based Resource Model for eScience: Object Reuse & Exchange}
\publishedarticle{Carl Lagoze, Herbert Van de Sompel, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston}{Object Re-Use & Exchange: A Resource-Centric Approach}
\publishedarticle{Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg}{Plagiarism Detection in arXiv}
......
Awesome!! I'll add that soon
Here's a solution based on a discussion on Slack:
this gives
It would be easy to grab the links as well, and that would give a way to generate a list of publications automatically.
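For the links, something along these lines might work. This is only a sketch: the HTML fragment below is a made-up stand-in for the arXiv listing, and the `.list-identifier` class and the `/abs/...` href layout are assumptions about the page markup that should be checked against the real page before relying on them.

```julia
using Cascadia, Gumbo

# Hypothetical fragment imitating one entry of the listing;
# the class name and href layout are assumptions, not verified markup.
html = """
<dl>
  <dt><span class="list-identifier"><a href="/abs/1305.1476" title="Abstract">arXiv:1305.1476</a></span></dt>
  <dd><div class="meta"><div class="list-title">Title: ResourceSync</div></div></dd>
</dl>
"""

h = parsehtml(html)

# Select every <a> inside an element with class "list-identifier"
anchors = eachmatch(Selector(".list-identifier a"), h.root)

# The hrefs are site-relative, so prepend the host
links = ["https://arxiv.org" * getattr(a, "href") for a in anchors]

println(links)
```

If the real page lists several anchors per entry (pdf, other formats), the selector would need narrowing so that `links` lines up one-to-one with `titles` before zipping them together.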