ipfs / apps

Coordinating writing apps on top of ipfs, and their concerns.

arXiv #1

Open davidar opened 9 years ago

davidar commented 9 years ago

Now that the first Creative Commons complete arXiv dataset has been published (ipfs/archives#2), it's time to build some cool apps on top of it!

Some ideas (please add more in the comments):

- publishing papers in alternative formats (html, etc)
- building a citation graph
- training a topic model
- automatically extracting definitions to build a dictionary of terminology
- semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)

CC: @jbenet @rht


rht commented 9 years ago

To be more precise, s/arXiv dataset/CC arXiv dataset/.

publishing papers in alternative formats (html, etc)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

building a citation graph

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.
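A rough sketch of what that could look like, wrapping the go-ipfs CLI; the `/doi/<doi>` path layout and the `references.txt` listing are purely hypothetical:

```python
import subprocess

def fetch_with_citations(doi, depth=2, seen=None):
    """Pin a paper and, recursively, the papers it cites, up to `depth` hops."""
    seen = seen if seen is not None else set()
    if depth < 0 or doi in seen:
        return
    seen.add(doi)
    # Assumed layout: /doi/<doi> resolves (e.g. via IPNS/DNSLink) to a directory
    # holding the fulltext plus a references.txt with one cited DOI per line.
    subprocess.run(["ipfs", "pin", "add", f"/doi/{doi}"], check=False)
    refs = subprocess.run(["ipfs", "cat", f"/doi/{doi}/references.txt"],
                          capture_output=True, text=True, check=False)
    for cited in refs.stdout.splitlines():
        if cited.strip():
            fetch_with_citations(cited.strip(), depth - 1, seen)
```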

training a topic model

https://blog.lateral.io/2015/07/harvesting-research-arxiv/ uses the abstract dataset to create a recommender system. Though I find arXiv's own search results to be far more relevant (test case: piron+lattice).
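For reference, a minimal sketch of the topic-model side using scikit-learn's LDA, assuming `abstracts` is a list of plain-text abstract strings pulled from the metadata dump:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_model(abstracts, n_topics=50):
    """Fit a bag-of-words LDA topic model over a list of abstract strings."""
    vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
    counts = vectorizer.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # one row per paper, one column per topic
    return vectorizer, lda, doc_topics
```

Nearest neighbours in the `doc_topics` space would give a crude "related papers" list; whether that beats arXiv's own search is exactly the question above.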

(to be continued...)

rht commented 9 years ago

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data. Even though terms in scientific papers are expected to be more 'regular' than those in general literature, demarcating 'real' papers is not something a mere mortal could do, never mind machines (e.g. arXiv vs snarXiv). This gets worse when authors have a tendency to sound like the latter (e.g. http://arxiv.org/abs/hep-th/0003075, or a report on a certain Greek island).

Established scientific terms are definitely more regular than terms in the rest of the language [1].

There are definition environments in maths papers that can be extracted, but they tend to define localized notions rather than the global terms a dictionary needs.
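A crude sketch of pulling those environments out of the TeX source (the environment names are just common author conventions, and nested environments are ignored):

```python
import re

# Match \begin{definition}...\end{definition} and a couple of common variants.
DEF_RE = re.compile(
    r"\\begin\{(definition|defn|defi)\*?\}(.*?)\\end\{\1\*?\}",
    re.DOTALL,
)

def extract_definitions(tex_source):
    """Return the raw bodies of definition-like environments in a LaTeX string."""
    return [body.strip() for _, body in DEF_RE.findall(tex_source)]
```

Even then, as noted above, what you get is mostly paper-local notions rather than dictionary-ready terminology.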

semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)

THIS. (On disambiguating notation: using NLP to reverse-engineer the source code back is tough. Another path is to tell people to actually use unambiguous notation, as in e.g. "Calculus on Manifolds", SICM, "Functional Differential Geometry" [2].)

[1] Putnam suggested a "division of linguistic labor" (done by the scientists of each field) to define meaning: http://libgen.io/get.php?md5=C931B36BCE8C21DA613AC02C40F634DC (what to do with this type of link?)

[2] http://mitpress.mit.edu/sites/default/files/titles/content/sicm/book-Z-H-79.html#%_chap_8
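A very naive, GOFAI-flavoured heuristic for the "connecting variables to their textual descriptions" part above, just to make the problem concrete (real disambiguation needs far more context than these two patterns):

```python
import re

# Look for "where $x$ is/denotes ..." and "let $x$ be/denote ..." in the text.
PATTERNS = [
    re.compile(r"where\s+\$([^$]+)\$\s+(?:is|denotes)\s+([^.,;]+)", re.IGNORECASE),
    re.compile(r"let\s+\$([^$]+)\$\s+(?:be|denote)\s+([^.,;]+)", re.IGNORECASE),
]

def symbol_glossary(text):
    """Map each math symbol to the first textual description found for it."""
    glossary = {}
    for pattern in PATTERNS:
        for symbol, description in pattern.findall(text):
            glossary.setdefault(symbol.strip(), description.strip())
    return glossary

symbol_glossary("Let $\\beta$ be the inverse temperature, where $Z$ denotes "
                "the partition function.")
# -> {'Z': 'the partition function', '\\beta': 'the inverse temperature'}
```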

rht commented 9 years ago

It's already terse, but the tl;dr is whether to approach the data using ML or GOFAI.

zignig commented 9 years ago

AWESOME! :+1:

davidar commented 9 years ago

@rht glad to see I'm not the only one who's been thinking about this :)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

See http://dlmf.nist.gov/LaTeXML and the (now defunct, I suspect) https://trac.kwarc.info/arXMLiv
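For completeness, a tiny wrapper around the LaTeXML toolchain (assumes `latexml` and `latexmlpost` are installed locally; double-check the flags against their documentation):

```python
import subprocess

def latexml_convert(tex_path):
    """Convert a LaTeX source file to XML, then post-process the XML to HTML."""
    subprocess.run(["latexml", "--destination=paper.xml", tex_path], check=True)
    subprocess.run(["latexmlpost", "--destination=paper.html", "paper.xml"],
                   check=True)
```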

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.

That would be cool. I also want to port https://github.com/davidar/bib to IPFS; it currently uses a half-baked content-addressable storage for fulltext.
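A minimal sketch of what swapping that for IPFS might look like (the JSON index mapping BibTeX keys to hashes is just an assumption about how bib could store things, not its actual design):

```python
import json
import subprocess

def add_fulltext(bib_key, pdf_path, index_path="fulltext.json"):
    """Add a fulltext file to IPFS and record its hash against a BibTeX key."""
    out = subprocess.run(["ipfs", "add", "-q", pdf_path],
                         capture_output=True, text=True, check=True)
    ipfs_hash = out.stdout.strip().splitlines()[-1]
    try:
        with open(index_path) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}
    index[bib_key] = ipfs_hash
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return ipfs_hash
```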

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data. [...] on disambiguating notation: using nlp to reverse-engineer [...]

Yeah, it wouldn't be trivial, but it's a field that has been studied, e.g.:

Another path is to tell people to actually use unambiguous notations

That would be nice (see http://www.texmacs.org/joris/semedit/semedit.html ), but probably not going to happen on a large scale.

tl;dr is whether to approach the data using ML or GOFAI.

My area is probabilistic (Bayesian) machine learning, but simpler approaches may well be Good Enough for some of this.

rht commented 9 years ago

tex -> html: In the past, people have been using tex4ht / tex2page.

tex -> xml (for parsing): Someone has to do it eventually... Pandoc (the "llvm" of markup languages) currently has less LaTeX coverage than LaTeXML, but has a better foundation (especially compared with Perl scripts) and connects with other markup languages. Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of the vim -> neovim refactor (then what about the scale of re-engineering the web? ...don't ask which leg moves after which).
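For comparison with the LaTeXML pipeline sketched above, the pandoc route is a one-liner per target format; how much of arXiv-style LaTeX survives the conversion is exactly the open question:

```python
import subprocess

def pandoc_convert(tex_path):
    """Convert LaTeX to standalone HTML (MathJax for equations) and to Markdown."""
    subprocess.run(["pandoc", tex_path, "-s", "--mathjax", "-o", "paper.html"],
                   check=True)
    subprocess.run(["pandoc", tex_path, "-s", "-o", "paper.md"], check=True)
```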

(to be continued...)

davidar commented 9 years ago

Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of the vim -> neovim refactor

Keep in mind that, if you took care of the top 7 macros in that list, the remaining macros are used in less than 1% of papers, so it's not too bad. As it is, it's still less than 3%.

dginev commented 8 years ago

@rht said:

Pandoc (the "llvm" of markup lang) currently has less latex coverage than latexml, but has better foundation (especially when vs perl scripts) and it connects with other markup langs.

Which foundation is that? I happen to be biased on the subject, but when it comes to dealing with arXiv, pandoc's coverage can only be described as "basic". When it comes to evaluating TeX programs (which LaTeX papers are), it is pandoc's TeX reader that you could qualify as "Haskell scripts", while LaTeXML has a fully fleshed-out implementation of a TeX engine as its foundation.

Pandoc indeed has a very powerful model that allows readers and writers of concrete syntaxes to be connected via an internal abstract syntax. In fact, if the abstract model evolves to match the coverage of LaTeXML's XML schema, it could be a wonderful post-processor for the LaTeXML ecosystem. And vice versa, LaTeXML could be a wonderful "TeX reader" for pandoc. Personally, I would be quite curious to see the two projects interoperate, as they focus on, and excel at, different problems.

davidar commented 8 years ago

@dginev what needs to be added to pandoc's abstract model to support LaTeXML?

dginev commented 8 years ago

Keep in mind I am not an expert in the Pandoc model. But I have seen the occasional comment suggesting that Pandoc wants to "stay pure" and be restricted in coverage in certain respects. To quote one closed issue:

Remember, pandoc is about document structure. CSS is about details of presentation. To change the size of headers in tex output, you could use a custom latex template (see the documentation).

That is a very noble sentiment, and LaTeXML mostly shares it, but remains open to eventually covering all of TeX's pathological typesetting acrobatics. Here is a relatively simple example of what I have in mind.

The restrictions pandoc is self-imposing keep it quite elegant and make tying in different syntaxes manageable, but they also limit the depth of the support. To meaningfully handle arXiv, compromising on elegance in order to have enough expressivity is a necessity; otherwise you end up losing half (or more) of the content.

But even when it comes to document structure, I am unsure how far pandoc has gone in supporting the "advanced" structures out there: indexes, glossaries, bibliographies, wacky constructions such as inline blocks, math in TeX graphics (e.g. TikZ images that convert to SVG+MathML), and so on. In this respect I find tex4ht to be a much more impressive and suitable comparison for LaTeXML, and both dedicated TeX converters strive to get better at covering the full spectrum of structure and style that TeX allows for, and that authors use on a regular basis.

dginev commented 8 years ago

On a more pragmatic note, I am sure a LaTeXML-reader integration for pandoc is already possible today, by just mapping whatever structure currently overlaps. It may be a rather nice and simple project actually; it's just matching an XML DOM in LaTeXML's schema to Pandoc's internal data structures. I will think of spending a few hours and doing that on a weekend.
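As a toy illustration of that mapping (the `ltx` element names and the pandoc API version are assumptions to check against the actual LaTeXML schema and the installed pandoc):

```python
import json
from lxml import etree

LTX = "{http://dlmf.nist.gov/LaTeXML}"

def text_inlines(text):
    """Turn a plain string into pandoc Str/Space inline nodes."""
    inlines = []
    for i, word in enumerate((text or "").split()):
        if i:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": word})
    return inlines

def latexml_to_pandoc(xml_path):
    """Map plain paragraphs from a LaTeXML document to a pandoc JSON AST."""
    doc = etree.parse(xml_path)
    blocks = [{"t": "Para", "c": text_inlines("".join(p.itertext()))}
              for p in doc.iter(LTX + "p")]  # math, tables, figures etc. omitted
    return {"pandoc-api-version": [1, 23], "meta": {}, "blocks": blocks}

# json.dumps(latexml_to_pandoc("paper.xml")) could then be piped into
# `pandoc -f json -t markdown`, provided the API versions line up.
```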

davidar commented 8 years ago

@dginev It would be really cool to have a "universal" document model, but as you say, it's a tricky problem. I've spent a little bit of time thinking about it in the past, but can't say I came up with any solutions :(

I will think of spending a few hours and doing that on a weekend.

Awesome, let me know how you go :)

I can help with things on the Haskell side, if necessary. I'll leave the Perl stuff to you though :p