kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.42k stars, 444 forks

[Feature idea] Extract external links (github, dataset, ...) #167

Open thomasopsomer opened 7 years ago

thomasopsomer commented 7 years ago

Hey,

It would be nice to extract all external links in the PDF (in the text or in footnotes), for instance links to GitHub repositories or to online datasets. Just an idea :)

kermitt2 commented 7 years ago

Hi Thomas!

They are normally already extracted. In the latest version of GROBID, all link annotations, both external web links and internal GOTO links within the document, are extracted as PDFAnnotation objects. However, they are not yet output in the TEI. It will come!

thomasopsomer commented 7 years ago

Ah, great! I looked at the TEI but not the API before asking!

thomasopsomer commented 7 years ago

By "in the latest version of GROBID", do you mean the stable 0.4.1 or the master branch of the repo?

To use it directly with the library: given a Document object, can I retrieve the external links with getPDFAnnotations?

kermitt2 commented 7 years ago

The master branch, 0.4.2-SNAPSHOT.

thomasopsomer commented 7 years ago

+1 it's working :)
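
For readers who want to pick out specific external links (for example GitHub repositories) once the annotations are retrieved, here is a minimal sketch of the filtering step. It does not use GROBID itself: `LinkAnnotation`, `LinkType` and `externalLinks` are hypothetical stand-ins invented for illustration, and the real `PDFAnnotation` class on the master branch may expose different fields and accessors, so check its source before adapting this.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ExternalLinkFilter {

    // Hypothetical link kinds, mirroring the external/GOTO distinction in the thread.
    enum LinkType { EXTERNAL_URI, INTERNAL_GOTO }

    // Hypothetical stand-in for a PDF link annotation; NOT GROBID's PDFAnnotation.
    static final class LinkAnnotation {
        final LinkType type;
        final String destination;

        LinkAnnotation(LinkType type, String destination) {
            this.type = type;
            this.destination = destination;
        }
    }

    // Returns destinations of external links whose host equals hostSuffix or is a
    // subdomain of it. Passing null for hostSuffix keeps every external link
    // that has a parseable host.
    static List<String> externalLinks(List<LinkAnnotation> annotations, String hostSuffix) {
        List<String> result = new ArrayList<>();
        for (LinkAnnotation a : annotations) {
            if (a.type != LinkType.EXTERNAL_URI || a.destination == null) {
                continue; // skip internal GOTO links and empty targets
            }
            String target = a.destination.trim();
            String host;
            try {
                host = URI.create(target).getHost();
            } catch (IllegalArgumentException e) {
                continue; // malformed URL in the PDF, ignore it
            }
            if (host == null) {
                continue; // e.g. mailto: links have no host
            }
            host = host.toLowerCase(Locale.ROOT);
            if (hostSuffix == null || host.equals(hostSuffix) || host.endsWith("." + hostSuffix)) {
                result.add(target);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<LinkAnnotation> annotations = new ArrayList<>();
        annotations.add(new LinkAnnotation(LinkType.EXTERNAL_URI, "https://github.com/kermitt2/grobid"));
        annotations.add(new LinkAnnotation(LinkType.EXTERNAL_URI, "https://grobid.readthedocs.io"));
        annotations.add(new LinkAnnotation(LinkType.INTERNAL_GOTO, "page 3"));
        System.out.println(externalLinks(annotations, "github.com"));
    }
}
```

With GROBID, the list passed to `externalLinks` would be built from whatever `getPDFAnnotations` returns on the Document, after mapping each annotation's type and destination onto the stand-in class above.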