ropensci / unconf16

rOpenSci's San Francisco hackathon/unconf 2016
http://unconf16.ropensci.org

How to attribute a location to a scientific article #8

Closed maelle closed 8 years ago

maelle commented 8 years ago

Using packages to access metadata from articles one can look for keywords but I wonder if it would be possible to write code for:

The motivation is, e.g., to be able to say "Out of X articles about air pollution, Y studies were performed in Europe" (see https://twitter.com/sciencerely/status/476390715451142144, where the data was obtained by counting how many hits there were for "air pollution + city name"; with location recognition it would be easier to do this systematically). One could also look at the gender of authors (there's an R package for that, if I recall correctly) by country or year.
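As a rough illustration of the counting idea above, here is a minimal sketch in Python (the thread targets R, so read it as an outline of the approach, not a finished tool). The gazetteer, the abstracts, and the function names are all made up for the example; a real run would need a far larger place list and would still miss or mislabel some places.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: city name -> continent. A real run would use
# a much larger list (for example one derived from GeoNames).
CITY_TO_CONTINENT = {
    "Paris": "Europe",
    "Barcelona": "Europe",
    "Beijing": "Asia",
    "Delhi": "Asia",
    "Los Angeles": "North America",
}


def continents_mentioned(abstract):
    """Return the set of continents whose listed cities appear in the abstract."""
    found = set()
    for city, continent in CITY_TO_CONTINENT.items():
        # Whole-word match so a city name inside a longer word is not counted.
        if re.search(r"\b" + re.escape(city) + r"\b", abstract):
            found.add(continent)
    return found


def count_by_continent(abstracts):
    """Count how many abstracts mention at least one listed city per continent."""
    counts = Counter()
    for abstract in abstracts:
        # Each abstract contributes at most 1 to each continent.
        counts.update(continents_mentioned(abstract))
    return counts


# Made-up abstracts, for illustration only.
abstracts = [
    "PM2.5 exposure and asthma admissions in Barcelona and Paris.",
    "Traffic-related air pollution in Beijing during winter.",
    "A comparison of low-cost particulate sensors.",
]
counts = count_by_continent(abstracts)
print(f"Out of {len(abstracts)} abstracts, {counts['Europe']} mention a European city.")
```

Counting each abstract once per continent (rather than once per city mention) is a design choice that matches the "Y studies were performed in Europe" phrasing.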

I've started to look at Named Entity Recognition and text analysis (nothing I've put on GitHub yet). I tend to think one couldn't be 100% sure the automatic "location analysis" is right, but I'm sure one could get a good "power".

sckott commented 8 years ago

@masalmon you working on this? be interested to chat about this

maelle commented 8 years ago

Nope, would you have been interested?

The code I'd written a while ago used this strategy:

It could get better, e.g.:

sckott commented 8 years ago

sorry! missed the notification for your message until now. I'm definitely interested in helping with this - let me know if you have time/want to work on this

mbacou commented 8 years ago

Just jumping here at random, but we've been using CLAVIN for the exact same purpose of geoparsing and geotagging scientific publications. CLAVIN parses documents against GeoNames. It works well as a good 1st-pass, manual corrections still necessary: https://clavin.bericotechnologies.com/

maelle commented 8 years ago

Oh this is very nice! Any suggestion on how we can use it from R? Is there an API for CLAVIN? I see CLAVIN does more than the Named Entity Recognition + GeoNames lookup I did in my draft code.

@sckott this tool + ropensci pkgs for accessing articles metadata & full text -> this could be fun.

I think misclassification is OK if it's small. @mbacou how high is it in your use cases? If one looks e.g. at the proportion of some countries in abstracts over time for large numbers of articles, and if we expect misclassification to be the same in all years, a trend could still be found, I guess.
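To make that last guess concrete, here is a small sketch in Python under simplifying assumptions of my own: the tagger finds a country with a fixed sensitivity when it is really there, and wrongly reports it at a fixed false-positive rate when it is not, with both rates identical across years. The numbers are invented. In that setting the expected observed share is a linear function of the true share, so a rising trend stays a rising trend, only flattened.

```python
def observed_share(true_share, sensitivity, false_positive_rate):
    """Expected share of abstracts tagged with a country, given constant error rates.

    observed = sensitivity * true + false_positive_rate * (1 - true)
             = false_positive_rate + (sensitivity - false_positive_rate) * true
    """
    return sensitivity * true_share + false_positive_rate * (1.0 - true_share)


# Invented true shares per year, for illustration only.
true_shares = {2010: 0.10, 2012: 0.15, 2014: 0.20, 2016: 0.25}

# Assumed error rates, the same in every year.
SENSITIVITY = 0.80
FALSE_POSITIVE_RATE = 0.05

observed = {
    year: observed_share(share, SENSITIVITY, FALSE_POSITIVE_RATE)
    for year, share in true_shares.items()
}

for year in sorted(observed):
    print(year, f"true={true_shares[year]:.2f}", f"observed={observed[year]:.4f}")
```

The slope is scaled by (sensitivity - false-positive rate), so the direction of the trend is kept as long as the tagger is better than chance. If the error rates drift over time, this argument no longer holds.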

maelle commented 8 years ago

@mbacou this is a very cool "random" comment!

mbacou commented 8 years ago

Misclassification rates are very context-specific. We've only used it on fewer than 100 documents (but large ones, over 100 pp. each). There are typically 2 categories of errors: 1) bad context (these often come in patterns), and 2) wrong geolocation (political units or place names assigned to the wrong countries). Still very useful overall. It can be installed as a local RESTful service.

maelle commented 8 years ago

Ok thank you very much!

I'll have a look at it in the next week -> for R users it'd be best to have it as an R package so that everything can be done from R.

maelle commented 8 years ago

The GitHub repo of CLAVIN-rest hasn't been active for a year.

maelle commented 8 years ago

But there's a webservice now! https://github.com/edwardcapriolo/clavin-aas

sckott commented 8 years ago

Definitely would need a web service for that since it looks like it's Java - which is not worth wrapping locally

maelle commented 8 years ago

@sckott let's see what the creator of the webservice repo says. I'm not surprised that someone made a tool for geolocating things in text, it opens so many possibilities once you can do this automatically!

sckott commented 8 years ago

sounds good, yeah, does sound very cool

maelle commented 8 years ago

@sckott I opened an issue here https://github.com/Berico-Technologies/CLAVIN-rest/issues/11 Do you have any Java experience?

sckott commented 8 years ago

nope, no Java experience. Our other devs in ropensci don't either, and there's no good way to use Java from R, so that's why I say we'd have to have a web API for this

maelle commented 8 years ago

Ok but OpenNLP seems to use Java code? But it's not optimal then? Too bad.

maelle commented 8 years ago

Oh I see it does so with Apache, now I got it!

sckott commented 8 years ago

Well, there is an rJava pkg - but it's endless headaches, hard to install, etc. C++ is a much better high-performance compiled language to interface with - I've wanted to have better integration with the Stanford parser, but it's a Java project, so I haven't worked on it :(

maelle commented 8 years ago

Ok so if there is no solution for CLAVIN, a less elaborate solution with OpenNLP and geonames will have to do for now.

sckott commented 8 years ago

a less elaborate solution with OpenNLP and geonames will have to do for now.

sounds good. Should we start a new repo, or do you have one already?

maelle commented 8 years ago

I don't have one so you can start one. For Named Entity Recognition I had this Jane Austen example: https://github.com/masalmon/janeausten/blob/master/analysis.R If I remember correctly, in GeoNames you sometimes get several countries for a place, so there were ambiguous results.
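One possible way to handle those ambiguous results, sketched in Python with invented data: keep all candidates a gazetteer returns for a name, prefer those in a country already suggested by other evidence (say, the authors' affiliation), and otherwise fall back to the most populous one. The `Candidate` class, `pick_candidate` function, and the population figures are hypothetical and are not GeoNames output; the heuristic itself is only one option and will be wrong for small places that share a name with a big city.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One gazetteer match for a place name (hypothetical structure)."""
    name: str
    country_code: str
    population: int


def pick_candidate(
    candidates: List[Candidate], preferred_country: Optional[str] = None
) -> Optional[Candidate]:
    """Choose one candidate: prefer the hinted country, then the largest population."""
    if not candidates:
        return None
    if preferred_country is not None:
        in_country = [c for c in candidates if c.country_code == preferred_country]
        if in_country:
            candidates = in_country
    return max(candidates, key=lambda c: c.population)


# Illustrative candidates for the name "Paris"; populations are rough placeholders.
paris_matches = [
    Candidate("Paris", "FR", 2_000_000),
    Candidate("Paris", "US", 25_000),
]

default_choice = pick_candidate(paris_matches)
hinted_choice = pick_candidate(paris_matches, preferred_country="US")
print(default_choice.country_code, hinted_choice.country_code)
```

Falling back to the full candidate list when the hinted country has no match avoids returning nothing just because the hint was off.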

maelle commented 8 years ago

(or I will start one but tomorrow)

mbojan commented 8 years ago

I am not 100% sure, but perhaps you will find http://cermine.ceon.pl/index.html relevant. Affiliations get extracted. This is a webservice, the source code is here https://github.com/CeON/CERMINE

sckott commented 8 years ago

Ah, openNLP is based on a Java lib as well. Do you know if @kbenoit's package https://github.com/kbenoit/quanteda can do the named entity recognition you talked about doing with openNLP?

sckott commented 8 years ago

@masalmon here's a repo, you can push and pull there https://github.com/ropensci/geolocart

kbenoit commented 8 years ago

@sckott Not yet on the NER but it should be part of quanteda by the month's end. I'm basing it on spaCy since I have experienced nothing but frustration in trying to get openNLP (and rJava specifically) to work on my Mac. (On Linux I can get it to work but only after a hack or two to the config.)

juliasilge commented 8 years ago

@kbenoit This makes me feel so much better because I have tried several times to get openNLP to work on my Mac with just no success whatsoever. Also, I have tried to get the Stanford coreNLP tools to work (Java as well), also without any success.

mbacou commented 8 years ago

Assuming CLAVIN-rest does work, including the Java service with a new R package should not be an issue (just place all Java resources under a ./inst or ./java/ folder in your R package and have the Java process start locally using .onLoad). Then you can use R curl or httr to communicate with the service. It does place a strong dependency on a JRE/JDK, but should not require messing with rJava... afaik
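A rough sketch of that pattern, written in Python for illustration (in an R package the start-up step would sit in .onLoad and the request would go through curl or httr). Everything specific here is an assumption: the jar path, the port, the endpoint URL, the plain-text request body, and the JSON response are placeholders, not the documented CLAVIN-rest interface.

```python
import json
import socket
import subprocess
import time
import urllib.request


def build_command(jar_path):
    """Command line used to launch the bundled service (hypothetical jar)."""
    return ["java", "-jar", jar_path]


def port_is_open(host, port, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_service(jar_path, host="127.0.0.1", port=9090, wait_seconds=30.0):
    """Start the Java process and wait until it listens on the assumed port."""
    process = subprocess.Popen(build_command(jar_path))
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return process
        if process.poll() is not None:
            raise RuntimeError("Java process exited before the service came up")
        time.sleep(0.5)
    process.terminate()
    raise TimeoutError("service did not start listening in time")


def geotag(text, url):
    """POST plain text to the service and decode a JSON reply (assumed format)."""
    request = urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example use (needs a JRE and a real jar; the names below are placeholders):
# service = start_service("path/to/service.jar")
# result = geotag("Air pollution in Paris.", "http://127.0.0.1:9090/hypothetical/geotag")
# service.terminate()
```

Polling the port before sending requests matters because a JVM can take several seconds to start; the caller also needs to terminate the process when done.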

maelle commented 8 years ago

Thanks @sckott !

I had no idea OpenNLP was so hard to work with! Thank you @juliasilge and @kbenoit for these reports! Later today I'll push code using OpenNLP but we can switch to quanteda later.

@mbacou thanks but if we do this it seems it'll be a pain for e.g. Mac users I guess.

At some point I'll need to compare the results of "our" solution with CLAVIN. I will do this on my Windows computer hehe.

maelle commented 8 years ago

@sckott I've put some things in the repo now. For this I only copy-pasted my old code (I had downloaded abstracts about PM2.5 exposure and tried to locate them but at that time I had not made a package, it was a messy loop, hehe). At the time I had got more questions than answers about how to locate articles so I'll be happy to discuss it!

sckott commented 8 years ago

great, thanks @masalmon

maelle commented 8 years ago

The (not very active yet) repo is https://github.com/ropenscilabs/geolocart. Closing this issue.