this is basically what the `digester` module is intended for. suggested rough approach (sketch below):

- `util.gullet` downloads these data sources, so we can quickly work with local copies
- `core.digester` processes/imports these data sources into a database for lookups. they should be processed in some standardized way (although RDF may be standardized enough)
- `util.gullet` checks for updates to these data sources
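a minimal sketch of that split, just to pin down the shape (the paths, file format, and function names here are all made up for illustration):

```python
import os
import urllib.request

DUMPS_DIR = "data/dumps"  # hypothetical local cache location

def fetch(url, name):
    """gullet: download a data source so we can work with a local copy."""
    os.makedirs(DUMPS_DIR, exist_ok=True)
    local = os.path.join(DUMPS_DIR, name)
    # a real gullet would also compare ETag/Last-Modified headers against
    # the remote to check for updates, rather than just file existence
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return local

def digest(path):
    """digester: import a downloaded dump into a lookup structure (here, a dict)."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # assume a simple line-oriented format; RDF dumps (N-Triples)
            # are similarly one statement per line
            subject, _, obj = line.rstrip("\n").partition("\t")
            lookup[subject] = obj
    return lookup
```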
Cool project. Have you considered using AlchemyAPI for entity extraction? They disambiguate to DBpedia/YAGO/Freebase. Also, if RDF is giving you too much of a headache, you can use the newly standardized JSON-LD format instead.
thanks for the tip! I've looked into AlchemyAPI but, at least for this prototype phase, I want to see how far I can get with open source technology.
thus far RDF has been pretty nice to work with. currently it looks like I'll be using DBpedia datasets with Apache's Jena, which makes it quite easy to load up datasets and interface with them via HTTP.
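for example, once Fuseki (Jena's SPARQL server) is serving a loaded dataset, querying it over HTTP is just a GET against the endpoint. a quick sketch (the host and dataset name are assumptions):

```python
import requests

# assumes a local Fuseki instance serving a dataset named "dbpedia";
# the SPARQL protocol returns JSON results when asked via the Accept header
ENDPOINT = "http://localhost:3030/dbpedia/query"

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Apache_Jena> rdfs:label ?label .
} LIMIT 5
"""

resp = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["label"]["value"])
```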
I hadn't heard of JSON-LD though, looks very interesting/nice...I'll keep an eye on it as it develops.
That makes sense, and AlchemyAPI can only do 1000 calls/day free, so it would start costing quite a bit down the road.
If you don't mind my asking, what kinds of entities are you gathering from the DBpedia datasets? Also I hadn't heard of Jena, it looks pretty sweet. It looks like there's a node module for it in development which is awesome.
And yea, JSON-LD is going to make it a lot easier to serve standardized data from API endpoints using the `@context`.
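For reference, a minimal JSON-LD document: the `@context` maps ordinary JSON keys onto shared vocabulary URIs, so a plain-looking API response doubles as linked data (the schema.org terms here are just one choice of vocabulary):

```json
{
  "@context": {
    "name": "http://schema.org/name",
    "homepage": { "@id": "http://schema.org/url", "@type": "@id" }
  },
  "@id": "http://dbpedia.org/resource/Apache_Jena",
  "name": "Apache Jena",
  "homepage": "https://jena.apache.org/"
}
```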
right now the DBpedia datasets I'm grabbing are `labels`, `short_abstracts`, `long_abstracts`, `images`, `redirects`, and `geo_coordinates`. `labels` for getting URIs by name and `redirects` for resolving common misspellings or alternate spellings, and then the others since I'm interested in getting summaries, images, and coordinates for things (to provide descriptions/images/plot them on maps)
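a rough sketch of the lookup those datasets enable, assuming they've been loaded into a local SPARQL endpoint (the endpoint URL is an assumption; the predicates correspond to the dump files named above):

```python
import requests

ENDPOINT = "http://localhost:3030/dbpedia/query"  # assumed local endpoint

def lookup(name):
    """Resolve a name to a canonical URI, then pull its summary data."""
    # note: no escaping of `name` here -- fine for a sketch, not for real input
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
    SELECT ?uri ?abstract ?img ?lat ?long WHERE {
      ?match rdfs:label "%s"@en .
      # follow a redirect if the label matched an alternate spelling
      OPTIONAL { ?match dbo:wikiPageRedirects ?target . }
      BIND(COALESCE(?target, ?match) AS ?uri)
      OPTIONAL { ?uri dbo:abstract ?abstract . FILTER(lang(?abstract) = "en") }
      OPTIONAL { ?uri dbo:thumbnail ?img . }
      OPTIONAL { ?uri geo:lat ?lat ; geo:long ?long . }
    } LIMIT 1
    """ % name
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    return resp.json()["results"]["bindings"]
```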
Sounds like we're trying to solve a lot of the same problems...I just released an alpha version of a Chrome extension I've been working on to give you one-click access to data about politicians. I'm pulling the `thumbnail`, `abstract`, `comment`, `website`, `opensecrets`, `bioguide`, and `votesmart` properties from DBpedia for each politician. I'll display the thumb/comment/website directly, but then use the identifiers to grab more data from APIs.
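In DBpedia terms, the display fields live under the mapped ontology while the identifiers come from the raw infobox properties, so a query for one politician might look roughly like this (the `dbp:` property names and the example resource are guesses, not verified against the dumps):

```python
import requests

POLITICIAN = "http://dbpedia.org/resource/Ron_Wyden"  # example resource

query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumb ?comment ?opensecrets ?bioguide ?votesmart WHERE {
  <%(uri)s> dbo:thumbnail ?thumb ;
            rdfs:comment  ?comment .
  FILTER(lang(?comment) = "en")
  # raw infobox identifiers, usable as keys into other APIs
  OPTIONAL { <%(uri)s> dbp:opensecrets ?opensecrets . }
  OPTIONAL { <%(uri)s> dbp:bioguide    ?bioguide . }
  OPTIONAL { <%(uri)s> dbp:votesmart   ?votesmart . }
}
""" % {"uri": POLITICIAN}

resp = requests.get(
    "http://dbpedia.org/sparql",  # DBpedia's public endpoint
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"])
```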
How does Argos actually connect entities in text to DBpedia though? That seems like the tricky bit. I'd love to not use AlchemyAPI anymore if possible. I just found a node.js Stanford NLP library, so I'm hopeful.
wow cool project! I'm using Stanford NER, which I think is included in the Stanford NLP library, with NLTK as a fallback/alternative. There might be an NLTK equivalent for node? Stanford NER has worked ok so far but not great, though I haven't toyed with it much yet. It has been better than entity recognition with NLTK though. I think in terms of quality, AlchemyAPI may still be the one to beat!
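roughly what that looks like through NLTK's Stanford NER wrapper, with NLTK's built-in chunker as the fallback (the classifier/jar paths are local assumptions, and the relevant NLTK data packages need to be downloaded first):

```python
import nltk
from nltk.tag import StanfordNERTagger

TEXT = "Senator Ron Wyden spoke in Portland on Tuesday."
tokens = nltk.word_tokenize(TEXT)

# Stanford NER via NLTK's wrapper; the paths point at wherever the
# stanford-ner distribution was unpacked (assumptions)
stanford = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner.jar",
)
print(stanford.tag(tokens))
# e.g. [('Senator', 'O'), ('Ron', 'PERSON'), ('Wyden', 'PERSON'), ...]

# fallback/alternative: NLTK's own named-entity chunker
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ("PERSON", "GPE", "ORGANIZATION")]
print(entities)  # e.g. ['Ron Wyden', 'Portland']
```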
for infos, we need to extract some summary/blurb and an image for a given entity and return it, using Wikipedia or the RDF/semantic web. candidate sources:
[DL] = downloadable
- Transparency Index http://cpi.transparency.org/cpi2011/results/
- Poligraft http://poligraft.com/
- Charity Navigator
- NewsTrust http://newstrust.net/
- Good Guide
- SunlightLabs' Open States http://openstates.org/
- Wikipedia [DL]
- Scholarpedia
- YAGO http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html [DL]
- Geonames http://download.geonames.org/export/dump/ [DL]
- US Census http://www.census.gov/developers/ [DL]
- Freebase https://developers.google.com/freebase/data [DL]
- OpenCyc http://www.cyc.com/platform/opencyc [DL]
- DBPedia http://wiki.dbpedia.org/Downloads39 [DL]
- Umbel? https://github.com/structureddynamics/umbel [DL]