this is basically what the `digester` module is intended for. suggested rough approach (sketch below):

- `util.gullet` downloads these data sources, so we can quickly work with local copies
- `core.digester` processes/imports these data sources into a database for lookups. they should be processed in some standardized way (although RDF may be standardized enough)
- `util.gullet` checks for updates to these data sources
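a minimal sketch of that split, just to pin down the shape (the paths, file format, and function names here are all made up for illustration):

```python
import os
import urllib.request

DUMPS_DIR = "data/dumps"  # hypothetical local cache location

def fetch(url, name):
    """gullet: download a data source so we can work with a local copy."""
    os.makedirs(DUMPS_DIR, exist_ok=True)
    local = os.path.join(DUMPS_DIR, name)
    # a real gullet would also compare ETag/Last-Modified headers against
    # the remote to check for updates, rather than just file existence
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return local

def digest(path):
    """digester: import a downloaded dump into a lookup structure (here, a dict)."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # assume a simple line-oriented format; RDF dumps (N-Triples)
            # are similarly one statement per line
            subject, _, obj = line.rstrip("\n").partition("\t")
            lookup[subject] = obj
    return lookup
```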
Cool project. Have you considered using AlchemyAPI for entity extraction? They disambiguate to DBpedia/YAGO/Freebase. Also, if RDF is giving you too much of a headache, you can use the newly standardized JSON-LD format instead.
thanks for the tip! I've looked into AlchemyAPI but, at least for this prototype phase, I want to see how far I can get with open source technology.
thus far RDF has been pretty nice to work with. currently it looks like I'll be using DBpedia datasets with Apache's Jena, which makes it quite easy to load up datasets and interface with them via HTTP.
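for example, once Fuseki (Jena's SPARQL server) is serving a loaded dataset, querying it over HTTP is just a GET against the endpoint. a quick sketch (the host and dataset name are assumptions):

```python
import requests

# assumes a local Fuseki instance serving a dataset named "dbpedia";
# the SPARQL protocol returns JSON results when asked via the Accept header
ENDPOINT = "http://localhost:3030/dbpedia/query"

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Apache_Jena> rdfs:label ?label .
} LIMIT 5
"""

resp = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["label"]["value"])
```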
I hadn't heard of JSON-LD though, looks very interesting/nice...I'll keep an eye on it as it develops.
That makes sense, and AlchemyAPI can only do 1000 calls/day free, so it would start costing quite a bit down the road.
If you don't mind my asking, what kinds of entities are you gathering from the DBpedia datasets? Also I hadn't heard of Jena, it looks pretty sweet. It looks like there's a node module for it in development which is awesome.
And yea, JSON-LD is going to make it a lot easier to serve standardized data from API endpoints using the `@context`.
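For reference, a minimal JSON-LD document: the `@context` maps ordinary JSON keys onto shared vocabulary URIs, so a plain-looking API response doubles as linked data (the schema.org terms here are just one choice of vocabulary):

```json
{
  "@context": {
    "name": "http://schema.org/name",
    "homepage": { "@id": "http://schema.org/url", "@type": "@id" }
  },
  "@id": "http://dbpedia.org/resource/Apache_Jena",
  "name": "Apache Jena",
  "homepage": "https://jena.apache.org/"
}
```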
right now the DBpedia datasets I'm grabbing are `labels`, `short_abstracts`, `long_abstracts`, `images`, `redirects`, and `geo_coordinates`. `labels` for getting URIs by name and `redirects` for resolving common misspellings or alternate spellings, and then the others since I'm interested in getting summaries, images, and coordinates for things (to provide descriptions/images/plot them on maps)
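a rough sketch of the lookup those datasets enable, assuming they've been loaded into a local SPARQL endpoint (the endpoint URL is an assumption; the predicates correspond to the dump files named above):

```python
import requests

ENDPOINT = "http://localhost:3030/dbpedia/query"  # assumed local endpoint

def lookup(name):
    """Resolve a name to a canonical URI, then pull its summary data."""
    # note: no escaping of `name` here -- fine for a sketch, not for real input
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
    SELECT ?uri ?abstract ?img ?lat ?long WHERE {
      ?match rdfs:label "%s"@en .
      # follow a redirect if the label matched an alternate spelling
      OPTIONAL { ?match dbo:wikiPageRedirects ?target . }
      BIND(COALESCE(?target, ?match) AS ?uri)
      OPTIONAL { ?uri dbo:abstract ?abstract . FILTER(lang(?abstract) = "en") }
      OPTIONAL { ?uri dbo:thumbnail ?img . }
      OPTIONAL { ?uri geo:lat ?lat ; geo:long ?long . }
    } LIMIT 1
    """ % name
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    return resp.json()["results"]["bindings"]
```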
Sounds like we're trying to solve a lot of the same problems...I just released an alpha version of a Chrome extension I've been working on to give you one-click access to data about politicians. I'm pulling the `thumbnail`, `abstract`, `comment`, `website`, `opensecrets`, `bioguide`, and `votesmart` properties from DBpedia for each politician. I'll display the thumb/comment/website directly, but then use the identifiers to grab more data from APIs.
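In DBpedia terms, the display fields live under the mapped ontology while the identifiers come from the raw infobox properties, so a query for one politician might look roughly like this (the `dbp:` property names and the example resource are guesses, not verified against the dumps):

```python
import requests

POLITICIAN = "http://dbpedia.org/resource/Ron_Wyden"  # example resource

query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumb ?comment ?opensecrets ?bioguide ?votesmart WHERE {
  <%(uri)s> dbo:thumbnail ?thumb ;
            rdfs:comment  ?comment .
  FILTER(lang(?comment) = "en")
  # raw infobox identifiers, usable as keys into other APIs
  OPTIONAL { <%(uri)s> dbp:opensecrets ?opensecrets . }
  OPTIONAL { <%(uri)s> dbp:bioguide    ?bioguide . }
  OPTIONAL { <%(uri)s> dbp:votesmart   ?votesmart . }
}
""" % {"uri": POLITICIAN}

resp = requests.get(
    "http://dbpedia.org/sparql",  # DBpedia's public endpoint
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"])
```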
How does Argos actually connect entities in text to DBpedia though? That seems like the tricky bit. I'd love to not use AlchemyAPI anymore if possible. I just found a node.js Stanford NLP library, so I'm hopeful.
wow cool project! I'm using Stanford NER, which I think is included in the Stanford NLP library, with NLTK as a fallback/alternative. There might be an NLTK equivalent for node? Stanford NER has worked ok so far but not great, though I haven't toyed with it much yet. It has been better than entity recognition with NLTK though. I think in terms of quality, AlchemyAPI may still be the one to beat!
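roughly what that looks like through NLTK's Stanford NER wrapper, with NLTK's built-in chunker as the fallback (the classifier/jar paths are local assumptions, and the relevant NLTK data packages need to be downloaded first):

```python
import nltk
from nltk.tag import StanfordNERTagger

TEXT = "Senator Ron Wyden spoke in Portland on Tuesday."
tokens = nltk.word_tokenize(TEXT)

# Stanford NER via NLTK's wrapper; the paths point at wherever the
# stanford-ner distribution was unpacked (assumptions)
stanford = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner.jar",
)
print(stanford.tag(tokens))
# e.g. [('Senator', 'O'), ('Ron', 'PERSON'), ('Wyden', 'PERSON'), ...]

# fallback/alternative: NLTK's own named-entity chunker
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ("PERSON", "GPE", "ORGANIZATION")]
print(entities)  # e.g. ['Ron Wyden', 'Portland']
```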
for infos, we need to extract some summary/blurb and an image for a given entity and return it, using Wikipedia or the RDF/semantic web. candidate sources:
[DL] = downloadable
- Transparency Index http://cpi.transparency.org/cpi2011/results/
- Poligraft http://poligraft.com/
- Charity Navigator
- NewsTrust http://newstrust.net/
- Good Guide
- SunlightLabs' Open States http://openstates.org/
- Wikipedia [DL]
- Scholarpedia
- YAGO http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html [DL]
- Geonames http://download.geonames.org/export/dump/ [DL]
- US Census http://www.census.gov/developers/ [DL]
- Freebase https://developers.google.com/freebase/data [DL]
- OpenCyc http://www.cyc.com/platform/opencyc [DL]
- DBPedia http://wiki.dbpedia.org/Downloads39 [DL]
- Umbel? https://github.com/structureddynamics/umbel [DL]