semanticize / semanticizer

Entity Linking for the masses
http://semanticize.uva.nl/
GNU General Public License v3.0

Get rid of wpm exploreArticle service dependency #14

Open graus opened 11 years ago

graus commented 11 years ago

Fetching the in- and outlinks of Wikipedia pages slows the semanticizer down a lot, since it HTTP-requests them from the exploreArticle service running on zookst13.

I suppose all the data this service serves from the CSV files should be in Redis (see http://wikipedia-miner.cms.waikato.ac.nz/services/?exploreArticle).
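Something like the following (untested) sketch could be the read side once the link data lives in a local Redis instance; the key layout (`article:<id>:inlinks` / `article:<id>:outlinks`) is only an illustration, not the semanticizer's actual schema.

```python
# Minimal sketch of the read side, assuming the link data has already been
# loaded into a local Redis instance. The key layout (article:<id>:inlinks,
# article:<id>:outlinks) is hypothetical, not the semanticizer's actual schema.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def get_article_links(article_id):
    """Read in- and outlinks from local Redis instead of HTTP-requesting exploreArticle."""
    inlinks = r.smembers("article:%s:inlinks" % article_id)
    outlinks = r.smembers("article:%s:outlinks" % article_id)
    return inlinks, outlinks
```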

IsaacHaze commented 11 years ago

Do it yourself, then =P

dodijk commented 11 years ago

The second part is a separate issue. Check why this is the case and fix it, or open a new issue, @graus.

graus commented 11 years ago

What about this instead of the in-memory Redis DB for article data: we supply a script that processes the Wikipedia Miner CSVs and stores them (in a smart way -- something to think about) in MongoDB? You could then either run the DB locally for speed, or we provide a hosted one (in the same spirit as the zookst13 exploreArticle thingy).
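A rough sketch of such a loader could look like the code below; the two-column `target_id,source_id` CSV layout and the database/collection names are assumptions for illustration and would need to be checked against the actual Wikipedia Miner dump format.

```python
# Rough sketch of the proposed loader. The "target_id,source_id" two-column CSV
# layout is an assumption for illustration; the real Wikipedia Miner dump
# format needs to be checked.
import csv
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
articles = client.semanticizer.articles  # hypothetical database/collection names

def load_inlinks(csv_path):
    """Read an inlink CSV and collect the source ids per target article in MongoDB."""
    with open(csv_path) as f:
        for target_id, source_id in csv.reader(f):
            articles.update_one(
                {"_id": int(target_id)},
                {"$addToSet": {"inlinks": int(source_id)}},
                upsert=True)
```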

nucflash commented 11 years ago

Redis stores data on disk and loads what is needed into memory. This is very similar to MongoDB's behavior, so switching seems to add work (changing the API) without adding much value. Do you need more than a key-value store? If the answer is yes, then MongoDB might be something to consider.

dodijk commented 11 years ago

I agree with nucflash. Perhaps there are simple ways of converting the BerkeleyDB of the Wikipedia-Miner webservice into something more useful?

evertlammerts commented 11 years ago

I've looked at this a little earlier and I think that's the way to go. It doesn't look simple though: wpminer generates the database through loads of abstract classes and interfaces, so it'll take a while to untangle that.


graus commented 11 years ago

@nucflash: Right, I didn't know the Redis DB was persistent. In that case, whether a key-value store suffices depends on what we want to store (Wikipedia article metadata).

Maybe it would be good to have the full 'exploreArticle' format in another Redis DB (next to the anchor-ID mappings), storing article IDs with all available metadata (inlinks, outlinks, categories, definition?). I'll look at it when I need it, and hope @IsaacHaze will need it before me ;-).
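A possible key layout for that second Redis DB could look like the sketch below, with the scalar fields in a hash and the link/category IDs in sets; key and field names are hypothetical, not a fixed schema.

```python
# Hypothetical key layout for the 'exploreArticle'-style metadata, kept in a
# second Redis database (db=1 here) next to the anchor-ID mappings. Key and
# field names are an illustration only.
import redis

meta = redis.StrictRedis(host="localhost", port=6379, db=1)

def store_article(article_id, definition, inlinks, outlinks, categories):
    """Store one article: scalar fields in a hash, link/category IDs in sets."""
    key = "article:%d" % article_id
    meta.hset(key, "definition", definition)
    if inlinks:
        meta.sadd(key + ":inlinks", *inlinks)
    if outlinks:
        meta.sadd(key + ":outlinks", *outlinks)
    if categories:
        meta.sadd(key + ":categories", *categories)
```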

dodijk commented 11 years ago

So: we can either change the Hadoop code to create a RedisDB or directly convert the BerkeleyDB to Redis.

I agree with @dgraus that we might want to have a separate DB for this, perhaps also take out the translation and just keep anchors and titles in the main DB.

I currently don't need this either. ;-)


graus commented 11 years ago

And why not parse the CSVs instead of converting the BerkeleyDB? I did that for in/outlinks before; it shouldn't be too hard for the other stuff...?

(PS I don't know what's in the BerkeleyDB)
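For illustration, parsing the link CSVs could look roughly like this; the `article_id,link_id1;link_id2;...` line layout is an assumption and should be checked against the real Wikipedia Miner CSVs.

```python
# Illustration only: assumes each CSV line looks like
# "article_id,link_id1;link_id2;...", which should be verified against the
# actual Wikipedia Miner CSVs.
import csv

def read_links(csv_path):
    """Yield (article_id, [link ids]) pairs from a link CSV."""
    with open(csv_path) as f:
        for row in csv.reader(f):
            article_id = int(row[0])
            raw = row[1] if len(row) > 1 else ""
            link_ids = [int(x) for x in raw.split(";") if x]
            yield article_id, link_ids
```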

dodijk commented 11 years ago

Because I don't think everything is in the CSVs. E.g. definitions: http://zookst13.science.uva.nl:8080/dutchsemcor/article?title=Los%20Angeles&definition=true.


dodijk commented 11 years ago

BTW, I tried opening the BerkeleyDB from Python, but ran into issues. I think this is because they use the Java Edition of BDB. See: http://www.oracle.com/technetwork/products/berkeleydb/overview/index-093405.html.

graus commented 10 years ago

Solution for now: store all the data we get from the Wikipedia Miner article service in the Redis DB.
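As an interim write-through cache, that could look roughly like the sketch below; the service URL and query parameters are placeholders, not the real endpoint.

```python
# Interim write-through cache sketch: responses from the article service are
# kept in Redis keyed by article id, so each article is only fetched over HTTP
# once. The service URL and query parameters below are placeholders.
import json
import redis
import requests

cache = redis.StrictRedis(host="localhost", port=6379, db=2)
SERVICE_URL = "http://example.org/services/exploreArticle"  # placeholder endpoint

def get_article(article_id):
    """Return article metadata from Redis if cached, otherwise fetch and cache it."""
    key = "wpm:article:%d" % article_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    data = requests.get(SERVICE_URL, params={"id": article_id}).json()
    cache.set(key, json.dumps(data))
    return data
```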

graus commented 10 years ago

I am going to start working on dumping in & outlinks to redis...
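A bulk-loading sketch for that, using a Redis pipeline to avoid one round trip per article; the key names and the shape of the `(article_id, [link ids])` input pairs are assumptions matching the parsing sketch above.

```python
# Bulk-loading sketch using a Redis pipeline to avoid one round trip per
# article. Key names and the shape of the (article_id, [link ids]) input are
# assumptions, matching the parsing sketch above.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=1)

def dump_links(pairs, direction, batch_size=10000):
    """Write (article_id, [link ids]) pairs into sets named article:<id>:<direction>."""
    pipe = r.pipeline(transaction=False)
    for i, (article_id, link_ids) in enumerate(pairs, 1):
        if link_ids:
            pipe.sadd("article:%d:%s" % (article_id, direction), *link_ids)
        if i % batch_size == 0:
            pipe.execute()  # flush periodically to keep memory use bounded
    pipe.execute()
```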