graus opened this issue 11 years ago
Just do it yourself =P
Second part is a separate issue. Check why this is the case and fix or open new issue, @graus.
What about, instead of the in-memory Redis DB for article data, supplying a script that processes Wikipedia-Miner CSVs and stores them (in a smart way; something to think about) in MongoDB? There would be the possibility to either run the DB locally for speed, or we provide a hosted one (in the same spirit as the zookst13 exploreArticle thingy).
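A minimal sketch of what such a conversion script could start with. Note that the three-column layout shown here (page id, title, page type) is an assumption for illustration, not the actual Wikipedia-Miner dump schema, which should be checked first:

```python
import csv
import io

def parse_page_rows(csv_text):
    """Parse a Wikipedia-Miner-style page CSV into a dict of records.

    The (page_id, title, page_type) column layout is assumed for
    illustration; the real dump format may differ.
    """
    records = {}
    for row in csv.reader(io.StringIO(csv_text)):
        page_id, title, page_type = row[0], row[1], row[2]
        records[int(page_id)] = {"title": title, "type": page_type}
    return records

# Two hypothetical rows, just to show the shape of the output.
sample = "12,Los Angeles,article\n34,California,article\n"
pages = parse_page_rows(sample)
```

The resulting records could then be bulk-inserted into MongoDB (or written to Redis) in a second pass.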
Redis stores data on disk and loads what is needed into memory. This is very similar to MongoDB's behavior, so switching seems to add work (changing the API) without adding much value. Do you need more than a key-value store? If the answer is yes, then MongoDB might be something to consider.
I agree with nucflash. Perhaps there are simple ways of converting the BerkeleyDB of the Wikipedia-Miner webservice into something more useful?
I've looked at this a little while earlier and I think that's the way to go. Doesn't look simple though - wpminer generates the database through loads of abstract classes and interfaces, it'll take a while to untangle that.
@nucflash: Right, I didn't know the RedisDB was persistent. In this case it depends on what we want to store (Wikipedia article metadata) whether a key-value store suffices.
Maybe it would be good to have the full 'exploreArticle' format in another Redis DB (next to the anchor-ID mappings), storing article IDs with all available metadata (inlinks, outlinks, categories, definition?). I'll look at it when I need it, and hope @IsaacHaze will need it before me ;-).
So: we can either change the Hadoop code to create a RedisDB or directly convert the BerkeleyDB to Redis.
I agree with @dgraus that we might want to have a separate DB for this, perhaps also take out the translation and just keep anchors and titles in the main DB.
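For reference, storing the full article metadata under one key per article could look roughly like this. `FakeRedis` is an in-memory stand-in so the sketch runs without a server, and the field names simply mirror the exploreArticle output mentioned above rather than any agreed schema:

```python
import json

class FakeRedis:
    """In-memory stand-in for a Redis client, for illustration only."""
    def __init__(self):
        self.store = {}
    def hset(self, key, field, value):
        self.store.setdefault(key, {})[field] = value
    def hget(self, key, field):
        return self.store.get(key, {}).get(field)

def store_article(db, article_id, metadata):
    """Store exploreArticle-style metadata under one hash per article.

    Link lists and categories are JSON-encoded, since Redis hash
    values are plain strings.
    """
    key = "article:%d" % article_id
    for field, value in metadata.items():
        if isinstance(value, (list, dict)):
            value = json.dumps(value)
        db.hset(key, field, value)

db = FakeRedis()
# The article id and metadata below are hypothetical.
store_article(db, 49728, {
    "title": "Los Angeles",
    "inlinks": [1, 2, 3],
    "outlinks": [4, 5],
    "categories": ["Cities in California"],
    "definition": "Los Angeles is a city in ...",
})
```

With a real redis-py client the same `hset` calls would work unchanged, which is the point of keeping the stand-in duck-typed.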
I currently don't need this either. ;-)
And why not parse the CSVs instead of converting the BerkeleyDB? I did that for in-/outlinks before; shouldn't be too hard for the other stuff...?
(PS I don't know what's in the BerkeleyDB)
Because I don't think everything is in the CSVs. E.g. definitions: http://zookst13.science.uva.nl:8080/dutchsemcor/article?title=Los%20Angeles&definition=true.
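For context, this is the kind of per-article HTTP request the semanticizer currently has to make to get a definition. Only the URL construction is shown here, using the zookst13 endpoint from the link above:

```python
from urllib.parse import urlencode

# Endpoint taken from the zookst13 example URL in this thread.
BASE = "http://zookst13.science.uva.nl:8080/dutchsemcor/article"

def definition_url(title):
    """Build the exploreArticle request URL that fetches a definition."""
    return BASE + "?" + urlencode({"title": title, "definition": "true"})

url = definition_url("Los Angeles")
```

Every definition lookup being a round-trip like this is exactly the latency that a local Redis copy would remove.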
BTW, I tried opening the BerkeleyDB from Python, but ran into issues. I think this is because they use the Java Edition of BDB. See: http://www.oracle.com/technetwork/products/berkeleydb/overview/index-093405.html.
Solution for now: store all data we get from the Wikipedia-Miner article service in the Redis DB, so:
I am going to start working on dumping in- and outlinks to Redis...
Fetching in- and outlinks of Wikipedia pages slows the semanticizer down a lot, since it requests them over HTTP from the exploreArticle service running on zookst13.
I suppose all data this service fetches from the CSV files should be in Redis (http://wikipedia-miner.cms.waikato.ac.nz/services/?exploreArticle).
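A rough sketch of the first step of that dump: aggregating a links CSV into per-article in- and outlink sets, which could then be written to Redis. The two-column (source id, target id) layout is an assumption about the CSV format:

```python
import csv
import io
from collections import defaultdict

def collect_links(csv_text):
    """Aggregate a (source_id, target_id) link CSV into per-article
    outlink and inlink sets.

    The two-column layout is assumed for illustration.
    """
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, target in csv.reader(io.StringIO(csv_text)):
        outlinks[int(source)].add(int(target))
        inlinks[int(target)].add(int(source))
    return outlinks, inlinks

# Three hypothetical link rows: 1->2, 1->3, 2->3.
sample = "1,2\n1,3\n2,3\n"
out, inc = collect_links(sample)
# These sets would then be pushed to Redis, e.g. one set per
# "outlinks:<id>" / "inlinks:<id>" key via SADD.
```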