semanticize / semanticizer

Entity Linking for the masses
http://semanticize.uva.nl/
GNU General Public License v3.0

Wpm module refactor #39

Closed bartsidee closed 10 years ago

bartsidee commented 10 years ago

The wpm module has been refactored to remove the dependency on the wiki-miner web service. Instead, the information is loaded and computed directly from the CSV files.

Changes:

Notes:

dodijk commented 10 years ago

Thanks, @bartsidee, this has been an issue for a while; see issue #14. @graus was looking into this, so I'll assign the pull request to him for now.

bartsidee commented 10 years ago

I think this also solves issue #35.

bartsidee commented 10 years ago

I added a few bug-fixes and performance improvements.

Note that, for now, I have disabled parsing of article definitions and fetching of the (in/out)link titles+relatedness, as they are not used by the current codebase. This has a few advantages:

This might slightly impact the information available when calling the semanticizer with the "article" param. The functions are still available and could be re-enabled (with no dependency on the wpm web service for that).

dodijk commented 10 years ago

Thanks, Bart. Making loading of the (in/out)link titles+relatedness optional makes sense, as we also do for translations. @dgraus was checking out the code just now: your master branch, not just this pull request.

bartsidee commented 10 years ago

I merged the branches for clarity. Unfortunately there are some duplicate commits, because graus had taken my master branch (instead of wpm-refactor) and I had selected the changes with cherry-pick.

Incidentally, the only difference is that in my master branch I had also written a new WSGI adapter that exposes the full service and is compatible with Apache.

larsmans commented 10 years ago

git rebase -i is your friend. Fire it up, "squash" the duplicate commits, then push -f the results to this branch.

graus commented 10 years ago

It seems (seemed) easier to me if you continued on the branch I created, @bartsidee, unless you still had a whole batch of commits lined up; but those comments from @larsmans are probably easier to address in the new branch than the other way around.

bartsidee commented 10 years ago

OK, cleaned up a bit; a bit of miscommunication in that respect, but we should be on the same page again now. It does look like the rebase caused @larsmans's comments to be erased, sorry about that. Fortunately I had already incorporated his feedback :)

I have no further changes. My last commit makes it possible to optionally add the currently missing article information:

graus commented 10 years ago

Super, thanks! In the background I'm filling a redis server for a final test; as soon as that is done we can merge it as far as I'm concerned! :+1:

graus commented 10 years ago

Since the last update I get two errors. First, the cleanup of the DB seems to be very thorough:

[110768] 08 Jan 11:43:25 - DB 0: 193201 keys (0 volatile) in 2097152 slots HT.
[110768] 08 Jan 11:43:25 - 1 clients connected (0 slaves), 60933384 bytes in use
[110768] 08 Jan 11:43:30 - Client closed connection
[110768] 08 Jan 11:43:30 - DB 0: 2 keys (0 volatile) in 8192 slots HT.
[110768] 08 Jan 11:43:30 - 0 clients connected (0 slaves), 5182856 bytes in use
[110768] 08 Jan 11:43:35 - DB 0: 2 keys (0 volatile) in 4 slots HT.
[110768] 08 Jan 11:43:35 - 0 clients connected (0 slaves), 5182856 bytes in use
[110768] 08 Jan 11:43:39 * 10000 changes in 60 seconds. Saving...
[110768] 08 Jan 11:43:39 * Background saving started by pid 27098
[27098] 08 Jan 11:43:39 * DB saved on disk
[110768] 08 Jan 11:43:39 * Background saving terminated with success
[110768] 08 Jan 11:43:40 - DB 0: 2 keys (0 volatile) in 4 slots HT.
[110768] 08 Jan 11:43:40 - 0 clients connected (0 slaves), 5182856 bytes in use

i.e. it goes from 193201 keys to 2 (the cleanup is done at "Client closed connection"). And for some reason I cannot run the server anymore:

Traceback (most recent call last):
  File "/datastore/dgraus/semanticizer/semanticizer/server/__main__.py", line 17, in <module>
    from .. import procpipeline
ValueError: Attempted relative import in non-package

But this didn't change, right? Could be me or my python...

bartsidee commented 10 years ago

Hi Graus,

It looks like you are trying to run the server as a non-package; did you try: python -m semanticizer.server

Regarding the redis import: the general idea now is that a dataset can be inserted dynamically, which prevents possible downtime of a production server.

All keys of a dataset are prefixed with a version number, e.g. "nl:1". When a new dataset is inserted, the server adds it as a new version, e.g. with prefix "nl:2". The semanticizer dynamically checks this version number, which is stored in the key "nl:db:version". At the end of an insert job, "nl:db:version" is incremented and the semanticizer automatically switches to the new dataset.

After that, a cleanup job is indeed started to remove the old dataset; it deletes all keys prefixed with the previous dataset version, e.g. "nl:1*". A full dataset can have a lot of keys: my NL dataset currently has 18297030 keys, taking 3.5 GB of redis memory. So it is correct that it purges quite a few keys.
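For clarity, a minimal sketch of this versioning scheme, assuming a redis-py client; the helper name and exact key layout are illustrative, not the actual load.py code:

import redis

db = redis.StrictRedis(decode_responses=True)

def insert_dataset(langcode, items):
    # Read the currently live version, e.g. "1"; None on the very first insert.
    oldversion = db.get(langcode + ":db:version")
    newversion = str(int(oldversion) + 1) if oldversion else "1"
    # Write the new dataset under the new version prefix, e.g. "nl:2:...".
    for key, value in items:
        db.set(langcode + ":" + newversion + ":" + key, value)
    # Bump the version key; the semanticizer switches over automatically.
    db.set(langcode + ":db:version", newversion)
    # Cleanup: purge all keys of the previous dataset, e.g. "nl:1*".
    if oldversion:
        for oldkey in db.keys(langcode + ":" + oldversion + "*"):
            db.delete(oldkey)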

I still need to fully test the versioning by running a second insert while a first dataset is already in redis. So it might be that you hit a bug, but I could not detect any based on the code.

Note that the redis config has some default settings that BGSAVE your db after a specific number of keys have changed. During a BGSAVE the complete instance is forked, temporarily taking up twice the amount of memory on the server, which in theory could lock up your redis service if the server does not have enough physical memory. This can easily be solved by adjusting the redis configuration.
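For instance, a sketch with redis-py (the redis 2.6 defaults are "save 900 1", "save 300 10" and "save 60 10000"): snapshotting could be disabled for the duration of a bulk insert and restored afterwards.

import redis

db = redis.StrictRedis()
db.config_set('save', '')                       # disable automatic BGSAVE
# ... run the bulk dbinsert here ...
db.config_set('save', '900 1 300 10 60 10000')  # restore the default triggers
db.bgsave()                                     # take one explicit snapshot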

graus commented 10 years ago

Thanks for the reply @bartsidee, I indeed run it with python -m semanticizer.server

And regarding redis: I see how it works now, so the number of keys is expected (but is it also expected that it reports ~5 MB in use? That was >3.5 GB before the cleanup, if I remember correctly).

Anyway, I can't test at the moment because of the weird error I get. I'll see if I can bug @dodijk about it :-).

dodijk commented 10 years ago

Fixed @graus's weird error. The new package structure does not like Flask's reloading. Solution: disable reloading in the config.
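For reference, a minimal sketch of what that amounts to at the Flask level (the semanticizer's own config key may differ): Werkzeug's reloader restarts the server by re-running the entry point as a plain script, which breaks the relative imports in the new package layout.

from flask import Flask

app = Flask(__name__)

if __name__ == '__main__':
    # use_reloader=False keeps Flask/Werkzeug from re-running the module as a
    # top-level script, which is what tripped the relative imports above.
    app.run(debug=True, use_reloader=False)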

bartsidee commented 10 years ago

Well, usually you should always have at least one dataset in redis (if you don't alter it outside the insert function). So during the first insert 3.5 GB, then during the second import temporarily at most 7 GB, but it should drop to 3.5 GB again once it has deleted the first dataset. 5 MB after an insert would be kind of strange. Was this the result of a dbinsert run?

graus commented 10 years ago

Yes, it happened both with a single dump ingest and with a double one (i.e. nl + en).

bartsidee commented 10 years ago

I have already performed two successful single inserts in two different environments, but maybe that was before 2fe4b3f. Which redis version do you use? (I'm on 2.6.)

Could you maybe verify the output of the variables below:

self.version
oldversion = self.db.get(self.ns.db_version())

/semanticizer/wpm/load.py#L82

oldkeys = self.db.keys(langcode+":"+oldversion+"*")

/semanticizer/wpm/load.py#L91

For a first run (no version value/key present), the deletion of keys does not apply and should be blocked by:

if oldversion:

and for a second run it should only query keys with the old version prefix.

It seems something strange is going on...

bartsidee commented 10 years ago

Found it: it was the version number, which was fetched incorrectly in the load script. As a result the version was always 0, and the newly inserted set would be deleted because both the old and the new version were "0".
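An illustration of the failure mode, hypothetical and simplified from the description above:

# The load script fetched the version incorrectly, so old and new were equal:
oldversion = "0"              # wrong: should have been the previous version
newversion = "0"              # new keys written under the "nl:0" prefix
# The cleanup then deletes db.keys("nl:0*"), wiping the freshly inserted set.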

Fixed in 2bf95a4

graus commented 10 years ago

Great, it works perfectly now with the redis db! Next up, in-memory storage ;-).

If I understand correctly, the in-memory implementation mimics a db, right? But there is no "dbinsert" equivalent? (At least in /dbinsert/main.py there is no check for "memory" as datasource.)

If I simply change the config file from "redis" to "memory", I get this error:

  File "/datastore/dgraus/epd-7.3-2/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/datastore/dgraus/epd-7.3-2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/datastore/dgraus/wpm-refactor/semanticizer/server/__main__.py", line 95, in <module>
    main()
  File "/datastore/dgraus/wpm-refactor/semanticizer/server/__main__.py", line 76, in main
    init_datasource(wpmlangs, settings)
  File "semanticizer/wpm/data.py", line 26, in init_datasource
    load_wpm_data(langconfig['source'], langcode, settings, **langconfig['initparams'])
  File "semanticizer/wpm/data.py", line 42, in load_wpm_data
    wpm_dumps[langcode] = WpmData(db, langcode)
  File "semanticizer/wpm/data.py", line 53, in __init__
    self.version = self.db.get(langcode+":version")
  File "semanticizer/wpm/db/inmemory.py", line 33, in get
    return self.cache[key]
KeyError: 'nl:version'

I assume there has to be something in the cache first, but how?

David

bartsidee commented 10 years ago

Yes, correct: the in-memory version mimics the database version.

Personally I do not really prefer it, as it does not scale in production: each daemon has to load the complete dataset into memory, whereas with redis (or another db) multiple instances of the code can access the same data.

Currently the dataset is "inserted" into memory when the service starts, which might take some time to complete, and you probably cannot fire requests at it before it is completed (open for future improvements to make this quicker). So it is correct that the memory db is absent from the dbinsert script.

You hit an error on the db version key, which is only set at the end of the load (insertion) process. Could it be that you fired a request before the service was fully loaded?
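As an aside on the KeyError itself: the traceback shows inmemory.py indexing self.cache[key] directly, which raises for a key that is only set once loading finishes, whereas redis-py's get() returns None in that case. A sketch of a dict-backed store that mimics the redis behaviour (illustrative, not the actual inmemory.py code):

class InMemoryDB(object):
    def __init__(self):
        self.cache = {}

    def get(self, key):
        # Return None for missing keys, like redis-py, instead of raising.
        return self.cache.get(key)

    def set(self, key, value):
        self.cache[key] = value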

larsmans commented 10 years ago

@bartsidee Any ballpark figures for the loading time and memory use?

bartsidee commented 10 years ago

I assume this will be faster than a redis insert, but I guess it could still take anywhere from 15 to 45 minutes depending on the system.

The redis dataset with the minimal-insert semanticizer config was around 3.4 GB for NL. But redis is quite efficient in its memory management, so Python memory use could well top that value due to the overhead of dicts and lists, and this is per instance.

To speed up the process, one could create a similar dbinsert for in-memory use that writes the complete dataset to a format that is much more efficient to read, e.g. cPickle, or perhaps better, optimised serialisation tools like msgpack: http://msgpack.org. This could improve loading time, but not directly the memory footprint.
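A minimal sketch of that idea with cPickle (hypothetical helper names; msgpack would follow the same pattern with msgpack.pack/msgpack.unpack):

import cPickle as pickle  # Python 2, matching the codebase

def dump_dataset(cache, path):
    # One-time conversion: serialize the fully built in-memory dict to disk.
    with open(path, 'wb') as f:
        pickle.dump(cache, f, pickle.HIGHEST_PROTOCOL)

def load_dataset(path):
    # At service startup, read the dataset back in a single pass.
    with open(path, 'rb') as f:
        return pickle.load(f)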

In the end, one could ask whether loading the dataset into Python memory would still be beneficial over an in-memory db like redis; with the latest code version and a redis setup, our response time is below 100 ms.

larsmans commented 10 years ago

OK. I've been looking at good in-memory string data structures recently (for another project), and I suppose storing large numbers of n-grams can be done efficiently using DA-tries or MARISA-tries. There are good Python implementations available. I'm not sure how fast they would be.
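For example, with the marisa-trie package (one of the implementations alluded to; a sketch, not tested against the semanticizer's data):

import marisa_trie  # pip install marisa-trie

# Build a static trie over the n-gram vocabulary (unicode keys).
trie = marisa_trie.Trie([u'new york', u'new york city', u'amsterdam'])

print(u'new york' in trie)              # exact membership: True
print(trie.prefixes(u'new york city'))  # keys that prefix the query:
                                        # [u'new york', u'new york city']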

IsaacHaze commented 10 years ago

15m -- 45m? Yes, I can concur that starting a semanticizer is just about the slowest part of my processing pipeline, but if I recall correctly it's more in the 2m -- 5m range for the Dutch Wikipedia (but that's a from-memory ballpark figure; I haven't timed it).

MARISA looks nice, but maybe DAWG will be a better match for the semanticizer's use case, with some of the n-grams having four or five copies in memory and serialized to disk (each with minor variations to handle accent/space/dash normalization).

larsmans commented 10 years ago

The reason I mentioned MARISA is that there's a good Python lib that implements it. If you know a good DAWG or DAFSA lib, keep me posted; I've been looking for one without any success.

IsaacHaze commented 10 years ago

Ehh... DAWG? From the same Mikhail Korobov that worked on/did MARISA?

The only reason I know about it is that it is linked from the MARISA page. DAWG lives at: https://github.com/kmike/DAWG

and has docs at: http://dawg.readthedocs.org/en/latest/

I don't know if it's any good though...

larsmans commented 10 years ago

IIRC, that lib needs to have the entire set of strings in memory before it builds an automaton out of them. I'm not sure, though, and if loading is not the problem, we can experiment with it.

(The same might be true for MARISA, though.)

larsmans commented 10 years ago

I was wrong; it can actually read from an iterable:

>>> from dawg import DAWG
>>> def gen():
...     yield 'foo'
...     yield 'bar'
...     
>>> d = DAWG(gen())