Closed bartsidee closed 10 years ago
Thanks, @bartsidee, has been an issue for a while, issue #14. @graus was looking into this, so I'll assign the pull request to him for now.
I think this also solves issue #35
I added a few bug-fixes and performance improvements.
Notice that for now I disabled parsing of article definitions and fetching the (in/out)link titles+relatedness as they are not used by the current codebase. This has a few advantages:
This might slightly impact the information available when calling the semanticizer with the "article" param. The functions are still available and could be re-enabled (with no dependency on the wpm web service for that).
Thanks, Bart. Making loading of the (in/out)link titles+relatedness optional makes sense, like we also do for translations. @dgraus was checking out the code just now, your master branch, not just this pull request.
I merged the branches for clarity; unfortunately there are some duplicate commits, because graus had taken my master branch (instead of wpm-refactor) and I had selected the changes with cherry-pick.
By the way, the only difference is that in my master branch I had also written a new WSGI adapter that exposes the full service and is compatible with Apache.
git rebase -i is your friend. Fire it up, "squash" the duplicate commits, then push -f the results to this branch.
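For reference, a scripted sketch of that squash in a throwaway repo. `git rebase -i HEAD~2` with the second commit marked `squash` is the interactive way; the soft reset below is a non-interactive equivalent (the commit messages and identity are made up for illustration):

```shell
#!/bin/sh
set -e

# Scratch repo so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"   # hypothetical identity
git config user.name "dev"

echo base > file.txt
git add file.txt
git commit -qm "initial commit"

# Two duplicate commits, as after a stray cherry-pick.
echo one >> file.txt
git add file.txt
git commit -qm "add feature"
echo two >> file.txt
git add file.txt
git commit -qm "add feature"

# Collapse the two duplicates into a single commit.
git reset -q --soft HEAD~2
git commit -qm "add feature (squashed)"

# In the real workflow you would now force-push:
#   git push -f origin <branch>
```

After the reset, the combined changes are still staged, so the new commit contains everything the two duplicates did.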
It seems (seemed) easier to me if you continued on the branch I created, @bartsidee, unless you still had a whole batch of commits lined up; but @larsmans's comments are probably easier to address in the new branch than the other way around.
OK, cleaned things up a bit; there was some miscommunication there, but we should be back on the same page now. It only looks like the rebase wiped @larsmans's comments, sorry about that. Fortunately I had already incorporated his feedback :)
I have no further changes. My latest commit makes it possible to optionally add the currently missing article information.
Great, thanks! In the background I'm filling a redis server for one last test; as soon as that's done we can merge it as far as I'm concerned! :+1:
Since the last update I get two errors; first, the cleanup of the DB seems to be very thorough:
[110768] 08 Jan 11:43:25 - DB 0: 193201 keys (0 volatile) in 2097152 slots HT.
[110768] 08 Jan 11:43:25 - 1 clients connected (0 slaves), 60933384 bytes in use
[110768] 08 Jan 11:43:30 - Client closed connection
[110768] 08 Jan 11:43:30 - DB 0: 2 keys (0 volatile) in 8192 slots HT.
[110768] 08 Jan 11:43:30 - 0 clients connected (0 slaves), 5182856 bytes in use
[110768] 08 Jan 11:43:35 - DB 0: 2 keys (0 volatile) in 4 slots HT.
[110768] 08 Jan 11:43:35 - 0 clients connected (0 slaves), 5182856 bytes in use
[110768] 08 Jan 11:43:39 * 10000 changes in 60 seconds. Saving...
[110768] 08 Jan 11:43:39 * Background saving started by pid 27098
[27098] 08 Jan 11:43:39 * DB saved on disk
[110768] 08 Jan 11:43:39 * Background saving terminated with success
[110768] 08 Jan 11:43:40 - DB 0: 2 keys (0 volatile) in 4 slots HT.
[110768] 08 Jan 11:43:40 - 0 clients connected (0 slaves), 5182856 bytes in use
i.e. it goes from 193201 keys to 2 (the cleanup happens at "Client closed connection"). And for some reason I cannot run the server anymore:
Traceback (most recent call last):
File "/datastore/dgraus/semanticizer/semanticizer/server/__main__.py", line 17, in <module>
from .. import procpipeline
ValueError: Attempted relative import in non-package
But this didn't change, right? Could be me or my python...
Hi Graus,
It looks like you are trying to run the server as a non-package; did you try: python -m semanticizer.server
Regarding the redis import: the general idea now is that an insert (by which I mean a dataset) can be imported dynamically, which prevents possible downtime of a production server.
So all keys of a dataset are prefixed with a version number, e.g. "nl:1". When a new dataset is inserted, the server adds it as a new version, e.g. with prefix "nl:2". The semanticizer dynamically checks this version number, which is stored in the key "nl:db:version". At the end of an insert job, nl:db:version is incremented and the semanticizer automatically switches to the new dataset.
After that a cleanup job is indeed started to remove the old dataset; it deletes all keys prefixed with the previous dataset version, e.g. "nl:1*". A full dataset can have a lot of keys: my NL dataset currently has 18297030 keys taking 3.5 GB of redis memory. So it is correct that it purges quite some keys.
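The versioned-key scheme can be sketched as follows, with a plain dict standing in for the redis client (real code would use redis-py's get/set on the same keys; the function names and the exact `nl:<version>:<key>` layout are assumptions for illustration, only "nl:db:version" is taken from the description above):

```python
# Sketch of the versioned-dataset scheme: writers load under a new
# version prefix, then flip "nl:db:version" so readers switch over.

def insert_dataset(db, langcode, rows):
    """Load a dataset under the next version prefix, then switch readers."""
    version_key = "%s:db:version" % langcode            # e.g. "nl:db:version"
    new = int(db.get(version_key, 0)) + 1
    for key, value in rows.items():
        db["%s:%d:%s" % (langcode, new, key)] = value   # e.g. "nl:2:title"
    db[version_key] = new  # bumping this switches the semanticizer over

def lookup(db, langcode, key):
    """Readers resolve the current version on every access."""
    version = db["%s:db:version" % langcode]
    return db["%s:%d:%s" % (langcode, version, key)]
```

Because readers re-check "nl:db:version" on every access, the old dataset keeps serving requests until the version key is bumped, and the separate cleanup job can purge the old prefix afterwards without downtime.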
I still need to fully test the versioning by running a second insert while a first dataset is already in redis. So it might be that you hit a bug, but I could not detect any based on the code.
Note that the redis config has some default settings that BGSAVE your db after a specific number of keys have changed. During a BGSAVE the complete instance is forked, temporarily taking up twice the amount of memory on the server, which in theory could lock up your redis service if the server does not have enough physical memory. This can easily be solved by adjusting the redis configuration.
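For what it's worth, the "10000 changes in 60 seconds. Saving..." line in the log matches redis's stock snapshot rules in redis.conf (these are redis defaults, not project settings):

```
# redis.conf snapshot triggers: "save <seconds> <changes>"
save 900 1        # after 900 s if at least 1 key changed
save 300 10       # after 300 s if at least 10 keys changed
save 60 10000     # this rule fires during a big insert, as in the log

# To disable background snapshots entirely, use an empty save rule:
# save ""
```

Relaxing or disabling these rules avoids the fork-time memory spike during large inserts.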
Thanks for the reply @bartsidee, I indeed run it with python -m semanticizer.server
And regarding redis: I see how it works now, so the number of keys is expected (but is it also expected that it reports ~5 MB in use? That was >3.5 GB before the cleanup too, if I remember correctly).
Anyway, I can't test currently because of the weird error I get. I'll see if I can bug @dodijk about it :-).
Fixed @graus's weird error. New package structure does not like Flask's reloading. Solution: disable reload in config.
Well, usually you should always have at least one dataset in redis (if you don't alter it outside the insert function), so during the first insert 3.5 GB, then during the second import temporarily up to 7 GB, but it should drop to 3.5 GB again once it has deleted the first dataset. 5 MB after an insert would be kind of strange. Was this the result of a dbinsert run?
Yes, happened both with a single dump ingest and with a double (i.e. nl + en)
I already performed two successful single inserts in two different environments, but maybe that was before 2fe4b3f. Which redis version do you use? (I'm on 2.6.)
Could you maybe verify the output of the below variables:
self.version
oldversion = self.db.get(self.ns.db_version())
/semanticizer/wpm/load.py#L82
oldkeys = self.db.keys(langcode+":"+oldversion+"*")
/semanticizer/wpm/load.py#L91
As for a first run (no version value/key present) the deletion of keys does not apply and should be blocked by:
if oldversion:
and for a second run it should only query keys with the old version prefix.
It seems something strange is going on....
Found it: it was the version number, which was fetched incorrectly in the load script. As a result the version was always 0 and the newly inserted set would be deleted, since both the old and new version were "0".
Fixed in 2bf95a4
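The fix boils down to reading the old version before bumping it, and only purging when the versions actually differ. A hypothetical sketch (the dict stands in for redis; the function name is made up, only the key names follow the thread):

```python
def finish_insert(db, langcode, newversion):
    """End-of-insert step: fetch the previous version, switch, then purge.

    The bug described above was that the old version was fetched
    incorrectly (always 0), so old == new and the freshly inserted
    dataset was deleted. The guard below only purges when they differ.
    """
    version_key = "%s:db:version" % langcode
    oldversion = db.get(version_key)     # None on a first run: nothing to purge
    db[version_key] = newversion         # switch readers to the new dataset

    if oldversion is not None and oldversion != newversion:
        prefix = "%s:%s:" % (langcode, oldversion)
        for key in [k for k in db if k.startswith(prefix)]:
            del db[key]                  # purge e.g. "nl:1:*"
```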
Great, it works perfect now with the redis db! Next up, in-memory storage ;-).
If I understand correctly, the in-memory implementation mimics a db, right? But there is no "dbinsert" equivalent? (at least in /dbinsert/main.py there's no check for "memory" as datasource).
If I simply change the config file from "redis" to "memory", I get this error:
File "/datastore/dgraus/epd-7.3-2/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/datastore/dgraus/epd-7.3-2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/datastore/dgraus/wpm-refactor/semanticizer/server/__main__.py", line 95, in <module>
main()
File "/datastore/dgraus/wpm-refactor/semanticizer/server/__main__.py", line 76, in main
init_datasource(wpmlangs, settings)
File "semanticizer/wpm/data.py", line 26, in init_datasource
load_wpm_data(langconfig['source'], langcode, settings, **langconfig['initparams'])
File "semanticizer/wpm/data.py", line 42, in load_wpm_data
wpm_dumps[langcode] = WpmData(db, langcode)
File "semanticizer/wpm/data.py", line 53, in __init__
self.version = self.db.get(langcode+":version")
File "semanticizer/wpm/db/inmemory.py", line 33, in get
return self.cache[key]
KeyError: 'nl:version'
I assume there has to be something in the cache first, but how?
David
Yes, correct: the in-memory version mimics the database version.
Personally I don't quite prefer it, as it does not scale in production: each daemon has to load the complete dataset into memory, whereas with redis (or another db) multiple instances of the code can access the same data.
Currently the dataset is "inserted" into memory when the service starts, which might take some time to complete, and you probably cannot fire requests at it before that is done (open for future improvements to do this quicker), so it is correct that the memory db is absent from the dbinsert script.
You hit an error on the db version key, which is only set at the end of the load (insertion) process. Could it be that you fired a request before the service was fully loaded?
@bartsidee Any ballpark figures for the loading time and memory use?
I assume this will be faster than a redis insert, but I guess it could still take anywhere from 15 to 45 minutes depending on the system.
The redis dataset with the minimal-insert semanticizer config was around 3.4 GB for NL. But redis is quite efficient in memory management, so python memory could well top that value due to the overhead from classes (dict/list), and this is per instance.
To speed up the process one could create a similar dbinsert for the in-memory backend that writes the complete dataset to a format which is much more efficient to read, e.g. cPickle, or better still, optimized serialization tools like msgpack: http://msgpack.org. This would improve loading time, but not directly the memory footprint.
In the end one could ask whether loading the dataset into python memory would still be beneficial over a memory db like redis; e.g. with the latest code version and a redis setup, our response time is below 100 ms.
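As a rough sketch of that idea using the stdlib pickle module (msgpack's packb/unpackb would be a drop-in alternative for the encoding step; the file name and miniature dataset are made up for illustration):

```python
import os
import pickle
import tempfile

# Sketch: pre-serialize the dataset once at dbinsert time, then load it
# in a single read at server start instead of re-parsing the CSV dumps.

def dump_dataset(dataset, path):
    """Write the whole dataset dict to disk in one binary blob."""
    with open(path, "wb") as f:
        pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_dataset(path):
    """Read the dataset back in a single pass."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical miniature dataset, keyed like the redis layout.
dataset = {"nl:db:version": 2, "nl:2:title": "Amsterdam"}
path = os.path.join(tempfile.mkdtemp(), "nl.pickle")
dump_dataset(dataset, path)
restored = load_dataset(path)
```

This only shortens startup time; the in-memory footprint after loading stays the same, as noted above.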
Ok. I've been looking at good in-memory string data structures recently (for another project) and I suppose storing large numbers of n-grams can be done efficiently using DAtries or MARISA-tries. There are good implementations for Python available. I'm not sure how fast they would be.
15m -- 45m? Yes, I can concur that starting a semanticizer is just about the slowest part of my processing pipeline, but if I recall correctly it's more in the 2m -- 5m range for the Dutch Wikipedia (but that's a from-memory ballpark figure; I haven't timed it).
MARISA looks nice, but maybe DAWG will be a better match for the semanticizer's use case, with some of the n-grams having four or five copies in memory and serialized to disk (each with minor variations to handle accent/space/dash normalization).
The reason I mentioned MARISA is that there's a good Python lib that implements it. If you know a good DAWG or DAFSA lib, keep me posted, I've been looking for one without any success.
Ehh... DAWG? From the same Mikhail Korobov that worked on/did MARISA?
The only reason i know about it is because it is linked to from the MARISA page. DAWG lives at: https://github.com/kmike/DAWG
and has docs at: http://dawg.readthedocs.org/en/latest/
I don't know if it's any good though...
IIRC, that lib needs to have the entire set of strings in memory before it builds an automaton out of them. I'm not sure, though, and if loading is not the problem, we can experiment with it.
(The same might be true for MARISA, though.)
I was wrong, it can actually read from an iterable:
>>> from dawg import DAWG
>>> def gen():
... yield 'foo'
... yield 'bar'
...
>>> d = DAWG(gen())
The wpm module has been refactored to remove the dependency on the wiki-miner web service. Instead the information is directly loaded and calculated from the csv files.
Changes:
Notes: