Open lucasb-eyer opened 12 years ago
Oops, why did I write everything in English? :D
We can happily write everything in English, in case we make the repo public later or want to work with others.
"I don't remember the file format name we saw at the conference,"
HDF5; the Python library is h5py.
"Though I'd argue that it's not that bad: load the dict into memory once (wait 2min)"
I thought it doesn't fit in memory on a regular machine.
I can't load the product catalog:
catalog = load_product_catalog(fname="/home/mane/kaggle/cprod1/TrainingSet/products.json")
Traceback (most recent call last):
  File "/home/mane/virt_env/scikit-learn/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2538, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-1-d2d6213f21bf>", line 1, in <module>
    catalog = load_product_catalog(fname="/home/mane/kaggle/cprod1/TrainingSet/products.json")
  File "/home/mane/git/kaggle/cprod1_dataloading.py", line 8, in load_product_catalog
    return json.loads(open(fname, 'r').read())['Product']
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError
You proposed to change the file format of the data.
I just tried pickling the products db. I thought it would be fastest, but it isn't:
against json:
against dd (max possible raw speed):
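A comparison along these lines could be reproduced roughly as follows. This is only a minimal sketch, assuming Python 3 and a small synthetic dict as a stand-in for the products db (the real schema is not assumed). `make_fake_catalog` and `time_load` are hypothetical helpers, and the numbers will depend heavily on the machine and on the shape of the data.

```python
import json
import os
import pickle
import tempfile
import time


def make_fake_catalog(n):
    # Synthetic stand-in for the products db; not the real schema.
    return {"id%d" % i: "product name %d" % i for i in range(n)}


def time_load(path, loader, mode):
    # Load `path` once with `loader` and return (seconds, loaded object).
    start = time.perf_counter()
    with open(path, mode) as f:
        obj = loader(f)
    return time.perf_counter() - start, obj


catalog = make_fake_catalog(100000)
tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "products.json")
pickle_path = os.path.join(tmpdir, "products.pkl")

# Write the same data in both formats.
with open(json_path, "w") as f:
    json.dump(catalog, f)
with open(pickle_path, "wb") as f:
    pickle.dump(catalog, f, protocol=pickle.HIGHEST_PROTOCOL)

json_secs, from_json = time_load(json_path, json.load, "r")
pickle_secs, from_pickle = time_load(pickle_path, pickle.load, "rb")

print("json:   %.3fs" % json_secs)
print("pickle: %.3fs" % pickle_secs)
```

A single run like this is noisy (disk cache, other processes), so repeating each load a few times and taking the minimum gives a fairer picture.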
I don't remember the file format name we saw at the conference; we might try that one. Though I'd argue that it's not that bad: load the dict into memory once (wait 2min) and use it until you reboot the machine, instead of spending ages optimizing the load time.
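The "load once, then reuse" idea can be sketched like this. It is a minimal sketch, assuming Python 3 and that the file has a top-level `Product` key as in the loader shown in the traceback. `load_product_catalog_cached` is a hypothetical name, not the function in the repo, and the cache only lives as long as the Python session.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_product_catalog_cached(fname):
    # Parse the file only on the first call for a given fname;
    # later calls in the same session return the same in-memory object.
    with open(fname, "r") as f:
        return json.load(f)["Product"]
```

Calling it twice with the same path pays the parsing cost once. Since the cached object is shared, callers should treat it as read-only.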