ibayer / kaggle_old


File format #1

Open lucasb-eyer opened 12 years ago

lucasb-eyer commented 12 years ago

You proposed to change the file format of the data.

I just tried pickling the products db. I thought it would be fastest, but it isn't:

In [2]: %time prods = cPickle.load(open('/hpcwork/lb380118/TrainingSet/products.pickle', 'r'))
CPU times: user 308.80 s, sys: 14.52 s, total: 323.32 s
Wall time: 323.49 s

against json:

In [4]: %time prods = viz.load_product_catalog('/hpcwork/lb380118/TrainingSet/products.json')
CPU times: user 118.28 s, sys: 8.78 s, total: 127.06 s
Wall time: 127.68 s

against dd (max possible raw speed):

$ dd if=/hpcwork/lb380118/TrainingSet/products.json of=/dev/null
3721307+1 records in
3721307+1 records out
1905309345 bytes (1.9 GB) copied, 11.0042 s, 173 MB/s

I don't remember the file format name we saw at the conference; we might try that one. Though I'd argue that it's not that bad: load the dict into memory once (wait 2min) and use it until you reboot the machine, instead of spending ages optimizing the load time.
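For anyone who wants to repeat the comparison on a small scale, here is a minimal sketch. It assumes nothing about the real catalog: the `catalog` dict below is made-up stand-in data and `time_load` is a hypothetical helper. It also tries two pickle protocols, since Python 2's `cPickle` defaults to the text-based protocol 0, which is usually slower to load than the binary ones. Whether that explains the numbers above is unverified, because the dump call isn't shown.

```python
import json
import pickle  # the session above used Python 2's cPickle; pickle is the Python 3 equivalent
import time


def time_load(label, dumps, loads, obj):
    """Serialize obj once, then time only the deserialization step."""
    blob = dumps(obj)
    start = time.perf_counter()
    restored = loads(blob)
    elapsed = time.perf_counter() - start
    print("%-18s %9d bytes  %.4f s" % (label, len(blob), elapsed))
    return restored


# Made-up stand-in for the catalog; the real record layout is not shown in this thread.
catalog = {"prod%05d" % i: {"name": "item %d" % i, "price": i * 0.5}
           for i in range(20000)}

# Each entry: (label, serializer, deserializer).
candidates = [
    ("pickle protocol 0", lambda o: pickle.dumps(o, protocol=0), pickle.loads),
    ("pickle protocol 2", lambda o: pickle.dumps(o, protocol=2), pickle.loads),
    ("json", json.dumps, json.loads),
]

# Round-trip the same object through every format and keep the results for checking.
restored = {label: time_load(label, dumps, loads, catalog)
            for label, dumps, loads in candidates}
```

The timings depend heavily on the data and the Python version, so treat the output as a rough indication only.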

lucasb-eyer commented 12 years ago

Oops, why did I write everything in English? :D

ibayer commented 12 years ago

We can happily write everything in English, in case we want to make the repo public later or work together with others.

ibayer commented 12 years ago

I don't remember the file format name we saw at the conference,

HDF5; the Python library is h5py.

Though I'd argue that it's not that bad: load the dict into memory once (wait 2min)

I thought it wouldn't fit in memory on a regular machine.

ibayer commented 12 years ago

I can't load the product catalog:

catalog = load_product_catalog(fname="/home/mane/kaggle/cprod1/TrainingSet/products.json")
Traceback (most recent call last):
  File "/home/mane/virt_env/scikit-learn/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2538, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-1-d2d6213f21bf>", line 1, in <module>
    catalog = load_product_catalog(fname="/home/mane/kaggle/cprod1/TrainingSet/products.json")
  File "/home/mane/git/kaggle/cprod1_dataloading.py", line 8, in load_product_catalog
    return json.loads(open(fname, 'r').read())['Product']
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError
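
Judging from the traceback, `load_product_catalog` reads the whole file into one string and then parses it, so the 1.9 GB of text and the parsed objects have to fit in RAM at the same time. One possible workaround, sketched below under assumptions: on a machine where the load succeeds, rewrite the catalog once as one JSON record per line, then iterate over that file lazily elsewhere. This assumes the value under `'Product'` is a dict keyed by product ID, which this thread does not confirm. `write_json_lines` and `iter_json_lines` are hypothetical helper names, and the demo data is made up.

```python
import json
import os
import tempfile


def write_json_lines(catalog, out_path):
    """Write one [product_id, record] pair per line (run where the full load works)."""
    with open(out_path, "w") as out:
        for product_id, record in catalog.items():
            out.write(json.dumps([product_id, record]) + "\n")


def iter_json_lines(path):
    """Yield (product_id, record) pairs while holding only one line in memory."""
    with open(path) as f:
        for line in f:
            product_id, record = json.loads(line)
            yield product_id, record


# Tiny made-up catalog to show the round trip; the real records are far larger.
demo_catalog = {
    "p1": {"name": "first item"},
    "p2": {"name": "second item"},
}
demo_path = os.path.join(tempfile.mkdtemp(), "products.jsonl")
write_json_lines(demo_catalog, demo_path)

# dict(...) is only for the demo; on a small machine you would process the pairs one by one.
restored = dict(iter_json_lines(demo_path))
```

This only helps if the downstream code can work record by record. If it needs random access to the whole catalog, a disk-backed format such as the HDF5 option mentioned above is probably the better fit.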