kakao / buffalo

TOROS Buffalo: A fast and scalable production-ready open source project for recommender systems
Apache License 2.0

The test code `tests/data/test_mm.py` does not work. #80

Closed dkkim1005 closed 8 months ago

dkkim1005 commented 8 months ago

Bug

An OSError is raised when running the test code `tests/data/test_mm.py`. Five of the ten test cases fail, all with the same error.

$ nosetests ./data/test_mm.py -v

test0_get_default_option (data.test_mm.TestMatrixMarket) ... ok
test1_is_valid_option (data.test_mm.TestMatrixMarket) ... ok
test2_create (data.test_mm.TestMatrixMarket) ... [INFO    ] 2023-12-19 04:03:30 [mm.py:247] Create the database from matrix market file.
[DEBUG   ] 2023-12-19 04:03:30 [mm.py:252] Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[INFO    ] 2023-12-19 04:03:30 [base.py:179] File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
[ERROR   ] 2023-12-19 04:03:30 [mm.py:162] Cannot create db: Can't write data (no appropriate function for conversion path)
[ERROR   ] 2023-12-19 04:03:30 [mm.py:163] Traceback (most recent call last):
  File "/home/bc-user/.local/lib/python3.10/site-packages/buffalo/data/mm.py", line 141, in _create
    idmap["rows"][:] = np.loadtxt(fin, dtype=f"S{uid_max_col}")
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/bc-user/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 283, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 114, in h5py._proxy.dset_rw
OSError: Can't write data (no appropriate function for conversion path)

......(skip the middle lines)

MatrixMarketDataReader: DEBUG: creating temporary matrix-market data from numpy-kind array
MatrixMarket: INFO: Create the database from matrix market file.
MatrixMarket: DEBUG: Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
MatrixMarket: INFO: File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
MatrixMarket: ERROR: Cannot create db: Can't write data (no appropriate function for conversion path)
MatrixMarket: ERROR: Traceback (most recent call last):
  File "/home/bc-user/.local/lib/python3.10/site-packages/buffalo/data/mm.py", line 141, in _create
    idmap["rows"][:] = np.loadtxt(fin, dtype=f"S{uid_max_col}")
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/bc-user/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 283, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 114, in h5py._proxy.dset_rw
OSError: Can't write data (no appropriate function for conversion path)

[PROGRESS] 100.00% 0.0/0.0secs 1,137.96it/s

--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 10 tests in 0.041s

FAILED (errors=5)

The cause is a mismatch between the HDF5 dataset dtype and the dtype of the NumPy array being written, as the error log above shows. The current version creates the idmap datasets with "utf-8" encoding only, so the MatrixMarket object fails to write the user and item ID lists (the traceback shows `np.loadtxt` producing an `S`-dtype byte-string array for `idmap["rows"]`). Changing the encoding from "utf-8" to "ascii" is one possible fix. I tested it with the following local patch to buffalo/data/base.py:

# Method in the Data class (buffalo/data/base.py)
def _create_database(self, path, **kwargs):
    # ......
    # [ASIS] current code: fixed-length "utf-8" strings
    idmap.create_dataset("rows", (num_users,), dtype=h5py.string_dtype("utf-8", length=uid_max_col),
                         maxshape=(num_users,))
    idmap.create_dataset("cols", (num_items,), dtype=h5py.string_dtype("utf-8", length=iid_max_col),
                         maxshape=(num_items,))
    # ......
    # [TOBE] patched code: fixed-length "ascii" strings
    idmap.create_dataset("rows", (num_users,), dtype=h5py.string_dtype("ascii", length=uid_max_col),
                         maxshape=(num_users,))
    idmap.create_dataset("cols", (num_items,), dtype=h5py.string_dtype("ascii", length=iid_max_col),
                         maxshape=(num_items,))
    # ......

With this patch applied, all ten test cases pass:

test0_get_default_option (data.test_mm.TestMatrixMarket) ... ok
test1_is_valid_option (data.test_mm.TestMatrixMarket) ... ok
test2_create (data.test_mm.TestMatrixMarket) ...
[INFO    ] 2023-12-19 04:54:58 [mm.py:247] Create the database from matrix market file.
[DEBUG   ] 2023-12-19 04:54:58 [mm.py:252] Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[INFO    ] 2023-12-19 04:54:58 [base.py:179] File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
[PROGRESS] 100.00% 0.0/0.0secs 742.35it/s
[INFO    ] 2023-12-19 04:54:58 [mm.py:260] Creating working data...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[PROGRESS] 100.00% 0.0/0.0secs 168,937.24it/s
[DEBUG   ] 2023-12-19 04:54:58 [mm.py:264] Working data is created on /tmp/tmpr5a6iwrk
[INFO    ] 2023-12-19 04:54:58 [mm.py:265] Building data part...
[INFO    ] 2023-12-19 04:54:58 [base.py:417] Building compressed triplets for rowwise...
[INFO    ] 2023-12-19 04:54:58 [base.py:418] Preprocessing...
[INFO    ] 2023-12-19 04:54:58 [base.py:421] In-memory Compressing ...
[INFO    ] 2023-12-19 04:54:59 [base.py:301] Load triplet files. Total job files: 73
[INFO    ] 2023-12-19 04:54:59 [base.py:451] Finished
[INFO    ] 2023-12-19 04:54:59 [base.py:417] Building compressed triplets for colwise...
[INFO    ] 2023-12-19 04:54:59 [base.py:418] Preprocessing...
[INFO    ] 2023-12-19 04:54:59 [base.py:421] In-memory Compressing ...
[INFO    ] 2023-12-19 04:54:59 [base.py:301] Load triplet files. Total job files: 73
[INFO    ] 2023-12-19 04:54:59 [base.py:451] Finished
[INFO    ] 2023-12-19 04:54:59 [mm.py:279] DB built on ./mm.h5py
ok
......(skip the middle lines)
test3_list (data.test_mm.TestMatrixMarketReader) ... [DEBUG   ] 2023-12-19 04:55:01 [mm.py:70] creating temporary matrix-market data from numpy-kind array
ok

----------------------------------------------------------------------
Ran 10 tests in 3.166s

OK

However, this patch breaks w2v training (PR), which relies on "utf-8" characters to train on Korean words. One way to reconcile the two is to apply an appropriate encoding rule to each input path: one when loading a matrix-market file and one when loading a stream data file.
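
To make the conflict concrete, here is a minimal standard-library sketch (not buffalo code) of why an ASCII-only string field cannot hold Korean IDs. The sample IDs and the helper names `fits_ascii` and `utf8_width` are hypothetical and chosen only for illustration. It also shows that the UTF-8 byte length of an ID can exceed its character count, which matters for a fixed-length string field whose length is given in bytes; whether `uid_max_col` in buffalo counts bytes or characters is not checked here.

```python
# Hypothetical sample IDs; not taken from the buffalo test data.
ids = ["user_1", "한국어"]


def fits_ascii(text: str) -> bool:
    """Return True if `text` can be stored in an ASCII-only string field."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def utf8_width(values) -> int:
    """Smallest fixed byte length that holds every value once UTF-8 encoded."""
    return max(len(v.encode("utf-8")) for v in values)


ascii_ok = [fits_ascii(v) for v in ids]  # the Korean ID cannot be ASCII-encoded
char_width = max(len(v) for v in ids)    # longest ID measured in characters
byte_width = utf8_width(ids)             # longest ID measured in UTF-8 bytes

print(ascii_ok, char_width, byte_width)
```

Pure-ASCII IDs such as `"user_1"` produce identical bytes under both encodings, so storing everything as "utf-8" would not change how existing ASCII-only ID lists are represented.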

chiwanpark commented 8 months ago

@dkkim1005 We need to unify the type of string data to h5py.string_dtype("utf-8"). Could you send a PR fixing this bug?