plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License

Ignoring malformed vectors #25

Closed rangwani-harsh closed 6 years ago

rangwani-harsh commented 6 years ago

The current code throws the error given below when it encounters a malformed vector. Because the metadata is written into the database later, the partially built SQLite database can't be used to query the vectors already written to it. Wouldn't it be better to ignore the malformed vectors (with a warning message to let the user know) and try building the database anyway?


```
  File "/home/rajesh/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/rajesh/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rajesh/anaconda3/lib/python3.6/site-packages/pymagnitude/converter.py", line 509, in <module>
    approx=approx, approx_trees=approx_trees)
  File "/home/rajesh/anaconda3/lib/python3.6/site-packages/pymagnitude/converter.py", line 324, in convert
    for v in vector))
pysqlite2.dbapi2.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 301, and there are 166 supplied.
```

AjayP13 commented 6 years ago

Hi,

Thanks for the suggestion. Since the input file is malformed and this is not a bug in the library, I won't be fixing this. Skipping over places where errors are found adds unnecessary complexity to the code, and silently ignoring malformed inputs could make real bugs harder to find. A better warning message would be nice though, which I'll try to add in the future.

Since the converter is really only meant for people who have created their own models (since we pre-convert all popular models), I assume those are sufficiently advanced users such that they could figure out how to build the format file correctly or clean malformed files. I am more open to handling malformed / bad inputs in the actual main library.

FWIW, I'm sure there's some mix of a grep/awk/sed pipeline in bash that could clean your malformed file by counting the number of spaces on each line and filtering lines that don't match the right criteria.
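
The same filtering idea can be sketched in Python rather than as a shell pipeline. This is only a minimal sketch under assumptions: it assumes a whitespace-separated text format in which each line holds one token followed by a fixed number of float components, and `filter_well_formed` and its `dim` parameter are hypothetical names for illustration, not part of pymagnitude.

```python
import warnings
from typing import Iterable, Iterator


def filter_well_formed(lines: Iterable[str], dim: int) -> Iterator[str]:
    """Yield only lines that hold one token followed by exactly `dim` floats.

    Hypothetical helper (not part of pymagnitude). Assumes tokens contain
    no whitespace and components are separated by whitespace.
    """
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        # A well-formed line has the token plus `dim` vector components.
        if len(fields) != dim + 1:
            warnings.warn(
                f"line {line_no}: expected {dim + 1} fields, "
                f"got {len(fields)}; skipping"
            )
            continue
        # Every component after the token must parse as a float.
        try:
            for value in fields[1:]:
                float(value)
        except ValueError:
            warnings.warn(f"line {line_no}: non-numeric component; skipping")
            continue
        yield line


if __name__ == "__main__":
    sample = [
        "cat 0.1 0.2 0.3\n",
        "dog 0.4 0.5\n",       # too few components
        "bird 0.7 x 0.9\n",    # non-numeric component
        "fish 1 2 3\n",
    ]
    for kept in filter_well_formed(sample, dim=3):
        print(kept, end="")
```

Note that a format with a leading header line (for example a vocabulary count and dimension) would have that header dropped by this check, so it would need to be handled separately, as would tokens that contain spaces.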

rangwani-harsh commented 6 years ago

Thanks for the suggestion.