rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

str() causes .npy file header to fail regex #37

Closed 796F closed 1 year ago

796F commented 1 year ago

I'm using the attached npy embedding. This npy file has shape (768,); it was computed using CLIP ViT-L/14 and saved using the code below.

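  import numpy as np

  path = some_name  # output .npy path (placeholder, as in the original post)
  emb = np.frombuffer(vector_data, dtype='float16')  # 1-D array of shape (768,)
  np.save(path, emb)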

When I use embedding_reader to load this, the first file fails with a header parsing error.

python --version
Python 3.10.8

...
tracing your code using fsspec and reading the failing file manually
...

>>> f.seek(0)
0
>>> f
<fsspec.implementations.local.LocalFileOpener object at 0x1049fb940>
>>> f.size
1664
>>> isinstance(f.size, int)
True
>>> file_size = f.size
>>> file_size
1664
>>> first_line = f.read(min(file_size, 300)).split(b"\n")[0]
>>> first_line
b"\x93NUMPY\x01\x00v\x00{'descr': '<f2', 'fortran_order': False, 'shape': (768,), }                                                          "
>>> result = re.search(r"'shape': \(([0-9]+), ([0-9]+)\)", str(first_line))
>>> result
>>> str(first_line)
'b"\\x93NUMPY\\x01\\x00v\\x00{\'descr\': \'<f2\', \'fortran_order\': False, \'shape\': (768,), }

It seems that when the bytes are cast with str(), escape characters are added, which causes the header parsing to fail? Is this not the expected result?

This is causing autofaiss to break when building an index. I'm able to use autofaiss when loading the embeddings from memory, so it's probably not their issue?
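
A quick check suggests the backslashes before the quotes in the `str(first_line)` output are only how the REPL displays the string, and that the regex fails because it expects a two-dimensional shape. A minimal sketch, assuming hand-written header bytes modeled on the output above (the names `SHAPE_RE`, `header_1d`, and `header_2d` are made up for illustration):

```python
import re

# Regex copied from the session above; it only matches a two-dimensional shape.
SHAPE_RE = r"'shape': \(([0-9]+), ([0-9]+)\)"

# Hand-written header bytes for illustration, modeled on the output above.
header_1d = b"\x93NUMPY\x01\x00v\x00{'descr': '<f2', 'fortran_order': False, 'shape': (768,), }"
header_2d = b"\x93NUMPY\x01\x00v\x00{'descr': '<f2', 'fortran_order': False, 'shape': (1, 768), }"

# str() on bytes returns the repr text; the single quotes inside stay unescaped.
text_1d = str(header_1d)
text_2d = str(header_2d)

match_1d = re.search(SHAPE_RE, text_1d)  # no match: "(768,)" has no second dimension
match_2d = re.search(SHAPE_RE, text_2d)  # matches: "(1, 768)"

print("escaped quotes present:", "\\'" in text_1d)
print("1-D header matches:", match_1d is not None)
print("2-D header matches:", match_2d is not None)
```

If that holds, the `str()` call is not the culprit; the 1-D shape is.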

796F commented 1 year ago
pip list | grep embedding
embedding-reader         1.5.0

seems up to date as well.

796F commented 1 year ago

Or is it because the shape is unexpected, and it needs to be reshaped to (768, 1)?

rom1504 commented 1 year ago

The .npy file should contain a matrix of embeddings, not a single one.

E.g. (1000, 768) for 1000 items.

On Wed, Feb 15, 2023, 11:51 M:kë @.***> wrote:

or is it because the shape is unexpected? and needs to be reshaped to (768,1) ?

796F commented 1 year ago

Oh I see, so I should use np.hstack, then save?

796F commented 1 year ago

vstack got it to work. Thanks @rom1504, will close this!
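
For anyone landing here later, a minimal sketch of why `np.vstack` works where `np.hstack` does not, assuming each embedding is a 1-D float16 vector of length 768 as in this issue. The random vectors and the in-memory buffer are stand-ins for real embeddings and a real file path:

```python
import io

import numpy as np

# Stand-in embeddings: three 1-D float16 vectors of length 768 (random, for illustration).
rng = np.random.default_rng(0)
vectors = [rng.random(768).astype("float16") for _ in range(3)]

# hstack concatenates 1-D arrays end to end, so the result is still 1-D.
stacked_h = np.hstack(vectors)  # shape (2304,)

# vstack stacks them as rows, giving the (n_items, dim) matrix.
stacked_v = np.vstack(vectors)  # shape (3, 768)

# A single embedding can be turned into a one-row matrix with reshape.
single = vectors[0].reshape(1, -1)  # shape (1, 768)

# Round-trip through np.save / np.load using an in-memory buffer instead of a file.
buf = io.BytesIO()
np.save(buf, stacked_v)
buf.seek(0)
loaded = np.load(buf)

print(stacked_h.shape, stacked_v.shape, single.shape, loaded.shape)
```

The saved array is two-dimensional, which is the (items, dimensions) layout described above.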