rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

AttributeError: 'NoneType' object has no attribute 'group' #34

Closed zhenzi0322 closed 7 months ago

zhenzi0322 commented 1 year ago
from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(embeddings_folder="./data/test1/imgs", file_format="npy")

print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)

error:

Traceback (most recent call last):
  File "/home/kemove/anaconda3/envs/dalle2/lib/python3.9/site-packages/embedding_reader/numpy_reader.py", line 39, in file_to_header
    return (None, [filename, *read_numpy_header(f)])
  File "/home/kemove/anaconda3/envs/dalle2/lib/python3.9/site-packages/embedding_reader/numpy_reader.py", line 28, in read_numpy_header
    shape = (int(result.group(1)), int(result.group(2)))
AttributeError: 'NoneType' object has no attribute 'group'
rom1504 commented 1 year ago

Are your embeddings saved as NumPy (.npy) files?

Can you share the first 50 bytes of one file here?
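
One way to get those bytes is sketched below. It is only an illustration: `head_bytes` and the demo file are hypothetical names, and the demo file is built by hand following the NPY version 1.0 layout (magic string, version, header length, header dictionary) so that the snippet runs without any real data. On a real file, `head_bytes` would simply be pointed at one of the `.npy` files in the embeddings folder.

```python
import os
import struct
import tempfile


def head_bytes(path, n=50):
    """Return the first n raw bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(n)


# Build a small stand-in .npy file by hand (NPY format version 1.0),
# describing a float32 array of shape (4, 8) filled with zeros.
header = "{'descr': '<f4', 'fortran_order': False, 'shape': (4, 8), }"
# Pad with spaces so that the 10-byte prefix plus the header, which must
# end with a newline, is a multiple of 64 bytes long.
pad = 64 - (10 + len(header) + 1) % 64
header_bytes = (header + " " * pad + "\n").encode("latin1")

demo_path = os.path.join(tempfile.mkdtemp(), "demo.npy")
with open(demo_path, "wb") as f:
    f.write(b"\x93NUMPY")                          # magic string
    f.write(b"\x01\x00")                           # format version 1.0
    f.write(struct.pack("<H", len(header_bytes)))  # header length, little-endian
    f.write(header_bytes)                          # header dictionary
    f.write(bytes(4 * 8 * 4))                      # 32 float32 zeros

head = head_bytes(demo_path)
print(head)
```

The printed bytes begin with the `\x93NUMPY` magic string, followed by the start of the header dictionary that records the dtype and shape.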

zhenzi0322 commented 1 year ago

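I save the .npy file with the following code.

0.jpg (attached image: https://user-images.githubusercontent.com/34839719/207897811-05699b87-8be3-4329-bdfa-2627e7b87a70.jpg)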

import numpy as np
from PIL import Image

def main(img_file):
    im = Image.open(img_file)
    img_np = np.array(im)
    np.save("0.npy", img_np)

if __name__ == '__main__':
    main(img_file='0.jpg')
rom1504 commented 1 year ago

Embedding reader works with embeddings of a fixed dimension.

Images are two-dimensional and their size is not fixed.

I recommend you use webdataset instead.
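
To illustrate why the traceback ends in `'NoneType' object has no attribute 'group'`, here is a minimal sketch. It assumes the reader extracts the shape from the `.npy` header with a regular expression that expects exactly two integers; the pattern `SHAPE_2D`, the helper `parse_2d_shape`, and both header strings are illustrative and not copied from the library. An array produced by `np.array(Image.open(...))` for an RGB image has three axes (height, width, channels), so such a pattern finds no match.

```python
import re

# Assumed pattern: a shape made of exactly two integers, (count, dimension).
SHAPE_2D = re.compile(r"'shape': \(([0-9]+), ([0-9]+)\)")


def parse_2d_shape(header):
    """Return (count, dimension), or None if the header is not a 2-D shape."""
    match = SHAPE_2D.search(header)
    if match is None:
        # Calling match.group(1) here without this check would raise the
        # AttributeError reported in this issue.
        return None
    return (int(match.group(1)), int(match.group(2)))


# Header text of the form NumPy writes for 1000 float32 embeddings of size 512.
embedding_header = "{'descr': '<f4', 'fortran_order': False, 'shape': (1000, 512), }"
# Header text of the form NumPy writes for a 256x256 RGB image stored as uint8.
image_header = "{'descr': '|u1', 'fortran_order': False, 'shape': (256, 256, 3), }"

embedding_shape = parse_2d_shape(embedding_header)
image_shape = parse_2d_shape(image_header)
print(embedding_shape)
print(image_shape)
```

Under that assumption, only files holding a single 2-D array of shape (number of embeddings, embedding dimension) can be read, which is what an embedding model produces and not what a raw image array looks like.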

zhenzi0322 commented 1 year ago

How does webdataset convert my local jpg images and text into .npy files?

rom1504 commented 1 year ago

It doesn't; you don't need .npy files for images.
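
As a rough sketch of what that means in practice: a webdataset shard is a plain tar archive in which files sharing a basename (for example `000000.jpg` and `000000.txt`) form one sample, so the original image bytes and caption are stored as they are. The helper `write_shard`, the shard name, and the placeholder image bytes below are hypothetical, and only the Python standard library is used here instead of the webdataset package itself.

```python
import io
import os
import tarfile
import tempfile


def write_shard(shard_path, samples):
    """Write (key, image_bytes, caption) samples into one tar shard."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, caption in samples:
            entries = ((".jpg", image_bytes), (".txt", caption.encode("utf-8")))
            for ext, payload in entries:
                info = tarfile.TarInfo(name=key + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Placeholder bytes stand in for the content of a real JPEG file.
samples = [("000000", b"\xff\xd8placeholder-jpeg-bytes\xff\xd9", "a photo of a cat")]

shard_path = os.path.join(tempfile.mkdtemp(), "shard-000000.tar")
write_shard(shard_path, samples)

with tarfile.open(shard_path) as tar:
    names = tar.getnames()
print(names)
```

With real data, `image_bytes` would be the result of reading the local `.jpg` file in binary mode, and no conversion to `.npy` takes place.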

zhenzi0322 commented 1 year ago

ok.

img2dataset --url_list=myimglist.txt --output_folder=image_folder --thread_count=3 --image_size=256

clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder

https://github.com/rom1504/clip-retrieval#clip-inference

These commands do not generate any data.

rom1504 commented 7 months ago

Please reopen if you still have the problem.