rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License
92 stars, 19 forks

No embeddings found in folder #60

loretoparisi closed this issue 5 months ago

loretoparisi commented 5 months ago

The embedding-reader module that reads numpy files, here in the NumpyReader class:

class NumpyReader:
    """Numpy reader class, implements init to read the files headers and call to produce embeddings batches"""

    def __init__(self, embeddings_folder):
        self.embeddings_folder = embeddings_folder
        self.fs, embeddings_file_paths = get_file_list(embeddings_folder, "npy")
        headers = get_numpy_headers(embeddings_file_paths, self.fs)
        self.headers = pd.DataFrame(
            headers,
            columns=["filename", "count", "count_before", "dimension", "dtype", "header_offset", "byte_per_item"],
        )

        self.count = self.headers["count"].sum()
        if self.count == 0:
            raise ValueError(f"No embeddings found in folder {embeddings_folder}") # <--- this error

fails to read a folder that actually contains the files. The check that fails is on the value of the count variable, which is 0.

I'm using embedding-reader through the autofaiss library, so I'm reporting the issue here.
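
For anyone trying to reproduce this, here is a minimal sketch of the same kind of check done by hand, assuming the embeddings sit on the local filesystem. `count_embeddings` is a hypothetical helper, not part of embedding-reader, and its glob only looks at the top level of the folder, which may differ from the library's own file listing.

```python
import glob
import os

import numpy as np


def count_embeddings(embeddings_folder):
    """Return (file_count, total_rows) for the .npy files directly inside a local folder.

    Hypothetical helper for debugging; it does not reuse embedding-reader code.
    """
    paths = sorted(glob.glob(os.path.join(embeddings_folder, "*.npy")))
    total_rows = 0
    for path in paths:
        # mmap_mode="r" parses the header and maps the data instead of loading it all
        array = np.load(path, mmap_mode="r")
        # A 0-dimensional array has no rows to count
        total_rows += array.shape[0] if array.ndim > 0 else 0
    return len(paths), total_rows
```

If this returns a file count of 0, the glob did not match anything under that path; if it returns files but 0 rows, the arrays themselves are empty.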

rom1504 commented 5 months ago

Did you check that it actually contains files?

On Thu, Mar 21, 2024, 8:23 AM Loreto Parisi @.***> wrote:

The embedding-reader module to read numpy files here https://github.com/rom1504/embedding-reader/blob/5e528f4d0b5a6225e50fe640c540abdd7e4d31a5/embedding_reader/numpy_reader.py#L77 in the NumpyReader class: [...]

loretoparisi commented 5 months ago

Did you check that it actually contains files?

Yes, the files are there (see here). It's actually just one .npy file, coming from the autofaiss indexing method build_index here. Thank you.
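
Since the folder holds a single .npy file, it may also help to look at what the path resolves to and what that file contains. The sketch below is only an illustration for a local filesystem; `describe_embeddings_path` is a hypothetical helper, not an embedding-reader or autofaiss function, and nothing in this thread confirms that the path is the cause.

```python
import os

import numpy as np


def describe_embeddings_path(path):
    """Report whether `path` is a folder or a single .npy file (hypothetical debugging helper)."""
    if os.path.isdir(path):
        # List only the .npy files directly inside the folder
        npy_files = sorted(f for f in os.listdir(path) if f.endswith(".npy"))
        return {"kind": "folder", "npy_files": npy_files}
    if os.path.isfile(path) and path.endswith(".npy"):
        # Memory-map the file so only the header needs to be parsed
        array = np.load(path, mmap_mode="r")
        return {"kind": "file", "shape": array.shape, "dtype": str(array.dtype)}
    return {"kind": "missing"}
```

A `kind` of "file" would mean the path points at the .npy file itself rather than at its parent folder, and a shape whose first dimension is 0 would mean the file holds no rows.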