unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Bug: cannot open old database (created with 2.9.2) with new version (2.12.0) #423

Open zh217 opened 1 month ago

zh217 commented 1 month ago

Describe the bug

Newer versions of the library cannot open a database created with older versions.

Steps to reproduce

With the python client version 2.12.0:

usearch.index.Index.restore(idx_path, view=True)

results in

ValueError: Unsupported metric!

where the datafile was created with version 2.9.2 with:

usearch.index.Index(ndim=1024, metric='ip')

Version 2.9.2 can open the datafile without problems.

On further testing, all versions from 2.10.0 onwards fail to open the database.

Expected behavior

Version 2.12.0 should be able to open a database created with version 2.9.2, as the version numbers do not indicate any breaking changes.

USearch version

v2.12.0

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Python bindings

ashvardanian commented 1 month ago

There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

Does it also fail if you create an arbitrary index, and then call .load - reinitializing it with a different file?

zh217 commented 1 month ago

It fails with a different error:

RuntimeError: Key type doesn't match, consider rebuilding

triggered by the following code:

idx = usearch.index.Index(ndim=1024, metric='ip')
idx.load(idx_path)

which runs fine if downgraded to 2.9.2.

ashvardanian commented 1 month ago

Interesting. Any chance the file was corrupted somewhere in between?

zh217 commented 1 month ago

No. Here is a minimal example that you can test:

# run with usearch-2.9.2 installed

import usearch.index

idx = usearch.index.Index(ndim=1024, metric='ip')
idx.save('index')

# run with usearch-2.12.0 installed

import usearch.index

# will throw an error in usearch-2.12.0
idx = usearch.index.Index.restore('index', view=True)

There's no need to insert anything into the database to trigger the error, so the problem is most likely in the metadata written by the old version.

zh217 commented 1 month ago

Update: the incompatibility goes both ways. The old version cannot open databases created by the new version either.

zh217 commented 1 month ago

> There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

In fact, the file format did change, due to a subtle change in the code.

Compare:

https://github.com/unum-cloud/usearch/blob/5ea48c87c56a25ab57634a8f207f80ae675ed58e/include/usearch/index_plugins.hpp#L122-L142

with:

https://github.com/unum-cloud/usearch/blob/f79d8180122c717203b74f7a7473964c413cb5c1/include/usearch/index_plugins.hpp#L128-L148

so different versions interpret enums in the metadata differently.

As the metadata stored on disk also contains version information, we can make new versions of the library open old databases by mapping the old enum values to the new ones. There seems to be no easy fix for the reverse direction, however.
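
To make the idea concrete, here is a minimal Python sketch of that mapping. It is not the library's code: the enum members and their integer values are hypothetical stand-ins for the real definitions in index_plugins.hpp, and `upgrade_kind` is a made-up helper name. The only detail taken from this thread is the 2.10.0 cutoff, which is the first release that failed in my testing.

```python
from enum import IntEnum


# Hypothetical layouts: names and values are illustrative only,
# not copied from index_plugins.hpp.
class OldKind(IntEnum):
    UNKNOWN = 0
    F64 = 1
    F32 = 2
    F16 = 3


class NewKind(IntEnum):
    UNKNOWN = 0
    BF16 = 1  # a member inserted mid-enum shifts every value after it
    F64 = 2
    F32 = 3
    F16 = 4


# Match members by name so each old on-disk integer maps to the right new member.
OLD_TO_NEW = {old.value: NewKind[old.name] for old in OldKind}


def upgrade_kind(raw: int, file_version: tuple) -> NewKind:
    """Decode a stored integer using the layout implied by the file's version."""
    if file_version < (2, 10, 0):  # assumed cutoff: first release that failed here
        return OLD_TO_NEW[raw]
    return NewKind(raw)


stored = int(OldKind.F32)  # what an old file would hold in this sketch
print(NewKind(stored).name)                  # naive decode -> F64 (wrong)
print(upgrade_kind(stored, (2, 9, 2)).name)  # version-aware decode -> F32
```

The naive decode silently picks a different member, which would explain errors such as "Unsupported metric!" rather than a clean version mismatch.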

As this definitely breaks compatibility between versions (affecting all f16, f32, f64 indices and all languages), this should be marked as a breaking change.

zh217 commented 1 month ago

We can localize the damage by changing what is returned by this function:

https://github.com/unum-cloud/usearch/blob/5ea48c87c56a25ab57634a8f207f80ae675ed58e/include/usearch/index_dense.hpp#L176-L236

Since the result is returned in various places inside the function, maybe it is best to add a method on index_dense_metadata_result_t to "upgrade" its version to the new enum by mutating its headers appropriately.

I can make a pull request for it if that's OK.

ashvardanian commented 1 month ago

Good catch @zh217! I think a good solution would be a custom function to convert enum to integer and vice-versa, with respect to the file version. Can you add it in index_plugins?
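
A rough Python sketch of what such a pair of conversion functions could look like, assuming a table of enum layouts keyed by the first library version that uses each one. The layout contents and the names `LAYOUTS`, `kind_to_int`, and `int_to_kind` are hypothetical; the real implementation would live in C++ in index_plugins and use the actual enum definitions.

```python
# Hypothetical layouts, each tagged with the first version that uses it.
# The position of a name in the list is the integer stored on disk.
LAYOUTS = [
    ((0, 0, 0), ["unknown", "f64", "f32", "f16"]),
    ((2, 10, 0), ["unknown", "bf16", "f64", "f32", "f16"]),
]


def layout_for(version: tuple) -> list:
    """Return the newest layout introduced at or before `version`."""
    chosen = LAYOUTS[0][1]
    for since, names in LAYOUTS:
        if version >= since:
            chosen = names
    return chosen


def kind_to_int(name: str, version: tuple) -> int:
    """Encode a kind as the integer that a file of `version` would store."""
    names = layout_for(version)
    if name not in names:
        raise ValueError(f"{name!r} cannot be stored in a {version} file")
    return names.index(name)


def int_to_kind(raw: int, version: tuple) -> str:
    """Decode a stored integer using the layout of the file's version."""
    names = layout_for(version)
    if not 0 <= raw < len(names):
        raise ValueError(f"value {raw} is not valid for a {version} file")
    return names[raw]


print(kind_to_int("f32", (2, 9, 2)), kind_to_int("f32", (2, 12, 0)))  # 2 3
print(int_to_kind(2, (2, 9, 2)))  # f32
```

Keeping both directions in one table means a future enum change only needs a new `LAYOUTS` entry, and a kind that an older layout lacks fails loudly instead of being written as the wrong integer.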