infiniflow / infinity

The AI-native database built for LLM applications, providing incredibly fast hybrid search of dense vector, sparse vector, tensor (multi-vector), and full-text
https://infiniflow.org
Apache License 2.0
2.7k stars · 275 forks

ROADMAP 2024 #338

Open writinwaters opened 11 months ago

writinwaters commented 11 months ago

v0.6.0 planning

Core

Tools

v0.5.0

Core:

Tools

v0.4.0

Core:

Integration

API

Tools

v0.3.0

Core:

v0.2.0

v0.1.0

Backlog

Core

Integration

Tools

yuzhichang commented 11 months ago

CI improvements: post Infinity's logs when a CI run fails, use Ubuntu 20.04 as the base of the dev image, and add fuzz testing of Infinity.
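
A minimal sketch of what "post logs on CI failure" might look like as a GitHub Actions step. The log path and artifact name are hypothetical; the actual workflow layout would differ:

```yaml
- name: Upload infinity logs on failure
  if: failure()                    # runs only when an earlier step failed
  uses: actions/upload-artifact@v4
  with:
    name: infinity-logs
    path: /var/infinity/log/       # hypothetical log directory
```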

cjkbjhb commented 11 months ago

Secordary index on structured data type. ---> Secondary index on structured data types.

There is a spelling error here.

JinHai-CN commented 11 months ago
  • Secondary

Fixed, thank you.

yuzhichang commented 10 months ago

Compatibility testing

Image tags for reference:

- centos: 7, 8 (https://hub.docker.com/_/centos/)
- ubuntu: 20.04, 22.04, 24.04 (https://hub.docker.com/_/ubuntu, https://releases.ubuntu.com/)
- debian: 8, 9, 10, 11, 12 (https://hub.docker.com/_/debian, https://www.debian.org/releases/)
- opensuse/leap: 15.0, 15.1, 15.2, 15.3, 15.4, 15.5 (https://hub.docker.com/r/opensuse/leap)
- openeuler/openeuler: 20.03, 22.03 (https://hub.docker.com/r/openeuler/openeuler)
- openanolis/anolisos: 8.6, 23 (https://hub.docker.com/r/openanolis/anolisos)
- openkylin/openkylin: 1.0 (https://hub.docker.com/r/openkylin/openkylin)

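
A sketch of how this matrix could drive automated smoke tests. The image names and tags are taken from the list above; the test script name is a hypothetical placeholder, not an actual file in the repo:

```python
# Generate one `docker run` smoke-test command per (image, tag) pair
# in the compatibility matrix above.
MATRIX = {
    "centos": ["7", "8"],
    "ubuntu": ["20.04", "22.04", "24.04"],
    "debian": ["8", "9", "10", "11", "12"],
    "opensuse/leap": ["15.0", "15.1", "15.2", "15.3", "15.4", "15.5"],
    "openeuler/openeuler": ["20.03", "22.03"],
    "openanolis/anolisos": ["8.6", "23"],
    "openkylin/openkylin": ["1.0"],
}

def smoke_test_commands(test_cmd="./scripts/compat_smoke.sh"):
    """Yield a `docker run` command per image:tag (test_cmd is hypothetical)."""
    for image, tags in MATRIX.items():
        for tag in tags:
            yield f"docker run --rm -v $PWD:/infinity {image}:{tag} {test_cmd}"

cmds = list(smoke_test_commands())
```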
Kelvinyu1117 commented 10 months ago

I would like to contribute to this project, which issue would be a good start?

JinHai-CN commented 10 months ago

@Kelvinyu1117 We do have a couple of issues that might work for contributors new to this project.

  1. Add minmax information to blocks/segments in the current datastore. This information is primarily used for data filtering. (#448)
  2. Implement a bloomfilter for the blocks/segments to enhance point queries. (#467)
  3. Currently, query results are stored in memory in a columnar format, but the client expects them in Apache Arrow format. The conversion is currently performed in the Python client, which hurts performance, so we plan to convert the results to Apache Arrow format on the server side before sending them to the client.
  4. There are several optimizer rules to implement, such as constant folding and simplification of arithmetic expressions, which are not yet on the roadmap. Feel free to work on them if interested.
  5. We have additional, more complicated tasks not listed here. For instance, the current executor runs one thread per CPU. We're considering using coroutines to improve efficiency, but we don't have a solid solution yet. If you have experience in this area, you are very welcome to propose a solution.
  6. If you'd rather not contribute C++ code, there is also unimplemented Python code, such as test cases and the Python SDK API.
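
For item 1, a minimal Python sketch of the zone-map idea behind minmax filtering: each block stores the min and max of a column, so a filter like `x > v` can skip a block whose max fails the predicate without scanning its rows. All names are illustrative, not Infinity's actual API:

```python
# Zone-map (minmax) pruning sketch: per-block min/max statistics
# let a scan skip or wholesale-accept blocks before reading rows.
class BlockStats:
    def __init__(self, values):
        self.values = list(values)
        self.min = min(self.values)
        self.max = max(self.values)

def scan_greater_than(blocks, v):
    """Return all rows with value > v, pruning blocks via minmax stats."""
    out = []
    for b in blocks:
        if b.max <= v:              # whole block fails the predicate: skip
            continue
        if b.min > v:               # whole block passes: take it wholesale
            out.extend(b.values)
            continue
        out.extend(x for x in b.values if x > v)  # partial overlap: scan rows
    return out
```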
abdullah-alnahas commented 6 months ago

Your work is exceptional! I would like to propose that, considering the current landscape, incorporating binary quantization and ColBERT-like ranking would be crucial for any vector database. Apologies for commenting on the road map issue instead of creating a separate feature request.
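
For context, a minimal Python sketch of sign-bit binary quantization: each float dimension becomes one bit, and similarity reduces to a Hamming distance on the packed bits. Illustrative only; a production vector database would pack bits per machine word and rerank the Hamming candidates with full-precision vectors:

```python
# Binary quantization sketch: 1 bit per dimension (sign bit),
# with Hamming distance as the coarse similarity measure.
def quantize(vec):
    """Pack a float vector into an int, one bit per dimension (1 if >= 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")

q = quantize([0.3, -1.2, 0.0, 2.5])   # bits 0, 2, 3 set -> 0b1101
```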

JinHai-CN commented 6 months ago

> Your work is exceptional! I would like to propose that, considering the current landscape, incorporating binary quantization and ColBERT-like ranking would be crucial for any vector database. Apologies for commenting on the road map issue instead of creating a separate feature request.

Nice, we will put this request into the v0.2.0 release.

niebayes commented 6 months ago

@JinHai-CN Hi, I have experience developing a database with Arrow. Is the issue about converting query results to Arrow format still open? I'd like to take it.

JinHai-CN commented 6 months ago

@niebayes #1198 is created; we can discuss the requirements in that issue.