[Question]: Different similarity_top_k values return data with significant differences and varying scores

sekingme commented 3 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

My method looks like this:

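from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

def retrieve_data(self, collection_name: str, similarity_top_k: int, query: str, **kwargs) -> list:
        result = []
        try:
            # Look up the Chroma collection that holds the vectors
            chroma_collection = self.chroma_client.get_collection(name=collection_name)

            # Wrap the collection in a vector store and build an index on top of it
            vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
            index = VectorStoreIndex.from_vector_store(vector_store)

            # Retrieve the top-k most similar nodes for the query
            retriever = index.as_retriever(similarity_top_k=similarity_top_k)
            response = retriever.retrieve(query)

            # Flatten the results into [content, score, content, score, ...]
            for node in response:
                result.append(node.get_content())
                result.append(node.get_score())

        except Exception as e:
            log.error(f"Failed to retrieve {collection_name} data from chroma store. {e}", exc_info=True)
            # Re-raise with the original traceback
            raise

        return result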

I call the method like this: vector.retrieve_data(collection_name="hive", similarity_top_k=15, query="视频被踩的累计快照事实表")

The query means roughly "cumulative snapshot fact table of video dislikes". With similarity_top_k set to 15, the top 3 results and their scores are:

top1: {'表信息': {'db': 'algo', 'table': 'video_statistical_table', '字段': [{'name': 'video_id'}, {'name': 'vv'}, {'name': 'product_click'}, {'name': 'effect_click'}, {'name': 'music_click'}, {'name': 'duet_click'}, {'name': 'topic_click'}, {'name': 'follow_cnt'}, {'name': 'like_cnt'}, {'name': 'click_cnt'}, {'name': 'shared_cnt'}, {'name': 'comments_cnt'}, {'name': 'exposed'}, {'name': 'country_region'}], '分区': [{'name': 'day'}]}}
score: 0.32294393044439756

top2: {'表信息': {'db': 'algo', 'table': 'video_valid_play_table', '字段': [{'name': 'country_region'}, {'name': 'video_id'}, {'name': 'valid_play_count'}, {'name': 'dispatch_cnt'}, {'name': 'impression_cnt'}, {'name': 'play_cnt'}, {'name': 'like_cnt'}, {'name': 'follow_cnt'}, {'name': 'share_cnt'}, {'name': 'comment_cnt'}, {'name': 'complete_cnt'}, {'name': 'pos_cnt'}, {'name': 'fix_valid_play_count'}, {'name': 'dur'}, {'name': 'play_second'}, {'name': 'fix_complete_cnt'}], '分区': [{'name': 'day'}]}}
score: 0.3165133714145665

top3: {'表信息': {'db': 'mysql_tb', 'table': 'welog_tbl_video_counter', '字段': [{'name': 'rdeleted'}, {'name': 'rversion'}, {'name': 'post_id'}, {'name': 'comment_count'}, {'name': 'like_count'}, {'name': 'play_count'}, {'name': 'share_count'}, {'name': 'update_time'}, {'name': 'robot_comment_count'}, {'name': 'robot_like_count'}, {'name': 'robot_play_count'}], '分区': []}}
score: 0.3164510076639161

But with similarity_top_k set to 16, the top 3 results and their scores are:

top1: {'中文名': '视频被踩的累计快照事实表(likee-汇总-社交互动)', '表信息': {'db': 'like_dw_sid', 'table': 'dws_like_sid_his_acc_video_dislike_producer', 'desc': '视频被踩的累计快照事实表', '字段': [{'name': 'video_id', 'desc': '视频id', 'type': 'bigint'}, {'name': 'video_author_uid', 'desc': '视频作者id', 'type': 'bigint'}, {'name': 'video_create_time', 'desc': '视频生产时间', 'type': 'bigint'}, {'name': 'his_acc_video_dislike_count_02', 'desc': '历史累计视频被踩次数', 'type': 'bigint'}, {'name': 'first_video_dislike_dt_02', 'desc': '首次视频被踩日期', 'type': 'string'}, {'name': 'latest_video_dislike_dt_02', 'desc': '最近1次视频被踩日期', 'type': 'string'}], '分区': [{'name': 'day', 'desc': '数据上报日期', 'type': 'string'}]}}
score: 0.442986748854532

top2: {'中文名': '视频被踩的汇总事实表(likee-汇总-社交互动)', '表信息': {'db': 'like_dw_sid', 'table': 'dws_like_sid_video_dislike_producer_1d_01', 'desc': '视频被踩的汇总事实表', '字段': [{'name': 'video_id', 'desc': '视频id', 'type': 'bigint'}, {'name': 'video_author_uid', 'desc': '视频作者id', 'type': 'bigint'}, {'name': 'video_author_hdid', 'desc': '视频作者hdid', 'type': 'string'}, {'name': 'video_create_time', 'desc': '视频创建时间', 'type': 'bigint'}, {'name': 'country', 'desc': '国家', 'type': 'string'}, {'name': 'os', 'desc': '手机操作系统', 'type': 'string'}, {'name': 'refer_list', 'desc': '视频列表', 'type': 'string'}, {'name': 'video_dislike_count_1d_02', 'desc': '最近1天视频被踩次数', 'type': 'bigint'}], '分区': [{'name': 'day', 'desc': '数据上报日期', 'type': 'string'}]}}
score: 0.40496785241401567

top3: {'中文名': '历史累计快照事实表(likee-汇总-社交互动)', '表信息': {'db': 'like_dw_sid', 'table': 'dws_like_sid_his_acc_video_share_viewer', 'desc': '历史累计快照事实表', '字段': [{'name': 'uid', 'desc': '用户id', 'type': 'bigint'}, {'name': 'hdid', 'desc': '海度id', 'type': 'string'}, {'name': 'his_acc_video_share_send_count_01', 'desc': '历史累计视频分享次数', 'type': 'bigint'}, {'name': 'first_video_share_send_dt_01', 'desc': '首次视频分享日期', 'type': 'string'}, {'name': 'latest_video_share_send_dt_01', 'desc': '最近一次视频分享日期', 'type': 'string'}, {'name': 'his_acc_video_download_count_01', 'desc': '历史累计视频下载次数', 'type': 'bigint'}, {'name': 'first_video_download_dt_01', 'desc': '首次视频下载日期', 'type': 'string'}, {'name': 'latest_video_download_dt_01', 'desc': '最近一次视频下载日期', 'type': 'string'}], '分区': [{'name': 'day', 'desc': '数据上报日期', 'type': 'string'}]}}
score: 0.3822781247007505

Both the returned data and the scores differ significantly. Why? Only when similarity_top_k is set to 16 or larger does it return the right results.

logan-markewich commented 3 months ago

Chroma uses HNSW (an approximate method) to search. Likely some symptom of that?
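
To illustrate why this points at the approximate index: with an exact (brute-force) search, the top 3 for k=15 are always the first 3 for k=16, so a change in the top 3 like the one reported cannot come from exact ranking. Below is a minimal sketch with toy random vectors. It is not Chroma's or llama-index's code; `exact_top_k`, `cosine_similarity`, and the data are made up for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every vector against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): 200 random 8-dimensional vectors and one query.
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(8)]

top15 = exact_top_k(query, vectors, 15)
top16 = exact_top_k(query, vectors, 16)

# Exact search ranks everything first and then truncates,
# so a smaller k is always a prefix of a larger k.
prefix_stable = top15[:3] == top16[:3]
print(prefix_stable)
```

An approximate index such as HNSW only visits part of the graph, and how much it visits can depend on k, which is one plausible way for the top results to change when k goes from 15 to 16.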

sekingme commented 3 months ago

> Chroma uses HNSW (an approximate method) to search. Likely some symptom of that?

@logan-markewich When I use Milvus instead, it shows the same symptom.

logan-markewich commented 3 months ago

🤷🏻 Probably the same issue with HNSW? I don't think this issue is related to llama-index. If you used Milvus or Chroma directly, I would expect similar behavior.

sekingme commented 3 months ago

> 🤷🏻 Probably the same issue with HNSW? I don't think this issue is related to llama-index. If you used Milvus or Chroma directly, I would expect similar behavior.

Is there any way to fix it?

sekingme commented 3 months ago

I've already fixed it myself, thanks all.
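
The fix itself is not described in this thread. One common workaround for this kind of approximate-search symptom, consistent with the observation that a larger similarity_top_k returns the right rows, is to request more candidates than needed and keep only the best few by score. The sketch below assumes a hypothetical `retrieve_fn(query, k)` callable that returns (content, score) pairs; in practice it might wrap `index.as_retriever(similarity_top_k=k).retrieve(query)`. The `fake_retrieve` stand-in and its threshold of 16 are invented to mimic the behavior reported above.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (content, similarity score)


def retrieve_with_overfetch(
    retrieve_fn: Callable[[str, int], List[Scored]],
    query: str,
    top_k: int,
    overfetch_factor: int = 6,
) -> List[Scored]:
    """Ask the retriever for more candidates than needed, then keep the top_k by score."""
    fetch_k = max(top_k, top_k * overfetch_factor)
    candidates = retrieve_fn(query, fetch_k)
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def fake_retrieve(query: str, k: int) -> List[Scored]:
    """Toy stand-in (hypothetical): only finds the best match when k >= 16."""
    weak = [(f"weak-{i}", 0.32 - i * 0.001) for i in range(k)]
    if k >= 16:
        return [("best", 0.44)] + weak[: k - 1]
    return weak


direct = fake_retrieve("some query", 3)  # misses the best match
results = retrieve_with_overfetch(fake_retrieve, "some query", top_k=3)
print([content for content, _ in results])
```

Over-fetching costs some extra latency per query. Raising the index's own search breadth is another option, but the setting names differ between Chroma and Milvus, so check the store's documentation.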

run-llama / llama_index

[Question]: Different similarity_top_k values return data with significant differences and varying scores #14923
