vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

[Schema streaming mode] bm25 score is always zero #30220

Closed xansrnitu closed 7 months ago

xansrnitu commented 7 months ago

Describe the bug: I always get a bm25 score of zero, even though the target text field contains my search term.

To Reproduce Steps to reproduce the behavior:

  1. My schema and rank profile definition (Python):

    
    from vespa.package import (
        Document as VespaDoc,  # assumed alias; the report does not show its imports
        Field, FieldSet, FirstPhaseRanking, Function, HNSW, RankProfile, Schema,
    )

    # Streaming-mode schema; the "default" fieldset searches only "content".
    passage_schema = Schema(
        name="summaries",
        mode="streaming",
        document=VespaDoc(
            fields=[
                Field(name="id", type="string", indexing=["summary", "index"]),
                Field(name="title", type="string", indexing=["summary", "index"]),
                Field(name="page", type="int", indexing=["summary", "index"]),
                Field(name="metadata", type="map<string,string>", indexing=["summary", "index"]),
                Field(name="content", type="string", indexing=["summary", "index"]),
                Field(
                    name="embedding",
                    type="tensor<bfloat16>(x[384])",
                    indexing=["input content", "embed e5", "attribute", "index"],
                    ann=HNSW(distance_metric="angular"),
                    is_document_field=False,
                ),
            ]
        ),
        fieldsets=[FieldSet(name="default", fields=["content"])],
    )

    # Rank profile that ranks by bm25(content) and also returns it as a match feature.
    bm25 = RankProfile(
        name="bm25",
        functions=[Function(name="my_bm25", expression="bm25(content)")],
        first_phase=FirstPhaseRanking(expression="my_bm25"),
        match_features=["my_bm25"],
    )

    passage_schema.add_rank_profile(bm25)
  2. Now, if I run this query:

    app.query(
        yql="select content from summaries where userQuery()",
        groupname="xyz",
        query="Jaganmohan",
        hits=4,
        ranking="bm25",
    )

    This returns 'features': {'my_bm25': 0.0} even though the "content" field contains "Jaganmohan". I have tried other queries as well, and all of them give a score of 0.

Environment (please complete the following information):

Vespa version 8.279.6

Additional context: nativeRank(content) returns a positive score (> 0).

bratseth commented 7 months ago

Thanks for reporting. The bm25 feature is correct during ranking (e.g. if you output it as the first-phase score), but becomes 0 in summary-features/match-features.
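
One way to spot this symptom in a query result is to compare a hit's relevance (the first-phase score) with the value reported under match-features. The following is a minimal Python sketch, assuming the hit is a plain dict shaped like Vespa's JSON result with match-features under fields.matchfeatures; the helper name feature_lost_in_output and the sample values are hypothetical and not taken from the report.

```python
def feature_lost_in_output(hit: dict, feature: str) -> bool:
    """Return True when the hit was ranked with a non-zero score but the
    named feature is reported as exactly 0.0 in its match-features."""
    relevance = hit.get("relevance", 0.0)
    # Assumption: match-features are returned under fields.matchfeatures.
    features = hit.get("fields", {}).get("matchfeatures", {})
    return relevance != 0.0 and features.get(feature) == 0.0


# Hypothetical hit: ranked by bm25 (relevance > 0), but the feature output shows 0.0.
affected = {"relevance": 1.87, "fields": {"matchfeatures": {"my_bm25": 0.0}}}
# Hypothetical hit where the reported feature agrees with the ranking score.
healthy = {"relevance": 1.87, "fields": {"matchfeatures": {"my_bm25": 1.87}}}

print(feature_lost_in_output(affected, "my_bm25"))
print(feature_lost_in_output(healthy, "my_bm25"))
```

The check is only meaningful when the first-phase expression is the feature itself, as in the rank profile from the report.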

geirst commented 7 months ago

This is fixed in Vespa 8.302.40. System tests added in https://github.com/vespa-engine/system-test/pull/3322 and https://github.com/vespa-engine/system-test/pull/3327.
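
To check whether a deployment is new enough to include the fix, the version can be compared numerically rather than as a string. This is a small Python sketch, assuming version strings are three dot-separated integers and that any version at or after 8.302.40 contains the fix; parse_version and has_fix are hypothetical helper names.

```python
FIXED_IN = (8, 302, 40)  # version quoted in the comment above


def parse_version(text: str) -> tuple:
    """Split 'major.minor.patch' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(version: str) -> bool:
    """Assumption: every version at or after FIXED_IN includes the fix."""
    return parse_version(version) >= FIXED_IN


print(has_fix("8.279.6"))   # version from the original report
print(has_fix("8.302.40"))  # version named as containing the fix
```

Comparing tuples of integers avoids the string-ordering pitfall where "8.99.0" would sort after "8.302.40".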