iusztinpaul / hands-on-llms

🦖 𝗟𝗲𝗮𝗿𝗻 about 𝗟𝗟𝗠𝘀, 𝗟𝗟𝗠𝗢𝗽𝘀, and 𝘃𝗲𝗰𝘁𝗼𝗿 𝗗𝗕𝘀 for free by designing, training, and deploying a real-time financial advisor LLM system ~ 𝘴𝘰𝘶𝘳𝘤𝘦 𝘤𝘰𝘥𝘦 + 𝘷𝘪𝘥𝘦𝘰 & 𝘳𝘦𝘢𝘥𝘪𝘯𝘨 𝘮𝘢𝘵𝘦𝘳𝘪𝘢𝘭𝘴
MIT License
3.11k stars, 483 forks

Fix Payload Duplication Bug in build_payloads Function by Copying Metadata for Each Chunk #89

Open suryadevarapranav opened 2 weeks ago

suryadevarapranav commented 2 weeks ago

This pull request resolves issue #88.

Description

Overview: This pull request fixes duplicated text in the payloads generated by the build_payloads function. Previously, every payload ended up with the same text because all payloads shared a single reference to doc.metadata, so each in-place update changed every payload at once.
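The aliasing behavior described above can be reproduced in isolation. This is a minimal sketch, not the repository's code: `build_payloads_buggy`, its signature, and the sample headline are hypothetical simplifications, and the real function works on document objects rather than plain dicts.

```python
def build_payloads_buggy(metadata: dict, chunks: list[str]) -> list[dict]:
    """Hypothetical reduction of the bug: every payload is the same dict object."""
    payloads = []
    for chunk in chunks:
        payload = metadata               # no copy: just another name for the same dict
        payload.update({"text": chunk})  # in-place update, visible through every alias
        payloads.append(payload)
    return payloads


# Illustrative metadata; only the date is taken from the query in this PR.
metadata = {"headline": "Example headline", "date": "2024-01-01T13:15:32+00:00"}
payloads = build_payloads_buggy(metadata, ["first chunk", "second chunk", "third chunk"])

texts = [p["text"] for p in payloads]
print(texts)  # ['third chunk', 'third chunk', 'third chunk']
```

Because the list holds three references to one dictionary, the text written for the last chunk is what every payload reports.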

Changes Made:

Updated build_payloads to create a shallow copy of doc.metadata for each chunk. This ensures each payload dictionary is independent, containing the correct text for each specific chunk. Added a comment for clarity.
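A minimal sketch of the per-chunk copy, under the assumption that the metadata is a plain dict and the chunks are strings. The name `build_payloads_fixed` and its signature are hypothetical and do not mirror the exact code in this repository.

```python
def build_payloads_fixed(metadata: dict, chunks: list[str]) -> list[dict]:
    """Hypothetical sketch: give each chunk its own shallow copy of the metadata."""
    payloads = []
    for chunk in chunks:
        payload = metadata.copy()  # new top-level dict per chunk
        payload["text"] = chunk    # only this payload is modified
        payloads.append(payload)
    return payloads


# Illustrative metadata; only the date is taken from the query in this PR.
metadata = {"headline": "Example headline", "date": "2024-01-01T13:15:32+00:00"}
payloads = build_payloads_fixed(metadata, ["first chunk", "second chunk", "third chunk"])

texts = [p["text"] for p in payloads]
print(texts)  # ['first chunk', 'second chunk', 'third chunk']
```

A shallow copy is enough here because only the top-level "text" key is written per chunk. Nested values such as lists would still be shared between payloads, so `copy.deepcopy` would be the safer choice if those were ever mutated.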

Testing: Tested the function with multiple chunks, confirming that each payload now contains the expected text and metadata without interference.

Example:

News document from Alpaca.

image

Payloads uploaded to Qdrant.

image

Qdrant query used to verify the fix.

POST collections/alpaca_news/points/scroll
{
  "filter": {
    "must": [
      {
        "key": "date",
        "match": {
          "text": "2024-01-01T13:15:32+00:00"
        }
      }
    ]
  }
}
suryadevarapranav commented 2 weeks ago

Hello @iusztinpaul @Paulescu @Joywalker, could you please take a look at my pull request when you find some time?

Thanks, Deva.