Remove small chunks of minimal semantic value. These are a problem because they can match queries that would otherwise have no matches, purely because of metadata similarity.
Notes
Filtering at 15 tokens is a fairly arbitrary threshold, but I think it serves as a decent heuristic for now.
Looking at the database, all these "micro-chunks" are either the string "```" or page titles like "# Title Here". These make sense given our chunking algorithm, and removing them should cause no harm. I don't think we're losing any value here.
Example of tiny chunks getting matched:
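For reviewers who want the heuristic in concrete terms, here is a minimal sketch of the kind of filter described above. It is not the code in this PR: the `Chunk` shape, `countTokens`, and `removeMicroChunks` are hypothetical names, the whitespace-based token count is only a stand-in for a real tokenizer, and it assumes the threshold applies to the chunk body before any metadata front matter is prepended and that chunks with fewer than 15 tokens are the ones dropped.

```typescript
// Hypothetical shape for illustration; not the project's actual type.
interface Chunk {
  text: string;
  tokenCount: number;
}

// Stand-in tokenizer: counts whitespace-separated words.
// A real pipeline would use the tokenizer that matches its embedding model.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Keep only chunk bodies with at least `minTokens` tokens.
// The default of 15 mirrors the (admittedly arbitrary) threshold in the notes.
function removeMicroChunks(bodies: string[], minTokens = 15): Chunk[] {
  return bodies
    .map((text) => ({ text, tokenCount: countTokens(text) }))
    .filter((chunk) => chunk.tokenCount >= minTokens);
}

// Usage: a bare page title is dropped, a longer paragraph is kept.
const longBody = Array(20).fill("word").join(" ");
const keptChunks = removeMicroChunks(["# Title Here", longBody]);
console.log(keptChunks.length);
```

Making the threshold a parameter with a default keeps the heuristic easy to tune per data source if 15 turns out to be too aggressive or too lenient.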
```
"Chunks found: [
{
"sourceName": "snooty-cloud-docs",
"url": "https://mongodb.com/docs/atlas/atlas-search/tutorial/lookup-with-search/",
"score": 0.91689532995224,
"text": "---\ntags:\n - atlas\n - docs\nproductName: MongoDB Atlas\nversion: null\npageTitle: \"How to Run \"\nhasCodeBlock: true\n---\n\n```",
"tokenCount": 45,
"updated": "2023-09-01T06:05:07.609Z",
"metadata": {
"tags": [
"atlas",
"docs"
],
"productName": "MongoDB Atlas",
"version": null,
"pageTitle": "How to Run ",
"hasCodeBlock": true
},
"chunkIndex": 8
},
{
"sourceName": "snooty-cloud-docs",
"url": "https://mongodb.com/docs/atlas/atlas-search/tutorial/lookup-with-search/",
"score": 0.91689532995224,
"text": "---\ntags:\n - atlas\n - docs\nproductName: MongoDB Atlas\nversion: null\npageTitle: \"How to Run \"\nhasCodeBlock: true\n---\n\n```",
"tokenCount": 45,
"updated": "2023-09-01T06:05:07.479Z",
"metadata": {
"tags": [
"atlas",
"docs"
],
"productName": "MongoDB Atlas",
"version": null,
"pageTitle": "How to Run ",
"hasCodeBlock": true
},
"chunkIndex": 5
},
...
]
```
@cbush can you give this another look? I decided to make this a config option rather than use the transform, since I think it's generalizable enough that it ought to be part of the core functionality.
Jira: n/a