microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License
1.52k stars, 293 forks

[Question] does content.url in filename for websites make sense? (I want attribution per paragraph via separate prompt) #491

Closed: chaelli closed this 4 months ago

chaelli commented 4 months ago

Context / Scenario

I changed the prompt so that the LLM includes the source for each paragraph of the answer, which lets me align the response more closely with the facts for my users. When I do that, I can only tell it to reference the filename, because that is what the LLM gets in the facts part of the prompt. For websites this is always "content.url", because it is set that way in https://github.com/microsoft/kernel-memory/blob/a1f280c42c4df9a60d1d5cecf0633d07ff927b1b/service/Core/MemoryService.cs#L120

Question

I wonder whether it would make more sense to put the URL there instead of a static string, or at least to include the URL in the facts where it exists.

dluc commented 4 months ago

You should be able to swap content.url with the URL upon receiving the response; there is a property with the URL.
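A minimal sketch of that post-processing step, written in Python for illustration. It assumes the answer arrives as JSON with a `text` field and a `relevantSources` list whose entries carry `sourceName` and `sourceUrl`; those field names are assumptions and should be checked against the actual response. The function name `swap_placeholder` is hypothetical.

```python
def swap_placeholder(answer: dict, placeholder: str = "content.url") -> str:
    """Replace the static source name in the answer text with the real URL.

    The swap is only done when exactly one distinct URL is attached to the
    placeholder name; otherwise the text is returned unchanged, because it
    is impossible to tell which URL a given mention refers to.
    """
    # Collect the distinct URLs of all sources that share the placeholder name.
    urls = {
        src["sourceUrl"]
        for src in answer.get("relevantSources", [])
        if src.get("sourceName") == placeholder and src.get("sourceUrl")
    }
    text = answer.get("text", "")
    if len(urls) != 1:
        return text  # zero or several candidates: ambiguous, leave as-is
    return text.replace(placeholder, next(iter(urls)))
```

With several web pages among the sources the function leaves the text untouched, since every page carries the same name.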

chaelli commented 4 months ago

This only works if there is just one relevant source. If there are multiple, they are all called content.url, so I cannot tell which part of the answer is based on which page or align separate sources to separate paragraphs. FYI, until I started using Kernel Memory, I just used a prompt like this:

Add a source reference to the end of each sentence. e.g. Apple is a fruit ([Reference page title](Reference page url)) (markdown link formatting). ...
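For illustration, here is a rough Python sketch of how an answer produced with such a prompt could be checked afterwards: it pulls the markdown links out of the text and reports any URL that is not among the pages actually retrieved. The helper names `cited_urls` and `unknown_citations` are hypothetical, and the regular expression only covers simple `[title](url)` links.

```python
import re

# Matches simple markdown links: [title](http...url)
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")

def cited_urls(answer_text: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs for every markdown link in the answer."""
    return LINK.findall(answer_text)

def unknown_citations(answer_text: str, known_urls: set[str]) -> list[str]:
    """Return cited URLs that do not belong to any retrieved source."""
    return [url for _, url in cited_urls(answer_text) if url not in known_urls]
```

A non-empty result from `unknown_citations` suggests the model cited a page that was not part of the facts it was given.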

chaelli commented 4 months ago

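@dluc Do you have any preference between the options:

* replace "content.url" during indexing with the real URL value?

* adding the URL as an additional value in the prompt?

Or none of them?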

dluc commented 4 months ago

> @dluc Do you have any preference between the options:
>
> * replace "content.url" during indexing with the real URL value?
>
> * adding the URL as an additional value in the prompt?
>
> Or none of them?

I would try the approach with the prompt; it should be easier. Changing the indexing pipeline might have unexpected impact.
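A rough sketch of the prompt-side option, in Python for illustration. It assumes each retrieved partition is available as a dict with `source_name`, `source_url`, and `text` keys, and that the facts are rendered as one header line followed by the text; both the keys and the header layout are assumptions made for this example, not the library's actual format.

```python
def format_fact(partition: dict) -> str:
    """Render one fact for the prompt, adding the URL when there is one."""
    header = f"File:{partition.get('source_name', 'unknown')}"
    url = partition.get("source_url")
    if url:
        # Extra value so the model can cite the page instead of the static name.
        header += f";Url:{url}"
    return f"==== [{header}]:\n{partition['text']}\n"

def build_facts(partitions: list[dict]) -> str:
    """Concatenate all facts into the block that is placed in the prompt."""
    return "".join(format_fact(p) for p in partitions)
```

With the URL present in each fact header, the prompt can ask for a reference to the `Url` value at the end of each paragraph, and the indexing pipeline stays untouched.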