[Feature Request]: Number citations sequentially in the CitationQueryEngine.

jimmarshall87 commented 1 year ago

Feature Description

Number citations sequentially in the CitationQueryEngine.

Reason

At present, CitationQueryEngine provides citation references to source nodes used to generate the answer. However, since not all nodes are necessarily used to answer the question, this can lead to output like the following.

The quick brown fox [5] jumped over the lazy dog [2].

Presenting these results to a user then produces this rather strange format:

The quick brown fox [5] jumped over the lazy dog [2].
References:
[5] Source document ABC
[2] Source document DEF

It would be much better, and more intuitive from a presentation perspective, for citations to be numbered sequentially, starting from [1]. This would result in something like:

The quick brown fox [1] jumped over the lazy dog [2].
References:
[1] Source document ABC
[2] Source document DEF

Value of Feature

More intuitive
More in line with common citation techniques

logan-markewich commented 1 year ago

Hmm, the tricky part with this is we are numbering the chunks

So, the LLM could see chunks 1-3 in one request, and chunks 4-6 in another request as it builds the answer

So in total, it might have read chunks 1 through 6, but only cited chunks 5 and 2.

And since we are relying on the LLM to write these citations, the only way to reset them to be sequential would be to use a regex and find/count/replace them, which feels a little... janky?
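For illustration, the regex find/count/replace idea could look roughly like the sketch below. It is not part of CitationQueryEngine; the function name `renumber_citations` and the `sources` mapping (original citation number to source label) are hypothetical, and it assumes citations always appear as `[n]` in the answer text.

```python
import re

# Matches citation markers such as [5]; assumes the LLM always uses this form.
CITATION = re.compile(r"\[(\d+)\]")


def renumber_citations(text, sources):
    """Renumber [n] markers sequentially, in order of first appearance.

    `sources` is a hypothetical dict mapping the original citation number
    to a source label. Returns the rewritten text and the reference list.
    """
    mapping = {}  # original number -> new sequential number

    def replace(match):
        old = int(match.group(1))
        if old not in sources:
            # Not a known citation (e.g. unrelated bracketed number): keep it.
            return match.group(0)
        if old not in mapping:
            mapping[old] = len(mapping) + 1
        return f"[{mapping[old]}]"

    new_text = CITATION.sub(replace, text)
    # Dict insertion order follows first appearance, so new numbers ascend.
    references = [(new, sources[old]) for old, new in mapping.items()]
    return new_text, references


text = "The quick brown fox [5] jumped over the lazy dog [2]."
sources = {2: "Source document DEF", 5: "Source document ABC"}
new_text, references = renumber_citations(text, sources)
print(new_text)
for number, label in references:
    print(f"[{number}] {label}")
```

The weak spot is the one that makes it feel janky: any bracketed number in the answer that happens to match a source number gets treated as a citation.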

jimmarshall87 commented 1 year ago

Yeah exactly! I was thinking of modifying the class myself to do just that but it is super-janky. I figured someone else in the community much smarter than me would come up with a better way of doing it...!

Maybe the output from the LLM could be handled in a more structured form (e.g. JSON) rather than inserting the citations into the text right away (i.e. keep the text and references separate somehow). The engine could then build up a list of citations while the refining process is going on and insert the references at the end (again, how is TBD). It would be slightly cleaner than retrospectively searching and replacing / renumbering.

My guess is this class isn't heavily used, but it offers huge benefits in terms of providing traceability and hallucination avoidance - very important in terms of helping users build confidence in using generative AI for business activities. Already what you have here is great though, kudos!

logan-markewich commented 1 year ago

Yea building a more structured engine would be super nice.

Funny enough, this engine got published like a week before the openai function api came out 😅

I could see maybe a new query engine (or even just a new response synthesizer) that outputs pydantic objects instead of text, for more structured responses 🤔

jimmarshall87 commented 1 year ago

I noticed this today: https://python.langchain.com/docs/use_cases/question_answering/how_to/qa_citations which seems to provide a more structured approach.

logan-markewich commented 1 year ago

Yea that's somewhat along the lines of what I was thinking of doing. Although tbh, I do still like the numbered citation approach. This appears to be relying on the LLM to also generate the citation (which can be unreliable, and also uses more tokens)

jimmarshall87 commented 1 year ago

Indeed - fuzzy match also could be unreliable. I guess there is probably a whole startup product in providing traceability from LLMs and doing this well…!

pjerryhu commented 1 year ago

Hey folks, I'm having some trouble reproducing this issue. Are there any good examples?

I tried using Semantic Scholar, but without success: https://github.com/emptycrown/llama-hub/tree/03cf2fc1c9e2e2073a6af2d21d8d97fb2f14374b/llama_hub/semanticscholar

jimmarshall87 commented 1 year ago

I'm not familiar with that particular notebook but took a quick look and I think you can actually see it in the section below. In this case, presumably (I haven't really dug into it) only the 1st and 5th search results from the index were used to generate the answer, so you only see [1] and [5] against the various bullet points. The point I was making in the ticket description above is that it seems a bit abnormal not to have contiguous citations. In my mind these should be listed as [1] [2], as otherwise the reader is left thinking "what happened to [2], [3] and [4]".

Per the discussions above, it's sort of inherent in the way the whole thing works, and I came to the conclusion that a better way of handling it would be to provide the output from the query engine in a more structured format, where the citations are stored separately from the text but with the metadata needed to insert them retrospectively (e.g. character position, or the full text of the sentence to which each refers). Then the calling party could choose how to number/label them. It really is the icing on the cake rather than a big issue, but I do find the present approach a bit confusing for the lay reader who is not familiar with how the process actually works underneath.
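To make that concrete, here is a minimal sketch of what such a structured result might look like, with the caller deciding the numbering at render time. Everything in it is hypothetical (`Citation`, `StructuredAnswer`, `render`, and the use of a character offset plus a source identifier); nothing like this exists in the query engine today as far as this thread establishes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    position: int   # character offset in the text where the marker belongs
    source_id: str  # identifier of the cited source (hypothetical field)


@dataclass
class StructuredAnswer:
    text: str                 # answer text with no citation markers in it
    citations: List[Citation]


def render(answer):
    """Insert sequential [n] markers and build the matching reference list."""
    numbers = {}  # source_id -> number, assigned in order of appearance
    pieces = []
    cursor = 0
    for citation in sorted(answer.citations, key=lambda c: c.position):
        number = numbers.setdefault(citation.source_id, len(numbers) + 1)
        pieces.append(answer.text[cursor:citation.position])
        pieces.append(f" [{number}]")
        cursor = citation.position
    pieces.append(answer.text[cursor:])
    references = [f"[{n}] {source_id}" for source_id, n in numbers.items()]
    return "".join(pieces), references


text = "The quick brown fox jumped over the lazy dog."
answer = StructuredAnswer(
    text=text,
    citations=[
        Citation(text.index("."), "Source document DEF"),
        Citation(text.index(" jumped"), "Source document ABC"),
    ],
)
rendered, references = render(answer)
print(rendered)
print("References:")
print("\n".join(references))
```

Because the markers are only created in `render`, a different caller could just as easily label them by author and year, or as footnotes, without touching the engine.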

dosubot[bot] commented 10 months ago

Hi, @jimmarshall87,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The issue was opened as a feature request to number citations sequentially in the CitationQueryEngine for a more intuitive and common citation format. The discussion in the comments revolves around the challenges of implementing this feature, including the need for a more structured approach to handling the output from the LLM and the potential unreliability of the current citation generation process. There is also mention of a related example and some trouble reproducing the issue.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out.

tompollard commented 6 months ago

Hi @jimmarshall87, I'm looking to implement sequential citations in the way you describe. It sounds like the feature wasn't added to CitationQueryEngine. Did you find a solution?

jimmarshall87 commented 6 months ago

No, I didn't spend much more time on this, but I would also welcome a solution to it!

tompollard commented 6 months ago

@jimmarshall87 thanks for your quick reply. I'll post here if I get around to implementing something. Currently leaning towards the "janky" regex approach!

AIMusoED commented 6 months ago

Forgive my noob comment, but I'm only just starting to explore these tools. How do you get the sources to present at the bottom of the chat in the form of footnotes?

hpathak-godaddy commented 5 months ago

How do we get only the cited source nodes? I am able to see all the source nodes, but how can one filter them? I know one solution is regex, but it would be nice to have cited_nodes as an attribute of the response.
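For reference, the regex workaround mentioned here might be sketched as follows. This is not a library API: `cited_sources` is a hypothetical helper, plain strings stand in for the source nodes, and it assumes that a citation `[n]` in the answer refers to the n-th source in the list (1-based).

```python
import re


def cited_sources(response_text, source_nodes):
    """Return (number, source) pairs for sources cited as [n] in the text.

    Assumes [n] refers to the n-th entry of `source_nodes`, counting from 1.
    Results are ordered by first appearance, with duplicates removed.
    """
    cited = []
    for match in re.finditer(r"\[(\d+)\]", response_text):
        index = int(match.group(1))
        # Ignore numbers that do not correspond to any source.
        if 1 <= index <= len(source_nodes) and index not in cited:
            cited.append(index)
    return [(index, source_nodes[index - 1]) for index in cited]


nodes = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E", "chunk F"]
answer = "The quick brown fox [5] jumped over the lazy dog [2]."
for number, node in cited_sources(answer, nodes):
    print(number, node)
```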

Ciaran0 commented 4 months ago

How do we get only the cited source nodes? I am able to see all the source nodes, but how can one filter them? I know one solution is regex, but it would be nice to have cited_nodes as an attribute of the response.

I would love this as well and also have a more structured response format like pydantic.
