sabszh / EER-chatbot-UI

Repo for chatbot made for the EER group
https://EERChat.ploomberapp.io

How do RAGs / LLMs understand time? #7

Open lightnin opened 1 month ago

lightnin commented 1 month ago

It seems like time is an important aspect of what we're interested in here.

sabszh commented 1 month ago

Idea to test out:

lightnin commented 1 month ago

Here's a relevant paper mentioning a benchmark for understanding time:

TRAM: Benchmarking Temporal Reasoning for Large Language Models
Yuqing Wang (Stanford University, ywang216@stanford.edu), Yun Zhao (Meta Platforms, Inc., yunzhao20@meta.com)

Abstract: Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope, characterized by a lack of standardized benchmarks that would allow for consistent evaluations across different studies. In this paper, we introduce TRAM, a temporal reasoning benchmark composed of ten datasets, encompassing various temporal aspects of events such as order, arithmetic, frequency, and duration, designed to facilitate a comprehensive evaluation of the temporal reasoning capabilities of large language models (LLMs). We conduct an extensive evaluation using popular LLMs, such as GPT-4 and Llama2, in both zero-shot and few-shot learning scenarios. Additionally, we employ BERT-based models to establish the baseline evaluations. Our findings indicate that these models still trail human performance in temporal reasoning tasks. It is our aspiration that TRAM will spur further progress in enhancing the temporal reasoning abilities of LLMs. Our data is available at https://github.com/EternityYW/TRAM-Benchmar

https://arxiv.org/pdf/2310.00835
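
For concreteness, here's a rough sketch of what a zero-shot probe in the spirit of TRAM's ordering questions might look like. The example question, the multiple-choice format, and the `build_zero_shot_prompt` helper are all made up for illustration; they are not taken from the actual TRAM datasets or code.

```python
# Hypothetical zero-shot temporal-ordering probe, TRAM-style.
# The question and options below are invented for illustration.

example = {
    "question": "Alice boiled water, then poured it over the tea leaves, "
                "and finally drank the tea. What did she do second?",
    "options": ["Boiled water", "Poured water over the tea leaves", "Drank the tea"],
    "answer": "B",
}

def build_zero_shot_prompt(item: dict) -> str:
    """Format a multiple-choice temporal question as a plain zero-shot prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [item["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

if __name__ == "__main__":
    prompt = build_zero_shot_prompt(example)
    print(prompt)
    # The prompt would then be sent to the model under test (GPT-4, Llama 2, ...)
    # and the returned letter compared against example["answer"].
```

The benchmark's point is that even on questions this simple-looking (ordering, duration, date arithmetic), current models still trail humans.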

lightnin commented 1 month ago

This paper seems to suggest that LLMs are still pretty bad at temporal reasoning right now.

Large Language Models Can Learn Temporal Reasoning

While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they are not without their flaws and inaccuracies. Recent studies have introduced various methods to mitigate these limitations. Temporal reasoning (TR), in particular, presents a significant challenge for LLMs due to its reliance on diverse temporal expressions and intricate temporal logic. In this paper, we propose TG-LLM, a novel framework towards language-based TR. Instead of reasoning over the original context, we adopt a latent representation, temporal graph (TG) that facilitates the TR learning. A synthetic dataset (TGQA), which is fully controllable and requires minimal supervision, is constructed for fine-tuning LLMs on this text-to-TG translation task. We confirmed in experiments that the capability of TG translation learned on our dataset can be transferred to other TR tasks and benchmarks. On top of that, we teach LLM to perform deliberate reasoning over the TGs via Chain of Thought (CoT) bootstrapping and graph data augmentation. We observed that those strategies, which maintain a balance between usefulness and diversity, bring more reliable CoTs and final results than the vanilla CoT distillation.

https://arxiv.org/abs/2401.06853
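
To make the temporal-graph idea more concrete, here's a toy sketch (not the paper's actual code) of reasoning over an already-extracted temporal graph. The `Event` class, the example events, and `happened_before` are all hypothetical, and the hard step that TG-LLM actually fine-tunes an LLM to do, translating raw text into such a graph, is assumed to have already happened.

```python
# Toy sketch: answer an ordering question by comparing timestamps in a small
# temporal graph, rather than asking a model to reason over raw text.
# All events and names below are invented for illustration.

from dataclasses import dataclass

@dataclass
class Event:
    name: str
    start: int  # year the event began
    end: int    # year the event ended

# Hypothetical output of the text-to-temporal-graph translation step.
temporal_graph = {
    "alice_studied_in_paris": Event("Alice studied in Paris", 2001, 2004),
    "alice_worked_at_acme":   Event("Alice worked at Acme", 2005, 2010),
}

def happened_before(graph: dict, a: str, b: str) -> bool:
    """True if event `a` ended no later than event `b` started."""
    return graph[a].end <= graph[b].start

if __name__ == "__main__":
    q = ("alice_studied_in_paris", "alice_worked_at_acme")
    print(f"Did '{q[0]}' happen before '{q[1]}'?",
          happened_before(temporal_graph, *q))
```

The appeal of this intermediate representation is that once events are pinned to explicit times, the ordering/duration logic becomes trivial; the open question the paper tackles is getting the model to produce the graph reliably in the first place.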