tanyuqian / knowledge-harvest-from-lms

ACL 2023 (Findings) - BertNet: Harvesting Knowledge Graphs from Pretrained Language Models
https://lmnet.io

Using Generative Models #5

Open phiwi opened 10 months ago

phiwi commented 10 months ago

Also from my side big thanks for sharing your work!

In your publication you state that (with minor amendments) one could also use generative models to extract KG tuples. Is this already available somewhere in your implementation? And if not, would you mind providing a rough sketch of how one should do it?

xiyan128 commented 10 months ago

Hi, thanks for the question. To answer your first question -- note that the implementation in this repo is only for BERT-like (masked) language models.

To reiterate the problem BertNet poses for autoregressive LMs: the objective is to extract a set of valid KG-style tuples from an autoregressive language model, subject to certain constraints.

Problem Statement
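
One way to formalize it (this is my own reading, not notation from the paper): given an autoregressive LM $p_\theta$ and, for each relation $r$, a set of prompt templates $\mathcal{P}_r$ that verbalize the relation, harvest every tuple whose LM-assigned consistency clears a threshold $\delta$:

$$
\mathrm{score}(h, r, t) = \frac{1}{|\mathcal{P}_r|} \sum_{p \in \mathcal{P}_r} \log p_\theta\big(t \mid p(h)\big),
\qquad
\mathcal{K} = \big\{(h, r, t) : \mathrm{score}(h, r, t) \ge \delta\big\},
$$

where $p(h)$ is the template filled with the head entity. The constraints mentioned above (entity well-formedness, tuple structure) restrict which $(h, t)$ strings are admissible.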

Sketch

Since I've been away from this problem for a while, I'm happy to share the intuitive, high-level sketch I put together earlier.

1. Approximating the Search Space: Sampled Beam Search. The space of candidate tuples is far too large to enumerate, so diverse candidate completions are drawn with a sampled (stochastic) beam search rather than exhaustive decoding.

2. Structural Compliance: KG-Style Tuples with Constrained Search. The search is constrained (or its output filtered) so that every surviving candidate parses as a well-formed (head, relation, tail) tuple.

3. Probabilistic Integrity: the LM's Faith. Structurally valid candidates are ranked by the LM's own likelihood, keeping only the tuples the model itself assigns high confidence. A code sketch of all three steps follows this list.
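
To make this concrete, here is a minimal end-to-end sketch with HuggingFace transformers and GPT-2. Everything in it is a placeholder rather than what we actually ran: the prompt template, the (head, relation) pair, the regex filter standing in for real constrained decoding, and all hyperparameters.

```python
# Illustrative only: model choice, prompt template, filter, and
# hyperparameters are placeholders, not our actual implementation.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

# (1) Sampled beam search: draw diverse candidate completions instead of
# enumerating the (intractable) space of all tuples.
def sample_candidates(prompt, num_beams=8, max_new_tokens=8):
    inputs = tok(prompt, return_tensors="pt").to(device)
    out = lm.generate(
        **inputs,
        do_sample=True,              # beam *sampling*, not plain beam search
        num_beams=num_beams,
        num_return_sequences=num_beams,
        max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return [tok.decode(t, skip_special_tokens=True).strip() for t in new_tokens]

# (2) Structural compliance: a crude stand-in for constrained search that
# keeps only completions which look like a single entity mention.
def is_valid_tail(text):
    return bool(re.fullmatch(r"[A-Za-z][A-Za-z '-]{0,40}", text))

# (3) Probabilistic integrity: score a tuple by the LM's average
# log-likelihood of the tail tokens given the filled-in prompt.
@torch.no_grad()
def tail_log_likelihood(prompt, tail):
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + " " + tail, return_tensors="pt").to(device)
    labels = full.input_ids.clone()
    labels[:, :prompt_len] = -100    # mask the prompt; score only the tail
    return -lm(**full, labels=labels).loss.item()

head, relation = "Paris", "capital of"        # hypothetical query
prompt = f"{head} is the {relation}"          # hypothetical template
cands = {c.split(".")[0].split(",")[0].strip() for c in sample_candidates(prompt)}
scored = [(t, tail_log_likelihood(prompt, t)) for t in cands if is_valid_tail(t)]
for tail, ll in sorted(scored, key=lambda x: -x[1]):
    print((head, relation, tail), round(ll, 2))
```

In practice, the regex would be replaced by grammar- or trie-constrained decoding, and the score would be averaged over several paraphrased prompts per relation, as the MLM version of BertNet does.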


This is the sketch we experimented with earlier, and the results were reasonably good, although we lack a more rigorous evaluation because of the unique nature of the problem setting. Let me know if you are interested in the implementation details of this sketch!

phiwi commented 10 months ago

Hey Xiyan,

thank you very much for your extensive answer. I didn't expect such an in-depth theoretical explanation. From my understanding, and confirming my high-level assumptions, a combination of prompting and sampling with (constrained) beam search might lead to the desired results (triples similar to those from the MLM approach).

If you don't mind sharing your preliminary implementations, I would be very happy to draw inspiration from your ideas.