Question about the results

Hannibal046 commented 2 weeks ago

Hi, thanks for this insightful analysis about the impact of vec2text to retrieval system! I really enjoy the paper.

I have two short questions:

Does these two rows refer to the same model? If so, why they differ so much?
If I understand correctly, the sentence beam search is actually not a beam search algorithm because it doesn't involve a sequentially decision process and only take the topk at the each step. Do I understand this correctly?

ArvinZhuang commented 2 weeks ago

hi @Hannibal046, thanks for being interested in our work!

yes these two rows refer to the same model and same eval. The difference in the numbers might be due to different testing set (as the testing set is randomly split from the NQ validation set) and different numbers of beams.
yes the beam search used by vec2text is not a token-level beam search but a sentence level search.

Hannibal046 commented 2 weeks ago

Hi, thanks so much for the prompt response!

As for the second point, I am curious that in the actual implementation of vec2text, it seems like a sentence-level greedy top-k search? Because the output of iteration T only depends on the score of T regardless of scores of (T-1,T-2,....1).

ArvinZhuang commented 2 weeks ago

To be honest, Im not 100% sure how sbeam is implemented (the code is kind of complex to understand..). But I agree with you, it's not like traditional beam search where you keep track of the beam sequence, I think the sbeam here just cares about the best generation, this somewhat makes sense? since the goal of vec2text is generating the text with the embedding that has the most similar cos sim to the target embedding.

Maybe @jxmorris12 can answer our question? :)

Hannibal046 commented 1 week ago

Hi, @ArvinZhuang I might got the catch here. The default value of return_best_hypothesis is False, which means the sbeam would be decided by the beam score rather than cosine sim between target text. According the paper, this would have a huge impact.

https://github.com/ielab/vec2text-dense_retriever-threat/blob/13f1288a36aa3b93d92dbfdeff9932449eb5de6c/vec2text/trainers/corrector.py#L449-L452

You can confirm this by adding print(trainer.return_best_hypothesis) in the eval_v2t.py. Could you please confirm this on your side? I have some troubles install the required packages. Specifically, I would encounter problems like:

ArvinZhuang commented 1 week ago

@Hannibal046 Yeah on my side trainer.return_best_hypothesis defaults to False as well.. seems it is hard coded in the trainer? Although, I think the figure you show here is the impact of adding feedback embeddings during iterations?

For your error, might be something related to numpy version? https://github.com/facebookresearch/habitat-sim/issues/2413

Hannibal046 commented 1 week ago

Oh, it is my fault to misunderstand this figure. It is indeed hard coded in the trainer, but I believe this should be a bug for the following reasons:

in the paper, it says it uses cos sim to determine the sbeam process
after resolving the environment problem (appreciate!), I test the model you provided with the following command:
```
python3 eval_v2t.py \
--model_dir ielabgroup/vec2text_gtr-base-st_corrector \
--batch_size 16 \
--steps 50 \
--beam_width 4
```
And I got: return_best_hypothesis BLEU F1 EM Cos

False 90.3 97.0 75.2 0.997

True 95.7 98.8 86.0 0.998

And I got:	return_best_hypothesis	BLEU	F1	EM	Cos
False	90.3	97.0	75.2	0.997
True	95.7	98.8	86.0	0.998

ArvinZhuang commented 1 week ago

Hi @Hannibal046, the results look promising, I think you got a good catch! Also, I think @jxmorris12 was using beam_width 8 in the paper? That might further improve the scores.

I might need to update the paper if @jxmorris12 confirms this.

ArvinZhuang commented 1 week ago

Hi @Hannibal046 I set return_best_hypothesis to True and use beam_width 8, for jxm/gtr__nq__32__correct checkpoint. I got:

eval_bleu_score 98.22788593452714
eval_token_set_f1 0.9945816496220021
eval_exact_match 0.944
eval_emb_cos_sim 0.9980175495147705

which are closer to the original paper now.

Hannibal046 commented 1 week ago

Thanks for verification! It looks amazing! EM even gets 0.944!

ArvinZhuang commented 1 week ago

Thank you for pointing out the issue @Hannibal046 , I am re-running all the eval now and will update the paper accordingly :)

Hannibal046 commented 1 week ago

You are welcome! Also, I want to confirm another thing about the optimization, for this model ielabgroup/vec2text_gtr-base-st_corrector, which is the right optimization arguments?

Adam, LR 2e-4, warmup with decay
AdamW, LR 0.001, constant_with_warmup

The former is from the paper and the latter is from the README.

I write a project called nanoV2T, which basically tries to reproduce the vec2text and strip off the reliance of HF-Trainer with simple and clean code. Currently, it could support stage1 and stage2 training, as well as sbeam inference, but the result is not completely aligned with the original repo. I am trying to figure out. If you are interested, go check it!

ArvinZhuang commented 1 week ago

@Hannibal046 I was using AdamW, LR 0.001, constant_with_warmup Your project is really cool!

Hannibal046 commented 1 week ago

Hi, @ArvinZhuang , I have successfully reproduce the results of vec2text with some inference optimization method (500 samples with 8 beams, 50 iters and 8 GPUs in about 6 minutes). If you are interested, go check it.

ArvinZhuang commented 1 week ago

@Hannibal046 That sounds good! The original inference code to do 2000 samples, 50 steps with beam 8 took me 4.5 hours with a single H100 GPU. I think if use your code that will be something around 3hours?

Btw, is your repo supports loading from our huggingface checkpoints? (ielabgroup/vec2text_gtr-base-st_corrector or jxm/gtr__nq__32__correct)

Hannibal046 commented 1 week ago

Hi, I test my code with 1000 samples, single A100, 50 steps, beam 8 and batch_size 64, it takes about 40mins, so I guess 2000 samples would cost around 1h (or 1h 10 mins). You could verify it with:

accelerate launch --num_processes 1 \
    v2t/inference.py \
        --draft_dir output/gtr_t5_nq_32_stage1/hyps \
        --generator_name_or_path Hannibal046/gtr_t5_nq_32_stage2 \
        --dataset_name_or_path jxm/nq_corpus_dpr \
        --embedder_name_or_path sentence-transformers/gtr-t5-base \
        --max_seq_length 32 --max_eval_samples 1000

For trainer-based checkpoint, such as jxm/gtr__nq__32__correct, we do have a simple workaround inference code. However, we couldn't achieve a very high exact match score (we only got 90). There are a lot of bugs about this transition, such as:

jxm/gtr__nq__32__correct could only generate 31 tokens rather than 32, because it count a special eos token
with jxm/gtr__nq__32__correct , each draft needs a <pad> token before the sentence.

I am not sure if there still exists something I overlooked, if you are interested, hope you could find it! You could verify jxm/gtr__nq__32__correct with:

## install vec2text package first
accelerate launch --num_processes 1   v2t/inference_from_trainer_ckpt.py    \
--draft_dir output/gtr_t5_nq_32_stage1/hyps  \
 --dataset_name_or_path jxm/nq_corpus_dpr   \
 --embedder_name_or_path sentence-transformers/gtr-t5-base \
--max_seq_length 32 --max_eval_samples 500

Hannibal046 commented 1 week ago

Hi, I find a worrying fact: the train set and dev set overlaps a lot....

from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm import tqdm
from functools import partial

def tokenize_fn(examples,tokenizer):
    tokenized_examples = tokenizer(
        examples['text'],
        max_length=32,
        truncation=True,
        add_special_tokens=False,
    )
    truncated_input_text = tokenizer.batch_decode(tokenized_examples['input_ids'],skip_special_tokens=True,clean_up_tokenization_spaces=True)
    return {"text":truncated_input_text}

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base",use_fast=True)
dataset = load_dataset("jxm/nq_corpus_dpr")
dataset = dataset.map(
    partial(tokenize_fn,tokenizer=tokenizer),
    batched=True,
    num_proc=32,
)
train_set = set(dataset['train']['text'])

test_size = 1000
test_text = list(dataset['dev']['text'])
test_sets = [set(test_text[idx:idx+test_size]) for idx in range(0,len(test_text),test_size)]
overlap = []
for test_set in tqdm(test_sets):
    overlap.append(len(train_set & test_set))
print(sum(overlap)/len(overlap))

And it gives:

505.83882352941174

jxmorris12 commented 1 week ago

Hmm this is a little bit worrying. I'm a little confused at what the metric is that you're logging -- are you claiming that on average, 50% of the test set overlaps with the training set? Why don't you just compute the total overlap?

Anyway, I don't think it's that big of a deal. We measure results on lots of other datasets (all of BEIR) and see similar (although slightly lower) performance. And I think our results would still be interesting if we only ran on training points.

If you want to look into this further, I got the data from DPR: I just downloaded it and uploaded to HuggingFace. Do they have train-test overlap in their experiments as well? I think that would be somewhat more notable, although DPR is relatively outdated by now.

jxmorris12 commented 1 week ago

Also can you confirm for posterity: you were able to reproduce the paper's results by setting the beam width to 8 and return_best_hypothesis to true? I can try and explain how sequence-level beam search works better if you want.

By the way, I trained this model for a little longer so it's expected that the results are better than those in the paper; this is what i got too

ArvinZhuang commented 1 week ago

Hi @jxmorris12 , thanks for the information. How did you process dpr dataset? I think its originally a QA dataset that has relevant query/passage pairs? Did you take passages in the training set and passages in the dev set to form v2t training/valid sets? Maybe the issue is the dpr dataset only has a non-overlap between training queries and dev queries? it is making sense that one passage can answer a query in the training set and also can answer one in the dev set, so that passages can have overlap.

jxmorris12 commented 1 week ago

Yeah I agree that makes sense from a retrieval perspective, but I think it's still bad ML practice if they're using the same passages in a test-time corpus as at train-time.

ArvinZhuang commented 1 week ago

@jxmorris12 This is an interesting point since the standard practice in the information retrieval community now is fixing the corpus the same for both training and test queries. This is true for all the IR benchmark datasets such as MS MARCO and BEIR. I believe (though I am not 100% sure) that many of these datasets have queries in the test set that have a relevant passage appearing in the training set as well. Maybe it is a good time to investigate how big an impact this overlap has on these IR datasets. :)

Anyway, I'm going to re-evaluate vec2text on the NQ validation passages that are not in the training set. Let's see how it goes.

jxmorris12 commented 1 week ago

🤯 Let me know!!

ArvinZhuang commented 1 week ago

Hi @jxmorris12 @Hannibal046. here is what I did for removing dev points that in train set:

def load_nq_dpr_corpus() -> datasets.Dataset:
    pass_dataset = datasets.load_dataset("jxm/nq_corpus_dpr")

    # filter out the dev set from the train set
    train_set = set(pass_dataset['train']['text'])
    dev_set = set(pass_dataset['dev']['text'])
    print("overlapped train and dev for nq:", len(train_set & dev_set) / len(dev_set))  # 0.5057715760181187
    dev_set_no_overlap = list(dev_set.difference(train_set))
    dev_set_no_overlap = Dataset.from_dict({'text': dev_set_no_overlap})
    pass_dataset["dev"] = dev_set_no_overlap

    train_set = set(pass_dataset['train']['text'])
    dev_set = set(pass_dataset['dev']['text'])
    print("overlapped train and dev for nq after filter:", len(train_set & dev_set) / len(dev_set))  # 0.0

return pass_dataset

now the scores become:

eval_bleu_score 97.30699922781523
eval_token_set_f1 0.9918264593562687
eval_exact_match 0.922
eval_emb_cos_sim 1.0

Only slightly lower than before. So I think we can relief now? :)

Hannibal046 commented 1 week ago

Nice job! Thanks @jxmorris12 for this innovative work and thanks @ArvinZhuang for insightful analysis!

Hannibal046 commented 1 week ago

Also can you confirm for posterity: you were able to reproduce the paper's results by setting the beam width to 8 and return_best_hypothesis to true? I can try and explain how sequence-level beam search works better if you want.

By the way, I trained this model for a little longer so it's expected that the results are better than those in the paper; this is what i got too

I could confirm this.

ielab / vec2text-dense_retriever-threat

Question about the results #1