Hi @lorenzkuhn,
I wanted to bring to your attention a potential error in the computation of the P(True) baseline, unless I have misunderstood something here.
Currently, in the code snippet, only the first `len(tokenized_base_prompt)` targets are set to -100: https://github.com/lorenzkuhn/semantic_uncertainty/blob/27adbf0dc1bf056c771c205d89c2a79cbd82dc3a/code/get_prompting_based_uncertainty.py#L108-L113

However, this does not ignore the entire context when computing the NLL loss, since `prompt_true` also includes the `few_shot_prompt` before the `base_prompt`: https://github.com/lorenzkuhn/semantic_uncertainty/blob/27adbf0dc1bf056c771c205d89c2a79cbd82dc3a/code/get_prompting_based_uncertainty.py#L105-L106

Not masking the entire context in the NLL loss computation could therefore lead to inaccurate P(True) scores. Could you please provide some insights or clarification on this matter?
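For concreteness, here is a minimal sketch of the masking I would have expected, assuming `prompt_true` is built as `few_shot_prompt + base_prompt` followed by the answer token, and that the context's tokenization is a prefix of the full prompt's tokenization (variable names follow the linked snippet; the surrounding code is paraphrased rather than copied, and the prompts and model below are toy stand-ins):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy stand-ins for the script's prompts, for illustration only.
few_shot_prompt = 'Q: What is 2 + 2? Proposed answer: 4. Is the proposed answer true? True\n'
base_prompt = 'Q: What is the capital of France? Proposed answer: Paris. Is the proposed answer true?'
prompt_true = few_shot_prompt + base_prompt + ' True'

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')

tokenized_prompt_true = torch.tensor(tokenizer(prompt_true)['input_ids'])
target_ids_true = tokenized_prompt_true.clone()

# Mask the *entire* context (few-shot examples + base prompt), not just the
# first len(tokenizer(base_prompt)['input_ids']) positions, so that only the
# trailing ' True' token contributes to the NLL (HF ignores -100 labels).
n_context_tokens = len(tokenizer(few_shot_prompt + base_prompt)['input_ids'])
target_ids_true[:n_context_tokens] = -100

with torch.no_grad():
    loss_true = model(tokenized_prompt_true.unsqueeze(0),
                      labels=target_ids_true.unsqueeze(0)).loss
print(loss_true.item())  # NLL of ' True' given the full context
```

The key difference from the current snippet is that the mask length is computed from `few_shot_prompt + base_prompt` rather than from `base_prompt` alone.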
In addition, the current code only uses `n_samples_to_use = 2000` samples for the P(True) baseline. Are the experimental settings for P(True) different from those of the other methods? I don't recall reading any explanation of this in the paper.