huawei-lin / RapidIn

The implementation for the paper "Token-wise Influential Training Data Retrieval for Large Language Models" (accepted at ACL 2024).

word loss calculation #1

Open p1nksnow opened 1 month ago

p1nksnow commented 1 month ago

The grad_z function does not seem to support calculating the influence of a single word. There is return_words_loss in the function parameter, but this parameter is not used in the function body. Is this part of the code the final version?

p1nksnow commented 1 month ago

I am also confused about the hvp function. What part of the paper does this correspond to? Thanks for any helpful response.

huawei-lin commented 1 month ago

Sorry for the confusion. I removed the feature and have not added it back yet. The return_words_loss parameter should be "return_tokens_loss". When it is true, it works like this (with batch_size = 1):

tokens_grad_list = []
if return_tokens_loss:
    for i, token_loss in enumerate(loss):
        # Skip tokens that are masked out of the loss
        if label[i] == IGNORE_INDEX:
            tokens_grad_list.append(None)
            continue
        # Gradient of this token's loss w.r.t. all trainable parameters,
        # flattened into one vector; retain_graph so the computation graph
        # can be reused for the next token
        grads = torch.autograd.grad(token_loss, params, retain_graph=True)
        tokens_grad_list.append(torch.cat([g.reshape(-1) for g in grads]))

It calculates the gradient for each token whose label is not masked out. The params should contain all the trainable parameters.
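As a rough sketch (assuming a standard PyTorch model object; the name model is illustrative), params can be collected like this:

# Collect all trainable parameters of the model
params = [p for p in model.parameters() if p.requires_grad]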

The hvp function is for the influence function baseline we compared against. RapidIn itself does not use hvp.


Edit:

I am not sure whether backward can do the same thing:

loss[i].backward(retain_graph=True)  # retain the graph so later tokens can still backprop
# or, if model is a wrapper that exposes backward (e.g., a DeepSpeed engine):
model.backward(loss[i])
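A minimal sketch of this backward-based variant (assuming the same loss, label, params, and IGNORE_INDEX as above), for comparison rather than as a drop-in replacement; note that backward accumulates into .grad, so gradients must be zeroed between tokens:

tokens_grad_list = []
for i, token_loss in enumerate(loss):
    if label[i] == IGNORE_INDEX:
        tokens_grad_list.append(None)
        continue
    model.zero_grad()                       # clear previously accumulated grads
    token_loss.backward(retain_graph=True)  # reuse the graph across tokens
    tokens_grad_list.append(torch.cat([p.grad.reshape(-1) for p in params]))
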
p1nksnow commented 1 month ago

Thank you for your reply! I have one more question about the influence function:

The hvp function is for the influence function we compared.

Is it TracIn or something else?

huawei-lin commented 1 month ago

No, it is the Influence Function: Koh, Pang Wei, and Percy Liang. "Understanding Black-box Predictions via Influence Functions." International Conference on Machine Learning. PMLR, 2017.
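For context, the influence of a training point $z$ on a test point $z_{\text{test}}$ in that paper is

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta),$$

where $H_{\hat\theta}$ is the Hessian of the training loss at the trained parameters $\hat\theta$. An hvp routine computes the Hessian-vector products used to approximate $H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta)$ without forming the Hessian explicitly.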

p1nksnow commented 1 month ago

Could you provide a config_caching.json and the corresponding deepspeed config file showing how to use deepspeed multi-GPU parallelism?

huawei-lin commented 1 month ago

Our multi-GPU parallelism is not implemented with deepspeed, and the deepspeed integration in RapidIn has not been tested yet.

If you have multiple GPUs, multi-GPU parallelism should be enabled automatically; you will see a separate process on each GPU. You can also set n_threads to increase the number of threads per GPU and maximize GPU memory utilization.
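For illustration only (the exact schema of config_caching.json is not shown in this thread, so all surrounding fields are omitted; please refer to the example configs in the repo), the setting mentioned above would look something like:

{
    "n_threads": 4
}

Here 4 is an arbitrary example value; increase it until GPU memory is fully utilized.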