salesforce / fast-influence-functions


Questions about influence functions #20

Open acphile opened 1 year ago

acphile commented 1 year ago

Hi, thanks for your nice work. I am planning to use influence functions in other settings and I have two questions:

  1. In general, do "helpful" data points with respect to an evaluation test point share the same label, or do "harmful" data points with respect to an evaluation test point have a different label?
  2. Is the magnitude of the influence scores related to whether the lower layers of the PLM are frozen during training? In my preliminary trial, I find that when the model is fine-tuned without any layer being frozen, the influence scores seem to be quite small (most around 1e-9).

HanGuo97 commented 1 year ago

Hi, thanks for asking these great questions!

  1. I found section 2.3 in [1] very illuminating when it comes to this.
  2. It has been observed that the "numerical values" of influence scores are usually not correct. Instead, their rankings do seem to be more useful [2]. In that sense, I would pay more attention to the rankings instead of numerical scores.

[1] https://arxiv.org/pdf/1703.04730.pdf
[2] https://arxiv.org/pdf/1905.13289.pdf
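
To make the "rankings over raw values" point concrete, here is a minimal sketch that compares two sets of scores by rank rather than by magnitude. The score lists are made up for illustration (they are not outputs of this repo), and `ranks`, `spearman`, and `top_k` are hypothetical helper names. The Spearman formula used here assumes there are no tied scores.

```python
def ranks(scores):
    """Rank of each score (0 = smallest). Ties are not handled specially."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


def top_k(scores, k):
    """Indices of the k largest scores, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical influence scores for five training points (illustrative only).
approx = [3e-9, -1e-9, 7e-9, 0.5e-9, -4e-9]
# Same scores on a very different numerical scale.
rescaled = [s * 1e6 for s in approx]

print(spearman(approx, rescaled))            # rank agreement between the two lists
print(top_k(approx, 2), top_k(rescaled, 2))  # same top-2 indices either way
```

The two lists differ by six orders of magnitude, yet the rank correlation and the top-k indices are identical, which is the sense in which tiny values such as 1e-9 need not be a problem by themselves.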

acphile commented 1 year ago

Thanks for your answer. Regarding the score issue, to what extent can we leverage the numerical scores? For example, could a data point whose score satisfies |x| < eps (close to zero) be viewed as non-influential?

HanGuo97 commented 1 year ago

It is task-dependent, I guess. For example, consider if you arbitrarily scale the loss by a constant; would that change the influence values? (I actually don't know.)
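
One way to probe the scaling question on a toy problem: the sketch below uses a one-parameter least-squares model and the definition from [1] (minus the test gradient, times the inverse Hessian, times the training gradient). Everything here is an assumption made for illustration: the data, the `influence` helper, and the scaling constant `c`. There is no damping or regularization term, so the result may not carry over to setups where such a term is held fixed while the loss is rescaled.

```python
def influence(theta, train_point, test_point, data, c=1.0):
    """Influence of train_point on the loss at test_point for the toy loss
    c * 0.5 * (theta * x - y) ** 2, using -grad(test) * H^{-1} * grad(train)."""

    def grad(point):
        x, y = point
        return c * (theta * x - y) * x

    # Hessian of the mean training loss w.r.t. the single parameter theta.
    hessian = c * sum(x * x for x, _ in data) / len(data)
    return -grad(test_point) * grad(train_point) / hessian


# Hypothetical training data and test point (illustrative only).
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
test_point = (1.5, 2.0)

# Least-squares minimizer; it does not depend on the scaling constant c.
theta = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

base = influence(theta, data[1], test_point, data, c=1.0)
scaled = influence(theta, data[1], test_point, data, c=10.0)
print(scaled / base)  # ratio of the scaled-loss influence to the original
```

In this toy case both gradients pick up a factor of `c` and the inverse Hessian a factor of `1/c`, so the influence values scale by `c` while their ordering across training points stays the same.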

Practically speaking, how about plotting things on a histogram and observing which ones are outliers?
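
A rough sketch of that histogram-and-outliers idea, using only the standard library. The scores are made up, and `text_histogram` and `mad_outliers` are hypothetical helpers. The outlier rule (modified z-score based on the median absolute deviation, with a cutoff of 3.5) is one common heuristic chosen here as an assumption, not something prescribed by this repo.

```python
import statistics


def text_histogram(scores, bins=5):
    """Count how many scores fall into each of `bins` equal-width bins."""
    lo, hi = min(scores), max(scores)
    width = ((hi - lo) / bins) or 1.0  # avoid zero width if all scores are equal
    counts = [0] * bins
    for s in scores:
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def mad_outliers(scores, threshold=3.5):
    """Indices whose modified z-score (based on the MAD) exceeds `threshold`."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    if mad == 0:
        return []
    return [i for i, s in enumerate(scores)
            if 0.6745 * abs(s - med) / mad > threshold]


# Hypothetical influence scores: most near 1e-9, one much larger.
scores = [1e-9, -2e-9, 0.5e-9, 3e-9, -1e-9, 2e-9, 8e-7, -1.5e-9]

for count in text_histogram(scores):
    print("#" * count)           # crude text histogram, one row per bin
print(mad_outliers(scores))      # indices flagged as outliers
```

Because the rule is relative to the median and spread of the scores themselves, it does not need a fixed `eps` and is unaffected by the overall scale of the scores.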