salesforce / fast-influence-functions

BSD 3-Clause "New" or "Revised" License

Issue with continuous-value-predicting LSTM results #18

Closed jdiaz4302 closed 2 years ago

jdiaz4302 commented 2 years ago

As a first step in using these tools, I am trying to get training set influence for a small LSTM (~1000 weights) and toy-sized data set (train_n and test_n = 100).
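For reference, a rough sketch of the kind of model I'm working with (one LSTM layer with hidden size 10 followed by a dense layer, 2 input variables; simplified, so details may differ from my fork):

```python
import torch.nn as nn

class SmallLSTM(nn.Module):
    # Toy continuous-value regressor: one LSTM layer followed by a dense head.
    def __init__(self, n_inputs=2, hidden_size=10):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, n_inputs); predict from the last hidden state.
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])
```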

After making only minor adjustments (i.e., a more explicit call to mse_loss for my LSTM workflow, and changing the expected input structure from Dict[str, torch.Tensor] to [torch.Tensor, torch.Tensor], i.e. [x, y]), I can get results by running the following (a rough sketch of the loss adaptation is shown after the call):

```python
import influence_utils.nn_influence_utils

# 0 = n_gpu, 'cpu' = device; the remaining positional arguments follow the
# fork's adapted signature.
influence_utils.nn_influence_utils.compute_influences(
    0,
    'cpu',
    model,
    [x_test[[46]], y_test[[46]]],
    train_loader,
    train_loader,
    s_test_num_samples=100)
```
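For context, the loss/batch adaptation in my fork amounts to something like this (a simplified sketch, not the exact code):

```python
import torch.nn.functional as F

def lstm_mse_loss(model, batch):
    # Each batch is [x, y] (two tensors) rather than the original
    # Dict[str, torch.Tensor] of transformer inputs.
    x, y = batch
    return F.mse_loss(model(x), y)
```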

The results are consistent between multiple calls to the function and closely match those of a different repo, but they correlate very poorly with leave-one-out training results.

(figure: fif_influence_results)
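The leave-one-out comparison is along these lines (a sketch; `train_model` here is a stand-in for my full training loop, and the test point is the same index 46 used above):

```python
import torch
import torch.nn.functional as F

def loo_scores(x_train, y_train, x_q, y_q, train_model):
    # train_model(x, y) is a placeholder for retraining the LSTM from scratch.
    base_model = train_model(x_train, y_train)
    with torch.no_grad():
        base_loss = F.mse_loss(base_model(x_q), y_q).item()

    scores = []
    for i in range(len(x_train)):
        keep = [j for j in range(len(x_train)) if j != i]
        model_i = train_model(x_train[keep], y_train[keep])
        with torch.no_grad():
            loss_i = F.mse_loss(model_i(x_q), y_q).item()
        # How much the test loss changes when example i is left out.
        scores.append(loss_i - base_loss)
    return scores

# e.g. loo = loo_scores(x_train, y_train, x_test[[46]], y_test[[46]], train_model)
```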

Do you know whether I am using this incorrectly, or whether there is a fixable reason why the implementation might perform poorly for an LSTM predicting continuous values?

If useful, my fork containing those minimal changes can be found here - https://github.com/jdiaz4302/fast-influence-functions

HanGuo97 commented 2 years ago

Great question. Empirically, I noticed that having some weight decay helps. This paper mentions a few details that might be useful for you.

jdiaz4302 commented 2 years ago

Thank you for the reply and reference. Unfortunately, I'm not seeing any improvement with different weight decay values (either in the training loop or when supplied to the compute_influences function).

I would also love to know whether, empirically, you've always found this weight-decay approach to resolve the issue, or whether it is sometimes unresolvable. The reference seems to say it can be unresolvable for large neural networks, but my example is rather small (one LSTM layer with hidden size 10 followed by a dense layer; the input sequence has 2 variables).

HanGuo97 commented 2 years ago

Interesting. When I said "weight decay", I was referring to the weight decay used in training the model itself. The intuition is that, very loosely speaking, l2-regularization makes the loss a bit more convex.
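Concretely, something like the following (the value is illustrative and would need tuning):

```python
import torch

# Weight decay (l2 regularization) applied while training the model itself,
# not an argument passed to compute_influences.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```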

Also, given that the model is rather small, you can debug this by "exactly" computing the influence values. That is, solve the Hessian exactly/numerically (without approximation), along with the other quantities, and then gradually relax the exactness. IIRC, this is how the original influence-functions paper analyzed the behavior of the algorithm. This would take quite a bit of time, unfortunately.
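For a model with only ~1000 weights on CPU, that exact check could look roughly like this (a sketch, not code from this repo; sign and scaling conventions differ across implementations, and `damping` stands in for the l2 term):

```python
import torch
import torch.nn.functional as F

def flat_grad(output, params, create_graph=False):
    grads = torch.autograd.grad(output, params,
                                create_graph=create_graph,
                                retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def exact_influences(model, x_train, y_train, x_q, y_q, damping=0.0):
    params = [p for p in model.parameters() if p.requires_grad]
    n = sum(p.numel() for p in params)

    # Exact Hessian of the mean training loss via double backprop
    # (one backward pass per row; fine for ~1000 parameters).
    g = flat_grad(F.mse_loss(model(x_train), y_train), params, create_graph=True)
    H = torch.stack([flat_grad(g[i], params) for i in range(n)])
    H = H + damping * torch.eye(n)

    # Gradient of the test loss at the query point.
    g_test = flat_grad(F.mse_loss(model(x_q), y_q), params)

    # s_test = H^{-1} grad_test, solved exactly instead of approximated.
    s_test = torch.linalg.solve(H, g_test)

    # Influence of each training example on the test loss,
    # -grad(z_i)^T H^{-1} grad(z_test), up to sign/scaling conventions.
    scores = []
    for i in range(len(x_train)):
        g_i = flat_grad(F.mse_loss(model(x_train[[i]]), y_train[[i]]), params)
        scores.append(-torch.dot(s_test, g_i).item())
    return torch.tensor(scores)
```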

jdiaz4302 commented 2 years ago

Thank you for the clarification on weight decay; that intuition seems sound.

I wanted to start with this very small model only because I can verify the results with leave-one-out training in a reasonable time (i.e., retraining a small model 100 times to check the influence answers). My goal is to get trustworthy answers so that I can use these methods on large models and data sets where I will not be able to retrain many times.

Would you be able to help get this small LSTM time-series example working? This could help me and my colleagues use these methods in our work and add another example to your repo.

HanGuo97 commented 2 years ago

Agree -- I think this is the right way to get started!

Unfortunately, my familiarity with time series is very limited, but contributions from others are welcome.

jdiaz4302 commented 2 years ago

Same as below, but I didn't actually close the issue.

jdiaz4302 commented 2 years ago

Okay, I will close the issue as not planned for now, but will submit a PR if we make any progress.