Questions - Githubissues

GreenWizard2015 commented 2 years ago

Hello.

First of all, thank you for the great articles and research. Currently, I'm working on software for people with disabilities and trying to apply your research in it. Unfortunately, I'm not very familiar with the area of mutual information estimation and I have a bunch of questions. I hope you provide some brief answers.

Can we reuse old samples? Can I collect some new samples with the current interface, but also leave samples collected with the old interface?
Can we reuse the old estimator or does it needs to be trained from scratch? If we reuse the old estimator, it can drastically boost convergence, but also can introduce a bias. We need to provide fast feedback to the user and can't train an estimator/interface for too long, so I trying to find some solutions to overcome this limitation.
In my project, I have fairly big and complex raw observations. I train a network that takes them and, in a supervised fashion, estimates the desired action. Can I use latent representation from that network as input for MIMI? The first network provides fairly good estimations, but they are not always intuitive, so I wanna apply MIMI and train policy to sample more intuitive actions from the estimations. I'm concerning that latent representation could provide data leakage and MIMI would be collapsed to a trivial solution. Obviously, I can just use other ways to reduce inputs, but they would require additional computations and may introduce other problems.
As I see, we are using 1 network to estimate scores of joint (x, y) pairs and n_mine_samp=32 networks to estimate scores of marginal pairs. Is it just the default implementation or have you tried separable estimators (f(x, y) = g(x) * h(y), so we process only 2N "items" instead of N N) and they failed? In my opinion, it may be much more efficient, but I don't know how crucial a role here is bias/variance. Theoretically, separable estimators cover all pairs, 32 N N, instead of just 32 N, so can be more restrictive. They have issues with bias/variance, but we are using 32 of them, so it may be not a problem. At the same time, we have access to all marginals, so may estimate tf.reduce_mean(tf.exp(shuffled_stats), axis=1) more precisely. What do you think about that? Of course, it would be better just to test it, but it requires some time and effort.

Best regards.

rddy commented 2 years ago

Learning from samples collected with an old interface (i.e., off-policy RL) would be a bit difficult in this setting. The problem is that the state of the MDP actually includes the user's internal model of the interface, so when you go back and sample old transitions from a previous interface, you will only get partial observations that do not include this aspect of the state. I think you can address this partial observability by using a recurrent neural network architecture for the policy and value functions that takes a history of observations and commands as input (instead of only the most recent observation and command). You would also need to use importance sampling to correct for the non-stationary state distribution in the replay buffer (see Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS).
You could indeed use a warm start to speed up training of the MI estimator, although I haven't tried it yet. If you find that this unduly biases the optimization, then you could try other ways to speed up optimization of the MI estimator, such as taking fewer gradient steps (this tends to be okay since all we typically care about are the relative values of the MI estimates for different interfaces).
That sounds like a promising way to do dimensionality reduction. I think it could actually work even in the case where the latent representation contains mostly information about the predicted action, and has discarded most other information that was originally in the high-dimensional command signal. Even though MIMI uses the MI between this latent and the state transition to compute a reward, the interface itself can be a function of the original command signal. Hence, you could initially train the interface via supervised learning, then fine-tune it using MIMI. MIMI would not necessarily converge to the same solution as the supervised pre-training, since the $\mathcal{I}(\mathbf{s}_t, f(\mathbf{x}_t))$ term (where $f$ is your pre-trained embedding model) is not necessarily maximized to begin with.
The code isn't super clear, but in this line we are actually reusing the same statistics network $T_{\phi}$ as in this earlier line, so it's just 1 network rather than 1+n_mine_samp networks. n_mine_samp just refers to the number of samples we use to compute a Monte Carlo estimate of the expectation in Equation 2.

Happy to discuss further! Also happy to provide more hands-on help with coding or setting up experiments.

GreenWizard2015 commented 2 years ago

The code isn't super clear, but in this line we are actually reusing the same statistics network Tϕ as in this earlier line, so it's just 1 network rather than 1+n_mine_samp networks. n_mine_samp just refers to the number of samples we use to compute a Monte Carlo estimate of the expectation in Equation 2.

My bad, I thought that each call of build_model would create a unique network, so we would have 32 + 1 + 1 networks. It would be greater if you add a notice to the build_model and specify that it creates a unique MLP per scope.

Learning from samples collected with an old interface (i.e., off-policy RL) would be a bit difficult in this setting. The problem is that the state of the MDP actually includes the user's internal model of the interface, so when you go back and sample old transitions from a previous interface, you will only get partial observations that do not include this aspect of the state. I think you can address this partial observability by using a recurrent neural network architecture for the policy and value functions that takes a history of observations and commands as input (instead of only the most recent observation and command). You would also need to use importance sampling to correct for the non-stationary state distribution in the replay buffer (see Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS).

I meant reusing samples only for MI training. As you wrote in the article, we must collect data, train MI and then adapt the interface (it doesn’t matter whether we are using RL or another approach.). You can somehow perform the adaptation stage without completely new data (offline algorithms, re-labeling old data with new rewards, synthetic data, etc.). The main bottleneck is MI training. It requires the active participation of the user in order to gather data about the new interface. Thus, the problem of effective use of data for MI training arises. Theoretically, only the "intuitiveness" of the transition (s, a) -> s` is important to us, so any transitions can be used, even from old interfaces. Am I correct or there are some restrictions on the data for MI?

rddy commented 2 years ago

The problem is that the intuitiveness of the transition (s, a, s') cannot be evaluated in isolation. For example, an intuitive interface for scrolling on a mobile phone could either involve swiping up to scroll up or swiping down to scroll up, but some kind of mixture of the two interfaces would be unintuitive. That being said, it might be possible to speed up MI estimation through warm starts or meta-learning.

GreenWizard2015 commented 2 years ago

I completely agree and also thought about the problem of "mirror" interfaces. However, I believe that people tend to do things they are used to. For example, if it is more convenient for a person to swipe up, and the interface requires a swipe down, then the person will swipe down with a smaller amplitude or other differences. A person cannot give absolutely independent feedback for each interface, he will remember the previous one and try to control the new one in the same way. If we do not have fully discrete variables, then the difference should be observed. Moreover, your article is based precisely on this assumption, therefore, I think, it is possible to identify more convenient actions and not just the interface. However, this task is more difficult and therefore it is possible that it cannot be solved in practice (requires more resources).

Thank you for your responses and for making it clear that there is no fundamental reason not to try to reuse the data.

If possible, I would be grateful if you suggest articles, resources, etc. on the use of AI to improve accessibility for people with disabilities.

rddy commented 2 years ago

Here are some projects that I think are cool and have potential in this space:

rddy / mimi

Questions #4