Wrong Gradients when TorchConnector is used with a batch

stfnmangini commented 3 years ago

Information

Qiskit Machine Learning version: 0.1.0
Python version: 3.8.8
Operating system: MacOS Big Sur

What is the current behavior?

If a CircuitQNN is used with a TorchConnector, the resulting PyTorch model computes incorrect gradients for the circuit parameters when it is evaluated on a batch of data rather than on a single sample.

Steps to reproduce the problem

Here is an example to reproduce the problem. I use the same circuit as in the tutorial (https://github.com/Qiskit/qiskit-machine-learning/blob/master/docs/tutorials/05_torch_connector.ipynb), with CircuitQNN and TorchConnector, to create a quantum neural network. I then evaluate the gradients of the parameters on a regression task with a trivial dataset consisting of 20 identical inputs and corresponding targets.

As a loss function, I use MSELoss with reduction="sum", and I evaluate the loss and its gradients in four different ways:

1. Manually evaluate the loss as (output-target).pow(2).sum() on the full dataset (consisting of 20 identical items)
2. Use PyTorch MSELoss on the same full dataset
3. Evaluate the loss for each item in the dataset separately, and sum the results in a for loop
4. Evaluate the loss for a single sample only

The gradients are then computed with loss.backward() and read from model2.weights.grad. Note that there is no optimizer step! I only evaluate the gradients, without updating the weights. All these methods should be fully equivalent, since the data are always the same and there is no seeding, ordering, or other source of variation. Note that while Methods 1, 2 and 3 use the full dataset of 20 samples, Method 4 uses only a single item, so its gradient is expected to be 20 times smaller (since we are using MSELoss(reduction="sum")).
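
The factor of 20 can be made explicit with a short derivation. This is only a sketch: the symbols f_θ (the model), (x, y) (the repeated sample), and N (the dataset size) are notation introduced here for illustration and do not appear in the code.

```latex
% N identical samples (x, y), sum-reduced squared error
L_{\text{batch}}(\theta)
  = \sum_{i=1}^{N} \lVert f_\theta(x) - y \rVert^2
  = N \,\lVert f_\theta(x) - y \rVert^2
  = N \, L_{\text{single}}(\theta)

% Differentiating both sides with respect to the weights
\nabla_\theta L_{\text{batch}}(\theta) = N \, \nabla_\theta L_{\text{single}}(\theta),
\qquad N = 20
```

So Methods 1, 2 and 3 should all return the same gradient, and that gradient should be 20 times the one from Method 4.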

### Imports
import numpy as np
import matplotlib.pyplot as plt
from torch import Tensor
import torch
from torch.nn import MSELoss
from torch.optim import SGD
from qiskit import Aer, QuantumCircuit
from qiskit.utils import QuantumInstance
from qiskit.opflow import AerPauliExpectation
from qiskit.circuit import Parameter
from qiskit.circuit.library import RealAmplitudes, ZZFeatureMap
from qiskit_machine_learning.neural_networks import CircuitQNN, TwoLayerQNN
from qiskit_machine_learning.connectors import TorchConnector
qi = QuantumInstance(Aer.get_backend('statevector_simulator'))

### Create QNN
num_inputs = 2
feature_map = ZZFeatureMap(num_inputs)
ansatz = RealAmplitudes(num_inputs, entanglement='linear', reps=1)

qc = QuantumCircuit(num_inputs)
qc.append(feature_map, range(num_inputs))
qc.append(ansatz, range(num_inputs))

parity = lambda x: '{:b}'.format(x).count('1') % 2
output_shape = 2  # parity = 0, 1

qnn2 = CircuitQNN(qc, input_params=feature_map.parameters, weight_params=ansatz.parameters, 
                  interpret=parity, output_shape=output_shape, quantum_instance=qi)

# set up PyTorch module
initial_weights = np.array([0.1]*qnn2.num_weights)
model2 = TorchConnector(qnn2, initial_weights)

### Trivial dataset
X = Tensor(np.stack(([0.5, 0.5],)*20))
y = Tensor(np.stack(([-0.5, -0.5],)*20))

### Define optimizer and loss
optimizer = SGD(model2.parameters(), lr = 0.1)
f_loss = MSELoss(reduction = "sum")

### Method 1: manual sum of squared errors on the full batch
output1 = model2(X)
loss1 = (output1-y).pow(2).sum()
optimizer.zero_grad()
loss1.backward()
print("Loss:", loss1)
print("Gradients:", model2.weights.grad)
# Loss: tensor(40.0025, grad_fn=<SumBackward0>)
# Gradients: tensor([1.1921e-07, 1.9073e-06, 2.3842e-07, 1.1921e-07])

### Method 2: MSELoss(reduction="sum") on the full batch
output2 = model2(X)
loss2 = f_loss(output2, y)
optimizer.zero_grad()
loss2.backward()
print("Loss:", loss2)
print("Gradients:", model2.weights.grad)
# Loss: tensor(40.0025, grad_fn=<MseLossBackward>)
# Gradients: tensor([1.1921e-07, 1.9073e-06, 2.3842e-07, 1.1921e-07])

### Method 3: per-sample losses accumulated in a for loop
loss3 = 0.0
for xt, yt in zip(X,y):
    output3 = model2(xt)
    loss3 += f_loss(output3, yt)
optimizer.zero_grad()
loss3.backward()
print("Loss:", loss3)
print("Gradients:", model2.weights.grad)
# Loss: tensor(40.0025, grad_fn=<AddBackward0>)
# Gradients: tensor([-0.0311,  0.3078,  0.0567, -0.0312])

### Method 4: loss on a single sample
output4 = model2(X[0])
loss4 = f_loss(output4, y[0])
optimizer.zero_grad()
loss4.backward()
print("Loss:", loss4)
print("Gradients:", model2.weights.grad)
# Loss: tensor(2.0001, grad_fn=<MseLossBackward>)
# Gradients: tensor([-0.0016,  0.0154,  0.0028, -0.0016])
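
As a quick consistency check on the numbers printed above, the snippet below compares the reported gradients against the expected factor of 20. It is only a sketch: the values are copied by hand from the printed outputs (rounded to four decimals), and the tolerance is an assumption chosen to absorb that rounding.

```python
# Gradients copied from the printed outputs above (rounded as printed)
grad_batch = [1.1921e-07, 1.9073e-06, 2.3842e-07, 1.1921e-07]  # Methods 1 and 2
grad_loop = [-0.0311, 0.3078, 0.0567, -0.0312]                 # Method 3
grad_single = [-0.0016, 0.0154, 0.0028, -0.0016]               # Method 4

n_samples = 20
tol = 2e-3  # assumed tolerance: covers the rounding of the printed values


def matches_scaled_single(grad):
    """Check whether `grad` equals n_samples * grad_single within `tol`."""
    return all(abs(g - n_samples * s) < tol for g, s in zip(grad, grad_single))


loop_matches_single = matches_scaled_single(grad_loop)
batch_matches_single = matches_scaled_single(grad_batch)

print("Method 3 == 20 * Method 4:", loop_matches_single)
print("Methods 1/2 == 20 * Method 4:", batch_matches_single)
```

With the values above, the loop-based gradient agrees with 20 times the single-sample gradient, while the batched gradient does not.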

What is the expected behavior?

The gradients from Methods 1, 2 and 3 should all be equal, and 20 times the gradient from Method 4. Instead, evaluating the loss on a batch of data (Methods 1 and 2) yields vanishing gradients. Note that if one replaces the quantum model created through Qiskit with a simple model2 = torch.nn.Linear(2,2), then the gradients are equal as expected, so the problem is somewhere in the Qiskit Machine Learning module. (To run the code above with the classical linear layer, also replace model2.weights.grad with model2.weight.grad.)
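
For reference, the classical control experiment can be sketched on its own, without any Qiskit dependency. This is a minimal sketch of the check, not the exact code I ran: the seed and the helper weight_grad are introduced here for illustration only.

```python
import torch
from torch.nn import Linear, MSELoss

torch.manual_seed(0)  # arbitrary seed (assumption), only for reproducibility

model = Linear(2, 2)  # classical stand-in for the TorchConnector model
f_loss = MSELoss(reduction="sum")

# Same trivial dataset as above: 20 identical inputs and targets
X = torch.tensor([[0.5, 0.5]] * 20)
y = torch.tensor([[-0.5, -0.5]] * 20)


def weight_grad(loss):
    """Hypothetical helper: clear old gradients, backpropagate, copy the weight gradient."""
    model.zero_grad()
    loss.backward()
    return model.weight.grad.clone()


# Batched loss, per-sample loop, and single-sample loss
grad_batch = weight_grad(f_loss(model(X), y))
grad_loop = weight_grad(sum(f_loss(model(xt), yt) for xt, yt in zip(X, y)))
grad_single = weight_grad(f_loss(model(X[0]), y[0]))

print(torch.allclose(grad_batch, grad_loop, rtol=1e-4, atol=1e-5))
print(torch.allclose(grad_batch, 20 * grad_single, rtol=1e-4, atol=1e-5))
```

For the classical layer both comparisons should hold, which is the behavior I would expect from the TorchConnector model as well.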
