pasqal-io / qadence

Digital-analog quantum programming interface
https://pasqal-io.github.io/qadence/latest/
Apache License 2.0

[Bug] Initial Loss Value Unavailable in TensorBoard Logs #444

Closed DanieleCucurachi closed 3 months ago

DanieleCucurachi commented 4 months ago

The train() function in ml_tools/train_grad.py starts TensorBoard logging only after the first training iteration. As a consequence, the value of the loss before the first optimization step is never available: the earliest loss value in the training logs is recorded after optimization has already started, and it is tagged as iteration 0. It would be more appropriate to log the pre-optimization loss as iteration 0.

dominikandreasseitz commented 4 months ago

Hi @DanieleCucurachi, we will double-check, but AFAIK the loss at iteration 0 should correspond to the loss before applying the first optimization step.

DanieleCucurachi commented 4 months ago

Hi @dominikandreasseitz, as far as I can see in train(), write_tensorboard() is called after optimize_step(). Furthermore, when I retrieve the loss values during training for identical models (same circuit and initial weights) trained with different optimizers, the values at iteration 0 are not consistent.

I was about to fix it myself; should I wait for someone to check first?

# outer epoch loop
for iteration in progress.track(range(init_iter, init_iter + config.max_iter)):
    try:
        # in case there is not data needed by the model
        # this is the case, for example, of quantum models
        # which do not have classical input data (e.g. chemistry)
        if dataloader is None:
            loss, metrics = optimize_step(
                model=model,
                optimizer=optimizer,
                loss_fn=loss_fn,
                xs=None,
                device=device,
                dtype=data_dtype,
            )
            loss = loss.item()

        elif isinstance(dataloader, (DictDataLoader, DataLoader)):
            loss, metrics = optimize_step(
                model=model,
                optimizer=optimizer,
                loss_fn=loss_fn,
                xs=next(dl_iter),  # type: ignore[arg-type]
                device=device,
                dtype=data_dtype,
            )

        else:
            raise NotImplementedError(
                f"Unsupported dataloader type: {type(dataloader)}. "
                "You can use e.g. `qadence.ml_tools.to_dataloader` to build a dataloader."
            )

        if iteration % config.print_every == 0 and config.verbose:
            print_metrics(loss, metrics, iteration)

        # note: TensorBoard writing happens after optimize_step() within the same iteration
        if iteration % config.write_every == 0:
            write_tensorboard(writer, loss, metrics, iteration)

        if config.folder:
            if iteration % config.checkpoint_every == 0:
                write_checkpoint(config.folder, model, optimizer, iteration)

dominikandreasseitz commented 4 months ago

Sure, but the loss returned by the first optimize_step call should be the initial one (before any optimization step has been applied). I can try to have a look today, but if you want to have a crack at it yourself, you're very welcome.
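
For reference, optimize_step typically follows the standard zero_grad/backward/step ordering, roughly like the sketch below (an illustration only, not the exact qadence implementation; device and dtype handling are omitted):

# Illustrative sketch, not the exact qadence implementation.
def optimize_step(model, optimizer, loss_fn, xs, device=None, dtype=None):
    optimizer.zero_grad()
    # The loss is evaluated on the current parameters, i.e. before the update...
    loss, metrics = loss_fn(model, xs)
    loss.backward()
    # ...and only afterwards are the parameters changed.
    optimizer.step()
    return loss, metrics

If the real implementation follows this ordering, the value returned by the first call is indeed the pre-optimization loss; the remaining question is under which iteration index it gets written to TensorBoard.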

dominikandreasseitz commented 4 months ago

@chMoussa can you have a look at this? cc @Roland-djee

chMoussa commented 4 months ago

Can we have your code example to work on, @DanieleCucurachi?

DanieleCucurachi commented 4 months ago

Here is a simple snippet that exposes the problem:

from pathlib import Path
from functools import reduce
from operator import add
from itertools import count

import torch
import matplotlib.pyplot as plt

from qadence import Parameter, QuantumCircuit, Z
from qadence import hamiltonian_factory, hea, feature_map, chain
from qadence.models import QNN
from qadence.ml_tools import TrainConfig, train_with_grad, to_dataloader, DictDataLoader

DEVICE = torch.device('cpu')
DTYPE = torch.complex64
SEED = 42
torch.manual_seed(SEED)

# Simple QNN: feature map + hardware-efficient ansatz with a total-Z observable.
n_qubits = 4
fm = feature_map(n_qubits)
ansatz = hea(n_qubits=n_qubits, depth=3)
observable = hamiltonian_factory(n_qubits, detuning=Z)
circuit = QuantumCircuit(n_qubits, fm, ansatz)
model = QNN(circuit, observable, backend="pyqtorch", diff_mode="ad")

batch_size = 100
input_values = {"phi": torch.rand(batch_size, requires_grad=True)}
pred = model(input_values)

cnt = count()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

def loss_fn(model: torch.nn.Module, data: torch.Tensor) -> tuple[torch.Tensor, dict]:
    next(cnt)
    x, y = data[0], data[1]
    out = model(x)
    loss = criterion(out, y)
    return loss, {}

def validation_criterion(
    current_validation_loss: float, current_best_validation_loss: float, val_epsilon: float
) -> bool:
    return current_validation_loss <= current_best_validation_loss - val_epsilon

n_epochs = 10
config = TrainConfig(
    max_iter=n_epochs,
    print_every=1,
    batch_size=batch_size,
    checkpoint_best_only=True,
    val_every=1,  # run the model on the validation data every `val_every` epochs
    validation_criterion=validation_criterion,
    folder='./tmp/exqnn',
)

# Target function and (identical) train/validation dataloaders.
fn = lambda x, degree: .05 * reduce(add, (torch.cos(i * x) + torch.sin(i * x) for i in range(degree)), 0.)
x = torch.linspace(0, 10, batch_size, dtype=torch.float32).reshape(-1, 1)
y = fn(x, 5)
data = DictDataLoader(
    {
        "train": to_dataloader(x, y, batch_size=batch_size, infinite=True),
        "val": to_dataloader(x, y, batch_size=batch_size, infinite=True),
    }
)

# This pre-optimization loss is printed to stdout but never appears in the TensorBoard logs.
print("Initial loss", loss_fn(model, (x, y))[0].item())
train_with_grad(model, data, optimizer, config, loss_fn=loss_fn, device=DEVICE, dtype=DTYPE)

plt.clf()
plt.plot(x.numpy(), y.numpy(), label='truth')
plt.plot(x.numpy(), model(x).detach().numpy(), "--", label="final", linewidth=3)
plt.legend()

Once it has run, inspect the logs with tensorboard --logdir tmp.
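
To check programmatically which step indices actually made it into the event files, they can be read back with TensorBoard's EventAccumulator (a sketch only; the exact file layout under ./tmp/exqnn and the scalar tag names are assumptions):

from pathlib import Path
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# The layout under ./tmp/exqnn (the folder set in TrainConfig) is an assumption here.
for event_file in sorted(Path("./tmp/exqnn").rglob("events.out.tfevents.*")):
    acc = EventAccumulator(str(event_file))
    acc.Reload()
    for tag in acc.Tags().get("scalars", []):
        steps = [event.step for event in acc.Scalars(tag)]
        print(f"{event_file.parent.name}/{tag}: first logged step = {min(steps) if steps else 'none'}")

If the reported first step is 0 but the corresponding value differs from the printed initial loss, the value tagged as iteration 0 is not the pre-optimization one.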

chMoussa commented 4 months ago

Thanks @DanieleCucurachi. From the function definition, we actually log the values after optimization, so we should increment init_iter in this function to be consistent with other ML frameworks. I understand, though, that it would be useful to log the pre-training evaluations in TensorBoard. This can be done by adding an extra argument to TrainConfig and making a small modification to the train function. I will add this feature if you agree, @dominikandreasseitz and @DanieleCucurachi?
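
For concreteness, the idea could look roughly like the sketch below (the field name log_initial_loss is hypothetical, and this is only an outline of the proposal, not the actual patch):

# Sketch of the proposal; the TrainConfig field name `log_initial_loss` is hypothetical.
# In TrainConfig:
#     log_initial_loss: bool = True
#
# In train(), just before entering the outer epoch loop:
if config.log_initial_loss:
    xs = None if dataloader is None else next(dl_iter)
    with torch.no_grad():
        initial_loss, initial_metrics = loss_fn(model, xs)
    write_tensorboard(writer, initial_loss.item(), initial_metrics, init_iter)

This would write the loss of the untrained model under the initial index, so the TensorBoard series starts from the pre-optimization value.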

dominikandreasseitz commented 4 months ago

@chMoussa sounds good, tag me on the PR, thank you!

DanieleCucurachi commented 4 months ago

@chMoussa great, please tag me as well. Thank you!

chMoussa commented 4 months ago

Hey, after looking more carefully at the train function, it turns out that:

  1. For the i-th gradient-descent optimization step, the value logged to TensorBoard is the training loss at the (i-1)-th iteration (since loss_fn is called before step() in optimize_step), while the validation values for the i-th iteration are logged under the same iteration number.
  2. In TensorBoard, math.nan can end up being logged for the loss value.

I plan to solve these inconsistencies for the new feature.
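
To make point 1 concrete, here is a tiny standalone trace in plain PyTorch (independent of qadence) showing that the loss obtained before the i-th step() call describes the parameters after i-1 updates, while anything evaluated after step() sees the updated parameters:

import torch

# Minimal gradient-descent trace, independent of qadence, to illustrate the
# off-by-one relationship between the returned loss and the parameter state.
theta = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

for i in range(3):
    opt.zero_grad()
    loss = (theta ** 2).sum()   # evaluated on the parameters after i updates
    loss.backward()
    opt.step()                  # parameters now reflect i + 1 updates
    # Writing `loss` under index i tags the pre-update state; a validation loss
    # computed at this point would describe the post-update state.
    print(f"iteration {i}: training loss before this update = {loss.item():.4f}")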