nansencenter / nextsim_dynamics_ml

ML-based surrogate model for neXTSIM dynamics
GNU General Public License v3.0

ML setup #6

Open franamor98 opened 6 months ago

franamor98 commented 6 months ago

Machine Learning setup and training issues

In this issue I will keep track of the current training setup, the results, and the problems I encounter while trying to converge to a good solution.

Data

The data comes from a 24 h simulation (1 h output resolution) over the whole Arctic. I am focusing on a small region (400 km radius) in the central Arctic. The goal is to predict the next position of each vertex. For training I use all the vertices at times 1, 3, 5, 7, 9, and 11, and for validation I use the same area at times 12, 14, and 16.
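For illustration, the temporal split amounts to something like the following (variable names are hypothetical, not from the repository):

```python
# Each sample pairs the state at time t with the vertex positions at t+1.
train_times = [1, 3, 5, 7, 9, 11]  # training snapshots
val_times = [12, 14, 16]           # validation snapshots (same region)

train_samples = [(t, t + 1) for t in train_times]
val_samples = [(t, t + 1) for t in val_times]
```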

Representation and input

To represent the mesh I use two graphs, one containing vertex information and another containing element information. The input is composed of a central vertex with all neighbouring elements/vertices around it.


As input features for the vertex graph I use the position (x, y) and the forcing fields (wind, ocean). For the element graph I use the position (x, y), thickness, and concentration. All inputs are standardized.
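As a concrete illustration, one central-vertex neighbourhood could be assembled as a pair of PyTorch Geometric `Data` objects like this (the feature layout and sizes are assumptions, not the repository's actual preprocessing):

```python
import torch
from torch_geometric.data import Data

n_v, n_e = 7, 6  # hypothetical neighbourhood: 7 vertices, 6 elements

# Vertex graph: x, y plus wind (u, v) and ocean (u, v) forcing -> 6 standardized features.
vertex_x = torch.randn(n_v, 6)
vertex_edges = torch.tensor([[0, 1, 0, 2], [1, 0, 2, 0]])  # edges in both directions
g_vertex = Data(x=vertex_x, edge_index=vertex_edges)

# Element graph: x, y, thickness, concentration -> 4 standardized features.
element_x = torch.randn(n_e, 4)
element_edges = torch.tensor([[0, 1], [1, 0]])
g_element = Data(x=element_x, edge_index=element_edges)
```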

Model

The model processes both graphs with graph convolutions (GCNN) and then combines both feature vectors to predict the velocity of the vertex:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import ChebConv, global_mean_pool


class GCNN_2G(nn.Module):

    def __init__(self, num_features1, num_features2, hidden_channels, output_size, dropout=0.1):
        super(GCNN_2G, self).__init__()
        # conv layers as a test [WIP]
        self.conv1 = ChebConv(num_features1, hidden_channels, K=1)
        self.conv2 = ChebConv(num_features2, hidden_channels, K=1)
        # dropout is applied after each convolution (ChebConv itself takes no dropout argument)
        self.dropout = dropout
        self.fc = nn.Linear(hidden_channels, output_size)

    def forward(self, g1, g2):
        # process first graph (vertices)
        x1 = self.conv1(g1.x, g1.edge_index, g1.edge_attr)
        x1 = F.relu(x1)
        x1 = F.dropout(x1, p=self.dropout, training=self.training)
        # process second graph (elements)
        x2 = self.conv2(g2.x, g2.edge_index, g2.edge_attr)
        x2 = F.relu(x2)
        x2 = F.dropout(x2, p=self.dropout, training=self.training)

        # pool each graph to one feature vector per sample
        x1 = global_mean_pool(x1, g1.batch)
        x2 = global_mean_pool(x2, g2.batch)

        # average both pooled outputs
        x = torch.stack([x1, x2], dim=0)
        x = torch.mean(x, dim=0)
        # fully connected layer for the final output
        x = self.fc(x)

        return x
```
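Continuing with the dummy graphs sketched above, a forward pass through this model might look like the following (batching via PyTorch Geometric's `DataLoader`; sizes are illustrative):

```python
from torch_geometric.loader import DataLoader

model = GCNN_2G(num_features1=6, num_features2=4, hidden_channels=64, output_size=2)

batch1 = next(iter(DataLoader([g_vertex], batch_size=1)))   # vertex neighbourhoods
batch2 = next(iter(DataLoader([g_element], batch_size=1)))  # element neighbourhoods

v_pred = model(batch1, batch2)  # predicted (u, v) velocity of the central vertex
print(v_pred.shape)             # torch.Size([1, 2])
```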

Loss Function

A custom loss function that accounts for velocity, position, and velocity angle; it can be parametrized with the A, B, C coefficients.

```python
class CustomIceLoss(nn.Module):
    def __init__(self, A=1, B=0, C=0, step=1, d_time=3600):
        super(CustomIceLoss, self).__init__()
        self.mae = nn.L1Loss()
        self.A = A
        self.B = B
        self.C = C
        self.step = step
        self.d_time = d_time

    def velocity_angle(self, real_v, v_pred):
        # normalize both velocity vectors and compute the angle between them
        v_pred = v_pred / v_pred.norm(p=2, dim=1).unsqueeze(1)
        real_v = real_v / real_v.norm(p=2, dim=1).unsqueeze(1)
        dot = (v_pred * real_v).sum(dim=1)
        # clamp to keep acos (and its gradient) away from the +-1 singularities
        dot = torch.clamp(dot, min=-0.99, max=0.99)
        return torch.acos(dot).mean()

    def forward(self, v_pred, v_real, init_coords):
        # advect the initial coordinates with the predicted and real velocities
        position_pred = v_pred * self.step * self.d_time + init_coords
        position_real = v_real * self.step * self.d_time + init_coords

        position_error = 0
        angle_error = 0
        velocity_error = 0

        if self.A > 0:
            velocity_error = self.mae(v_real, v_pred)

        if self.B > 0:
            position_error = self.mae(position_real, position_pred)

        if self.C > 0:
            angle_error = self.velocity_angle(v_pred, v_real)
            if torch.isnan(angle_error):
                print("nan error")

        error = self.A * velocity_error + self.B * position_error + self.C * angle_error

        return error
```
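The total error is therefore `A * velocity_MAE + B * position_MAE + C * angle_error`. A quick smoke test with random tensors (the coefficient values here are illustrative, not tuned):

```python
criterion = CustomIceLoss(A=1, B=0.5, C=0.1, step=1, d_time=3600)

v_pred = torch.randn(32, 2, requires_grad=True)  # predicted velocities
v_real = torch.randn(32, 2)                      # target velocities
init_coords = torch.randn(32, 2)                 # vertex positions at the initial time

loss = criterion(v_pred, v_real, init_coords)
loss.backward()
```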

Training and results

Training dynamics are not bad and there is no overfitting over time (using these data splits). After standardizing the features I had to clip the gradient norm to 1 to stabilize training, otherwise the gradient exploded. Typical training curves look like this one (note the log scale):

[figure: training loss curve, log scale]

The results are on the order of an RMSE of ~65.
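For reference, the gradient clipping mentioned above sits in the training step roughly like this (a sketch; `train_loader` and its batch layout are hypothetical):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch1, batch2, v_real, init_coords in train_loader:  # hypothetical loader
    optimizer.zero_grad()
    v_pred = model(batch1, batch2)
    loss = criterion(v_pred, v_real, init_coords)
    loss.backward()
    # clip the gradient norm to 1 to keep it from exploding
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```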

Problems

One of the main problems is that using a GPU for training does not speed up the process, even with large batches (128 or 512).

rikrd commented 6 months ago

Several questions:

  1. How much larger in batch size can you go?
  2. When you say that GPU training doesn't speed up, is it in comparison to CPU?
  3. Have you tried with larger neighborhoods or more layers of convs? Perhaps with small kernels and shallow networks, the GPU is less useful?
rikrd commented 6 months ago

More questions:

  1. Are you keeping the whole graph in a PyTorch Geometric Data object and then using NeighborLoader to sample the minibatch? I think you should use the library's tools as much as possible, since they may be quite well optimized for GPU usage (see the sketch after this list).
  2. Does the whole dataset fit in CPU memory?
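For illustration, the sampling suggested in question 1 might look like this (a sketch, assuming the whole mesh is held in a single `Data` object named `data`):

```python
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data,
    num_neighbors=[10],  # neighbours sampled per conv layer
    batch_size=128,
    shuffle=True,
)
for subgraph in loader:
    ...  # forward/backward on each sampled subgraph
```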
franamor98 commented 6 months ago
  1. Really large, up to more than 2000. The input features are of shape [N_neigh*batch, n_features]; normally n_features = 6. Adding more neighbors and using bigger batches have the same effect on the input tensor.
  2. Yes, in comparison with my laptop CPU (an old i7).
  3. I am waiting for results in terms of the number of neighbors, but in principle it does not significantly affect the results. Also, the best results I am obtaining are with shallow nets; increasing the number of convs doesn't seem to improve them (it's even worse).
  4. (and 5.) I am using a normal torch Dataset that keeps the objects in memory (I believe) and then passing it to the torch geometric DataLoader (https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html#torch_geometric.loader.DataLoader); I don't use NeighborLoader.
franamor98 commented 6 months ago

model update:

```python
class GCNN_2G(nn.Module):

    def __init__(self, num_features1, num_features2, hidden_channels, output_size, dropout=0.1):
        super(GCNN_2G, self).__init__()
        # conv layers as a test [WIP]
        # hidden_channels is a pair of lists: per-layer channel sizes for the
        # vertex graph and for the element graph
        channels_v, channels_e = hidden_channels

        if len(channels_v) != len(channels_e):
            raise ValueError("Different number of layers for vertices and elements")
        if len(channels_v) == 0:
            raise ValueError("At least one layer is needed")
        if channels_e[-1] != channels_v[-1]:
            raise ValueError("Last layer must have the same number of channels for vertices and elements")

        self.dropout = dropout

        # vertex-graph convolutions
        self.convs1 = nn.ModuleList([ChebConv(num_features1, channels_v[0], K=1)])
        for i in range(len(channels_v) - 1):
            self.convs1.append(ChebConv(channels_v[i], channels_v[i + 1], K=1))

        # element-graph convolutions
        self.convs2 = nn.ModuleList([ChebConv(num_features2, channels_e[0], K=1)])
        for i in range(len(channels_e) - 1):
            self.convs2.append(ChebConv(channels_e[i], channels_e[i + 1], K=1))

        self.fc = nn.Linear(channels_v[-1], output_size)

    def forward(self, g1, g2):
        # chain the convolutions, feeding each layer's output into the next
        x1 = g1.x
        for conv in self.convs1:
            x1 = F.relu(conv(x1, g1.edge_index, g1.edge_attr))
            x1 = F.dropout(x1, p=self.dropout, training=self.training)

        x2 = g2.x
        for conv in self.convs2:
            x2 = F.relu(conv(x2, g2.edge_index, g2.edge_attr))
            x2 = F.dropout(x2, p=self.dropout, training=self.training)

        # pool each graph to one feature vector per sample
        x1 = global_mean_pool(x1, g1.batch)
        x2 = global_mean_pool(x2, g2.batch)

        # average both pooled outputs
        x = torch.stack([x1, x2], dim=0)
        x = torch.mean(x, dim=0)
        # fully connected layer for the final output
        x = self.fc(x)

        return x
```
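For example, the updated model could be instantiated like this (the channel sizes are illustrative):

```python
model = GCNN_2G(
    num_features1=6,                       # vertex features: x, y, wind, ocean
    num_features2=4,                       # element features: x, y, thickness, concentration
    hidden_channels=([64, 32], [64, 32]),  # (vertex channels, element channels)
    output_size=2,                         # predicted (u, v) velocity
)
```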