lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch
MIT License

Missing features for graph embedding #3

Closed MarcusLoppe closed 9 months ago

MarcusLoppe commented 9 months ago

Hi,

First time posting on a project's issue page, so apologies if I make any mistakes. I've read through the paper many times and I think you are not embedding all the features mentioned in the paper. I believe the features are (F): 9 (coordinates) + 1 (area) + 3 (angles) + 3 (normal) = 16

So since F = 16, the input for the graph encoder should be 16x196 and the output 16x576 per face. I figured I'd post this now since you have progressed the project quite a bit and will probably be testing it soon.

I'm not great at tensor programming, so I just asked ChatGPT to modify the encoder using the details from the paper. The code is probably incorrect since I don't 100% understand the tensor operations you are doing, but at least it can provide some inspiration or a boilerplate example.

@beartype
def encode(
    self,
    *,
    vertices:         TensorType['b', 'nv', 3, int],
    faces:            TensorType['b', 'nf', 3, int],
    face_edges:       TensorType['b', 'e', 2, int],
    face_mask:        TensorType['b', 'nf', bool],
    face_edges_mask:  TensorType['b', 'e', bool],
    return_face_coordinates = False
):
    # ... [existing code up to face_embed definition] ...

    # Calculate additional face attributes
    # Using vertices and faces to calculate the area, angles, and normal for each face
    # Gather the 3 vertex positions of each face; the index must be expanded to the coordinate dim
    face_vertices = vertices.gather(1, rearrange(faces, 'b nf c -> b (nf c) 1').expand(-1, -1, 3))
    face_vertices = rearrange(face_vertices, 'b (nf c) d -> b nf c d', c = 3) # (b, nf, 3, 3)
    sides = face_vertices[:, :, [1, 2, 0], :] - face_vertices[:, :, [0, 1, 2], :]
    side_lengths = sides.norm(dim=-1)

    # Calculate area (using Heron's formula for simplicity)
    s = side_lengths.sum(dim=-1) / 2
    area = torch.sqrt(s * (s - side_lengths[:, :, 0]) * (s - side_lengths[:, :, 1]) * (s - side_lengths[:, :, 2]))
    area = area.unsqueeze(-1) # Reshape for concatenation

    # Calculate angles (using cosine rule)
    angles = torch.acos((side_lengths[:, :, [1, 2, 0]] ** 2 + side_lengths[:, :, [2, 0, 1]] ** 2 - side_lengths[:, :, [0, 1, 2]] ** 2) / (2 * side_lengths[:, :, [1, 2, 0]] * side_lengths[:, :, [2, 0, 1]]))
    # angles already has shape (b, nf, 3), ready for concatenation

    # Calculate normals (using cross product)
    normals = torch.cross(sides[:, :, 0, :], sides[:, :, 1, :], dim=-1)
    normals = normals / normals.norm(dim=-1, keepdim=True) # Normalize

    # Concatenate additional features
    face_additional_features = torch.cat([area, angles, normals], dim=-1)
    face_additional_features = rearrange(face_additional_features, 'b nf d -> b nf (d)')

    # Concatenate with existing face embeddings
    face_embed = torch.cat([face_embed, face_additional_features], dim=-1)

    # ... [rest of the existing code] ...

    return face_embed, face_coords
lucidrains commented 9 months ago

face_additional_features = rearrange(face_additional_features, 'b nf d -> b nf (d)')

oh chatgpt... smh

lucidrains commented 9 months ago

@MarcusLoppe sounds good, chatgpt not needed. better just to prompt me :) could you screenshot the relevant section for area + angles?

lucidrains commented 9 months ago

this isn't a big deal, as those are probably all derived values off the 3 vertices. it makes sense they would do a little bit of feature engineering, although in the decoder, i don't think they predict those derived values, just the 3 discretized coordinates. let me know the section where they predict the area + angles and i'll update my views.

MarcusLoppe commented 9 months ago

It's true that those values probably can be calculated by the model, but I'm guessing that is quite hard for it to learn :) I'm guessing the angles can help it understand the shape.

The paper has Fx192 as input and Fx576 as output in the graph encoder, and the ResNet then takes the Fx576 features as input. But the ResNet outputs 9|F|x128, so it's not predicting those values, just the coordinates (?) "The decoder, a 1D ResNet-34 [22], interprets face features as a sequence, outputting logits for the discretized face triangle coordinates in a 128^3 space"

Here are the details from the paper; you can search for "angle" for all the results.

B. Method Details B.1. Architecture The architecture of our encoder-decoder network is elaborated in Fig. 14. The encoder comprises a series of SAGEConv [20] graph convolution layers, processing the mesh in the form of a face graph. For each graph node, input features include the positionally encoded 9 coordinates of the face triangle, its area, the angles between its edges, and the normal of the face.

Under User Study details: "Figure 14. Our encoder-decoder network features an encoder with SAGEConv [20] layers processing mesh faces as a graph. Each node inputs positionally encoded face triangle coordinates, area, edge angles, and normal. The decoder, a 1D ResNet-34 [22], interprets face features as a sequence, outputting logits for the discretized face triangle coordinates in a 128^3 space."
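
For reference, here is a cleaner standalone sketch of how those per-face input features (area, edge angles, normal) can be derived from the three face vertices. This is only illustrative, with made-up function and variable names, not the repo's actual encoder code.

import torch
from einops import rearrange

def derive_face_features(vertices, faces):
    # illustrative sketch, not the repo's encoder
    # vertices: (b, nv, 3) float, faces: (b, nf, 3) int
    # returns (b, nf, 1 + 3 + 3): area, interior angles, unit normal per face

    # gather the 3 vertex positions of every face -> (b, nf, 3, 3)
    index = rearrange(faces.long(), 'b nf c -> b (nf c) 1').expand(-1, -1, 3)
    face_vertices = rearrange(vertices.gather(1, index), 'b (nf c) d -> b nf c d', c = 3)

    # edge vectors and their lengths
    edges = face_vertices[:, :, [1, 2, 0], :] - face_vertices[:, :, [0, 1, 2], :]
    lengths = edges.norm(dim = -1).clamp(min = 1e-8)

    # area from the cross product of two edges (half its magnitude), and the unit normal
    cross = torch.cross(edges[:, :, 0, :], edges[:, :, 1, :], dim = -1)
    area = 0.5 * cross.norm(dim = -1, keepdim = True)
    normal = cross / cross.norm(dim = -1, keepdim = True).clamp(min = 1e-8)

    # interior angles via the law of cosines
    la, lb, lc = lengths.unbind(dim = -1)
    cos_angles = torch.stack([
        (lb ** 2 + lc ** 2 - la ** 2) / (2 * lb * lc),
        (lc ** 2 + la ** 2 - lb ** 2) / (2 * lc * la),
        (la ** 2 + lb ** 2 - lc ** 2) / (2 * la * lb),
    ], dim = -1).clamp(-1., 1.)
    angles = torch.acos(cos_angles)

    return torch.cat([area, angles, normal], dim = -1)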

lucidrains commented 9 months ago

@MarcusLoppe ok yea, i'll take care of the remaining derived values tomorrow morning. it is easy to do

lucidrains commented 9 months ago

@MarcusLoppe and yea, it doesn't seem like they reconstruct anything else but the discretized coordinates: "outputting logits for the discretized face triangle coordinates in a 128^3 space".
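
For concreteness, a minimal sketch of what that target looks like: each of the 9 face coordinates is quantized into one of 128 classes and the decoder is trained with plain cross entropy over the logits. Shapes and names here are illustrative, not the actual code.

import torch
import torch.nn.functional as F

num_classes = 128
coords = torch.rand(2, 12, 9) * 2 - 1   # continuous face coordinates in [-1, 1]

# quantize every coordinate into one of 128 bins -> integer class targets
targets = ((coords + 1) / 2 * (num_classes - 1)).round().long().clamp(0, num_classes - 1)

# the decoder would emit one 128-way logit distribution per coordinate (stand-in tensor here)
logits = torch.randn(2, 12, 9, num_classes)

# reconstruction loss is cross entropy over the discretized coordinates only
loss = F.cross_entropy(logits.flatten(0, 2), targets.flatten())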

lucidrains commented 9 months ago

thank you for the code review 💯

MarcusLoppe commented 9 months ago

Awesome! Do you think it's possible to implement a text description by putting a text embedding vector at the start of each sequence for the transformer, or do you think training/inference would take a huge performance hit due to such a large vector? :)

Thanks for all the work :) It will be exciting to see the 3d output since most 3D generation models are quite bad.

lucidrains commented 9 months ago

@MarcusLoppe will actually be using cross attention with classifier free guidance! i'm sure prefix attention would work, but like you noted, would make sequence length longer
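
At sampling time, classifier-free guidance usually amounts to running the model with and without the text conditioning and pushing the conditional logits away from the unconditional ones. A generic sketch, where model and text_embed are placeholders rather than this repo's API:

def cfg_logits(model, tokens, text_embed, cond_scale = 3.):
    # run once with the text conditioning and once with it dropped (the "null" condition)
    cond_logits = model(tokens, text_embed = text_embed)
    null_logits = model(tokens, text_embed = None)
    # cond_scale = 1. recovers the conditional logits; > 1. strengthens the guidance
    return null_logits + (cond_logits - null_logits) * cond_scale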

lucidrains commented 9 months ago

Awesome! Do you think it's possible to implement a text description by putting a text embedding vector at the start of each sequence for the transformer, or do you think training/inference would take a huge performance hit due to such a large vector? :)

Thanks for all the work :) It will be exciting to see the 3d output since most 3D generation models are quite bad.

towards the holodeck!

lucidrains commented 9 months ago

oh, we can keep this open until I finish it

lucidrains commented 9 months ago

@MarcusLoppe want to see if the most recent code lines up with your expectations?

MarcusLoppe commented 9 months ago

Not quite. I trained it before this commit to generate a box (8 vertices, 12 triangles) and it went relatively fast. But now the autoencoder loss is stuck and only goes down about 0.1 per 50 steps (1,000,000 examples).

The reason they limit the xyz coordinates to 128 class values is that a regression might predict a range of values for the same triangle size, e.g. 0.44-0.48 instead of 0.55 (correct). But using classes of fixed values it can generate a more uniform mesh, since it might guess 0.50 for all of them and they would seem to fit.

As far as I understand, the extra features seem to be discretized when it's not necessary, since the decoder won't predict those values. I'm not sure about the vector sizes required for the features, but just passing the real angle as a float value seems to be fine according to the paper. The paper doesn't specify the sizes for the embedding input except in one picture, but 196 doesn't seem correct since the coordinates are in a 128 vector (?)

Maybe we should contact the person that hosts the GitHub page for the MeshGPT paper? If so, is there any good question you can think of to ask about the embedding input? :)


lucidrains commented 9 months ago

@MarcusLoppe hey Marcus, thank you for testing it out so quickly and for noticing the discrepancy in training w/ the previous version

let's try your suggestion to just use the continuous angles and area. for the normals, i'll keep those discrete

if it is still not at least at parity with the previous version, i'll just put in some engineering work and make it all optional; which derived features to include and whether they should be discrete or continuously embedded

MarcusLoppe commented 9 months ago

@MarcusLoppe hey Marcus, thank you for testing it out so quickly and for noticing the discrepancy in training w/ the previous version

let's try your suggestion to just use the continuous angles and area. for the normals, i'll keep those discrete

if it is still not at least at parity with the previous version, i'll just put in some engineering work and make it all optional; which derived features to include and whether they should be discrete or continuously embedded

Hmm, still some issues. I'm not sure exactly how you are encoding the data, but I think you should just use raw values (if possible)? E.g. the normals should be passed as 3 float values calculated by: (c - b).Cross(a - b).Normalize();

I've only done small GCNs, mostly using Keras, so I'm not sure about the continuous_embed, but if everything is correct it might help to just include the angle relative to the ground or the normals.

lucidrains commented 9 months ago

@MarcusLoppe i added a warmup period yesterday btw (configurable here), so it takes ~1k steps before the training is at full learning rate. could it be you are comparing the initial learning rates not knowing about this?
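
For anyone following along, a ~1k step linear warmup can be sketched with a plain LambdaLR (illustrative only; the trainer's actual scheduler and settings may differ):

import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr = 1e-4)

warmup_steps = 1000
scheduler = LambdaLR(optimizer, lambda step: min(1., (step + 1) / warmup_steps))

# call scheduler.step() once per optimizer step; the learning rate ramps from ~0 up to the full 1e-4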

MarcusLoppe commented 9 months ago

@MarcusLoppe i added a warmup period yesterday btw (configurable here), so it takes ~1k steps before the training is at full learning rate. could it be you are comparing the initial learning rates not knowing about this?

I think it was :) I did some testing and successfully recreated an 80-face sphere and a 240-face chair. I had to train the autoencoder & transformer for about 5 epochs for decent results.

I did some testing of discretize vs continuous_embed and found that discretized values for the features are slightly better. I'm not sure about larger and more complex datasets, but I tested it using an 80-face sphere. At epoch 10 it had 0.273 loss using discretized features vs 0.4 using continuous_embed. This is just early testing, so I don't know for sure that discretize is better, but it might be an indicator. See the picture below for some nice graphs and numbers :)

Chair 240 face model: https://fastupload.io/en/XBWlQQECALaPI4Q/file

Discretize vs continuous_embed comparison (image)

lucidrains commented 9 months ago

@MarcusLoppe omg! thank you for sharing your experiments and these results! i will change them back to discrete today

the other idea i thought of is to do a joint discrete + continuous embedding. makes a lot of sense to me, but as usual, should let empiricism speak (will make it so one can ablate the continuous embed additive sum)
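
Roughly what a joint discrete + continuous embedding for a derived scalar feature (e.g. an angle) could look like, with the continuous branch optional so it can be ablated. Module and argument names here are made up for illustration, not the actual implementation:

import math
import torch
from torch import nn

class JointScalarEmbed(nn.Module):
    # illustrative sketch of a joint discrete + continuous scalar embedding
    def __init__(self, dim, num_discrete = 128, lo = 0., hi = math.pi, continuous = True):
        super().__init__()
        self.lo, self.hi, self.num_discrete = lo, hi, num_discrete
        self.discrete_embed = nn.Embedding(num_discrete, dim)             # bucketized lookup
        self.continuous_embed = nn.Linear(1, dim) if continuous else None

    def forward(self, x):
        # x: (...,) scalar feature, e.g. an angle in [lo, hi]
        normalized = ((x - self.lo) / (self.hi - self.lo)).clamp(0., 1.)
        indices = (normalized * (self.num_discrete - 1)).round().long()
        out = self.discrete_embed(indices)
        if self.continuous_embed is not None:
            out = out + self.continuous_embed(normalized.unsqueeze(-1))   # additive continuous term
        return out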

lucidrains commented 9 months ago

@MarcusLoppe for the memory issues in the transformer, are you using flash attention?
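
For context, in PyTorch 2.x flash attention is reachable through the fused scaled_dot_product_attention kernel on supported GPUs, which avoids materializing the full attention matrix (generic sketch, not this repo's attention module):

import torch
import torch.nn.functional as F

# toy q/k/v; on an Ampere+ GPU with fp16/bf16 this dispatches to the flash kernel
q = k = v = torch.randn(1, 8, 4096, 64)

# naive attention would materialize an (8, 4096, 4096) score matrix per batch element;
# the fused kernel computes the same result without ever storing it
out = F.scaled_dot_product_attention(q, k, v, is_causal = True)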

lucidrains commented 9 months ago

@MarcusLoppe also, you can plan on sequence lengths being twice as long, as i will probably build out the RQ transformer (where residual codes are summed initially, and then each residual code decoded by a smaller transformer at the end, hierarchical fashion)
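
A rough sketch of the "sum the residual codes" idea behind an RQ transformer: each position carries several residual code indices whose embeddings are summed into a single token for the coarse transformer, with a smaller transformer later decoding the individual codes per position. Sizes and names below are illustrative only:

import torch
from torch import nn

num_codes, dim, num_quantizers = 1024, 512, 2   # stand-in sizes
codebook = nn.Embedding(num_codes, dim)

# (batch, positions, residual levels) of code indices from the residual quantizer
codes = torch.randint(0, num_codes, (1, 100, num_quantizers))

# sum the residual code embeddings -> one token per position for the coarse transformer
coarse_tokens = codebook(codes).sum(dim = -2)   # (1, 100, dim)

# a smaller inner transformer would then decode the num_quantizers codes within each position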