lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch

Question #57

Open nicolasdonati opened 7 months ago

nicolasdonati commented 7 months ago

Hi everyone,

I had some questions about this method:

I hope people can help :) Have a great day,

lucidrains commented 7 months ago

@nicolasdonati there is a way to do infilling (what you are describing with the first bullet point) with autoregressive transformers, devised by openai themselves. however, i don't expect it to work as well as denoising diffusion, masked denoising, or other NAR methods
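For reference, here is a minimal sketch of the fill-in-the-middle (FIM) idea referred to above: the training sequence is rearranged so an autoregressive model learns to generate the missing middle span after seeing the prefix and suffix. The sentinel token ids below are hypothetical placeholders and not part of meshgpt-pytorch.

import torch

# hypothetical sentinel ids, chosen outside the codebook's token range
PRE, SUF, MID = 10000, 10001, 10002

def to_fim_sequence(tokens: torch.Tensor, hole_start: int, hole_end: int) -> torch.Tensor:
    # split the original token sequence into prefix / middle (the hole) / suffix
    prefix = tokens[:hole_start]
    middle = tokens[hole_start:hole_end]
    suffix = tokens[hole_end:]
    sent = lambda i: torch.tensor([i], dtype = tokens.dtype)
    # layout: <PRE> prefix <SUF> suffix <MID> middle
    # a model trained on this layout can infill at inference time: feed everything
    # up to and including <MID>, then let it generate the missing middle span
    return torch.cat([sent(PRE), prefix, sent(SUF), suffix, sent(MID), middle])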

as for your second, maybe you can consult Marcus. how much experience do you have training transformers?

MarcusLoppe commented 7 months ago

> The transformer seems to learn about 3D meshes as a sequence, but the order of that sequence must have a lot of importance then; from what I understood, all meshes are ordered from bottom to top (along the z axis), is that correct? Then if we pass the middle part of the mesh as a prompt, this will not work for completion? Also, does the encoding depend on the order of the sequence?

Correct. MeshGPT did this because that's how PolyGen (created by DeepMind) did it, so there is probably some reasoning behind it. It might work to shuffle the faces to push the transformer to generalize more, but I'm guessing that would require a bigger transformer model & dataset. I don't think the autoencoder cares, since the graph encoder turns the faces into nodes, which aren't in any particular order.
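For anyone unfamiliar with the convention, here is a rough sketch of the PolyGen-style canonical ordering (vertices sorted bottom-to-top by z, then y, then x; faces re-indexed, rotated so the lowest vertex index comes first, then sorted). This is my reading of the convention, not code from this repo.

import numpy as np

def sort_mesh_polygen_style(vertices: np.ndarray, faces: np.ndarray):
    # sort vertices bottom-to-top: z is the primary key, then y, then x
    order = np.lexsort((vertices[:, 0], vertices[:, 1], vertices[:, 2]))
    vertices = vertices[order]
    remap = np.argsort(order)          # old vertex index -> new vertex index
    faces = remap[faces]
    # rotate each face so its smallest vertex index comes first
    roll = faces.argmin(axis = 1)
    faces = np.stack([np.roll(f, -r) for f, r in zip(faces, roll)])
    # sort faces lexicographically by their (lowest, next, ...) vertex indices
    faces = faces[np.lexsort(faces.T[::-1])]
    return vertices, faces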

> I built a toy dataset of 1000 shapes with 100 faces each (no text labels). I expected things to train quite efficiently there, but while the autoencoder seemed to train OK, even after >24 hours of training the transformer still has not converged to a reasonable loss. I feel like I may be missing something.

One thing to keep in mind is that the MeshGPT paper fine-tuned on e.g. chairs and then used that chair variant to generate chairs, but it couldn't generate tables. So imagine that you have a book with all the types of triangle shapes that make up chairs; if you then need to include other shapes such as tables, the book has to be a lot bigger and more generalized, and each triangle needs a better description. This is the issue I'm struggling with.

My advice is to take a look at the actual decoded result that the autoencoder generates with the code below; the transformer can't perform well if its vocabulary doesn't make sense. Then you can increase the codebook and its dim sizes, and you can also try increasing the ResNet and SAGEConv model sizes.

Also, make the transformer a bit bigger: use a dim of 512/768 and set attn_depth to 24.

from meshgpt_pytorch import MeshAutoencoder

# deeper decoder: 3 layers at 192, 4 at 256, 23 at 384, then 3 at 576
num_layers = 23
decoder_dims_through_depth = (192, 192, 192) + (256,) * 4 + (384,) * num_layers + (576, 576, 576)

autoencoder = MeshAutoencoder(
    encoder_dims_through_depth = (256, 384, 576, 768, 1152),
    decoder_dims_through_depth = decoder_dims_through_depth,
    dim_codebook = 192 * 2,
    codebook_size = 16384 * 2,
)

# dataset here is a MeshDataset from my fork; it pre-computes item['codes'] for each mesh
dataset.generate_codes(autoencoder)

import torch
import random
from tqdm import tqdm 

min_mse, max_mse = float('inf'), float('-inf')
min_coords, min_orgs, max_coords, max_orgs = None, None, None, None
random_samples = []
total_mse = 0.0 

random.shuffle(dataset.data)

for item in tqdm(dataset.data):
    # flatten the pre-computed per-face codes into one token sequence, trimmed to an even length
    codes = item['codes'].flatten().unsqueeze(0)
    codes = codes[:, :codes.shape[-1] // 2 * 2]

    # decode the codes back to face coordinates and compare against the ground-truth triangles
    coords, mask = autoencoder.decode_from_codes_to_faces(codes)
    orgs = item['vertices'][item['faces']].unsqueeze(0)

    mse = torch.mean((orgs.view(-1, 3).cpu() - coords.view(-1, 3).cpu())**2)
    total_mse += mse

    if mse < min_mse:
        min_mse, min_coords, min_orgs = mse, coords, orgs

    if mse > max_mse:
        max_mse, max_coords, max_orgs = mse, coords, orgs

    if len(random_samples) <= 20:
        random_samples.append([coords, orgs])

avg_mse = total_mse / len(dataset.data) 

print(f'MSE AVG: {avg_mse:.10f}, Min: {min_mse:.10f}, Max: {max_mse:.10f}') 
combined_samples = [[min_coords, min_orgs], [max_coords, max_orgs], []] + [[sample[0], sample[1]] for sample in random_samples]
combind_mesh_with_rows('./quad_testing/mse_rows.obj', combined_samples)  # helper defined below

def combind_mesh_with_rows(path, meshes):
    # writes all meshes into a single .obj laid out on a grid: one row per entry in `meshes`,
    # with the decoded and original mesh placed side by side within each row
    all_vertices = []
    all_faces = []
    vertex_offset = 0
    translation_distance = 0.5
    obj_file_content = ""

    for row, mesh in enumerate(meshes):
        for r, faces_coordinates in enumerate(mesh):
            numpy_data = faces_coordinates[0].cpu().numpy().reshape(-1, 3)
            # offset each mesh along x (within the row) and z (between rows) so they don't overlap
            numpy_data[:, 0] += translation_distance * (r / 0.2 - 1)
            numpy_data[:, 2] += translation_distance * (row / 0.2 - 1)

            for vertex in numpy_data:
                all_vertices.append(f"v {vertex[0]} {vertex[1]} {vertex[2]}\n")

            # every 3 consecutive vertices form one triangle (OBJ indices are 1-based)
            for i in range(1, len(numpy_data), 3):
                all_faces.append(f"f {i + vertex_offset} {i + 1 + vertex_offset} {i + 2 + vertex_offset}\n")

            vertex_offset += len(numpy_data)

    obj_file_content = "".join(all_vertices) + "".join(all_faces)

    with open(path, "w") as file:
        file.write(obj_file_content)
nicolasdonati commented 7 months ago

Hi guys! Many thanks for all the help and tips :) I don't have experience with training transformers specifically (I come from another part of deep learning). I will take time to ponder what you said and come back to you then!

MarcusLoppe commented 7 months ago

> Hi guys! Many thanks for all the help and tips :) I don't have experience with training transformers specifically (I come from another part of deep learning). I will take time to ponder what you said and come back to you then!

You can take a look at my demo notebook if you are having some trouble. You can see how to pre-process the mesh in the first code block (with function get_mesh).
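For context, here is a rough sketch of what such a get_mesh-style pre-processing step typically does: load the file, then center and rescale the vertices into a fixed range. This is an illustration of the idea rather than the exact notebook code; trimesh is assumed as the loader.

import numpy as np
import trimesh  # assumed loader; the notebook may use a different one

def get_mesh(file_path):
    mesh = trimesh.load(file_path, force = 'mesh')
    vertices = np.asarray(mesh.vertices, dtype = np.float32)
    faces = np.asarray(mesh.faces, dtype = np.int64)

    # center on the origin and rescale so coordinates fall roughly in [-0.5, 0.5]
    center = (vertices.max(axis = 0) + vertices.min(axis = 0)) / 2
    vertices = vertices - center
    vertices = vertices / (np.abs(vertices).max() * 2)
    return vertices, faces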

But I guess that step 1 is to check the autoencoder output; you can use the code below if you are not using my fork of this repo. If the autoencoder can encode the mesh and decode it by predicting the positions from the 'codes'/'triangles', and the output looks decent, then the fault may be within the transformer.

import torch
import random
from tqdm import tqdm 

min_mse, max_mse = float('inf'), float('-inf')
min_coords, min_orgs, max_coords, max_orgs = None, None, None, None
random_samples = []
total_mse = 0.0 

random.shuffle(dataset.data)

for item in tqdm(dataset.data):
    # tokenize directly here (for those not using the fork, which pre-computes item['codes'])
    codes = autoencoder.tokenize(
                vertices = item['vertices'],
                faces = item['faces'],
                face_edges = item['face_edges']
    )
    codes = codes.flatten().unsqueeze(0)
    codes = codes[:, :codes.shape[-1] // 2 * 2]

    coords, mask = autoencoder.decode_from_codes_to_faces(codes)
    orgs = item['vertices'][item['faces']].unsqueeze(0)

    mse = torch.mean((orgs.view(-1, 3).cpu() - coords.view(-1, 3).cpu())**2)
    total_mse += mse

    if mse < min_mse:
        min_mse, min_coords, min_orgs = mse, coords, orgs

    if mse > max_mse:
        max_mse, max_coords, max_orgs = mse, coords, orgs

    if len(random_samples) <= 20:
        random_samples.append([coords, orgs])

avg_mse = total_mse / len(dataset.data) 

print(f'MSE AVG: {avg_mse:.10f}, Min: {min_mse:.10f}, Max: {max_mse:.10f}') 
combined_samples = [[min_coords, min_orgs], [max_coords, max_orgs], []] + [[sample[0], sample[1]] for sample in random_samples]
combind_mesh_with_rows('./quad_testing/mse_rows.obj', combined_samples)  # helper from the earlier comment
adeerBB commented 6 months ago

@MarcusLoppe Hi, so I had a problem: I had 260 models (each with fewer than 1000 faces) and augmented them 100 times. My autoencoder loss reaches 1.57 after 100 epochs, while my transformer loss is around 3 after 25 epochs. Can you tell me a little about the loss? Does it need to get below 1, or does that depend on the dataset? Also, does the batch size have any effect on this? Thanks

nicolasdonati commented 6 months ago

Hi again! So I tried some things:

  • First of all, do you know if a scheduler could be useful for this training? And if so, how can it be added to the trainer? I got errors that led to recent issues in PyTorch when I tried specifying e.g. a StepLR scheduler.

  • Secondly, regarding what you said about the autoencoder not caring about the order in which the faces are presented to it: I guess that is true for the encoder, as you said, because of the graph-based convolution structure. However, for the decoder that might not be true, right? Since it is based on a 1D convolution, which depends on the order of the tokens.

  • Thirdly, I tried training this with a very particular set of shapes. They are random blobs that I remeshed to 100 faces. I also added jitter to prevent overfitting of the autoencoder. I have around 1000 shapes right now, with a topology rather different from the chair/table setup. Do you think that could work? On my side I could not

MarcusLoppe commented 6 months ago

@nicolasdonati

  1. It might not matter very much; the easy part is the start of the training, and the hard part seems to come when the autoencoder reaches about 0.4 and the transformer around 0.01.

  2. I'm not quite sure; it's probably safer to assume it somehow matters. The encoder will take the edges and the faces, encode them, and output a 1D graph sequence. When the mesh is input, the faces are represented as nodes, but the output would probably be in some sort of order. Then the decoder will use this sequence to predict positions. It probably doesn't matter for the decoder, but the encoder might have a problem with it; I'm actually not sure :)

  3. I think the more 'blob'-like figures might have better luck, since I've noticed during training that the autoencoder usually messes up the prediction of the skinny tables, while the beefier chair shapes always look better. I think it's also connected to how the encoder might not be able to compress and represent the low-poly shapes as well. It might also have something to do with the blur effect as well. An example of what I'm talking about is the shape dataset from PolyGen; it contained a sphere, cone, cylinder and a box. The box contains 12 triangles and the rest 30-100. The funny thing is that the box was the hardest to predict the shape of, and it messed up by angling some sides wrong.

Have you tried feeding it a prompt of tokens when generating? Usually it helps with just a few tokens, and it will kick-start the mesh generation in the correct direction. You can then try to train a transformer on just the first 60 tokens of each mesh and let that kick-start the generation.

Also what are your model specifications for the transformer?

from meshgpt_pytorch import MeshTransformer

transformer = MeshTransformer(
    autoencoder,
    dim = 768,
    coarse_pre_gateloop_depth = 6,  # better performance using more gateloop layers
    fine_pre_gateloop_depth = 4,
    attn_depth = 12,                # or 24
    attn_heads = 8,
    max_seq_len = max_seq,          # max_seq = longest tokenized mesh in your dataset
    condition_on_text = True,
    text_condition_model_types = "bge",
    text_condition_cond_drop_prob = 0.01,
)

Feeding tokens:

import random

# take the first ~30% of each mesh's tokens as a prompt, one mesh per text label
# (`labels`, `dataset`, `transformer`, `folder` and `combind_mesh` are assumed to already exist)
token_length_percent = 0.30
codes = []
texts = []
random.shuffle(dataset.data)
for label in list(labels)[:2]:
    for item in dataset.data:
        if item['texts'] == label:
            num_tokens = int(item["codes"].shape[0] * token_length_percent)

            texts.append(item['texts'])
            codes.append(item["codes"].flatten()[:num_tokens].unsqueeze(0))
            break

coords = []
for text, prompt in zip(texts, codes):
    print(f"Generating {text} with {prompt.shape[1]} tokens")
    faces_coordinates = transformer.generate(texts = [text], prompt = prompt, temperature = 0)
    coords.append(faces_coordinates)

combind_mesh(f'{folder}/text+prompt_all.obj', coords)
MarcusLoppe commented 6 months ago

> @MarcusLoppe Hi, so I had a problem: I had 260 models (each with fewer than 1000 faces) and augmented them 100 times. My autoencoder loss reaches 1.57 after 100 epochs, while my transformer loss is around 3 after 25 epochs. Can you tell me a little about the loss? Does it need to get below 1, or does that depend on the dataset? Also, does the batch size have any effect on this? Thanks

@adeerBB Not sure about that, it seems very bad. It might be because the autoencoder is a bit too small; try giving this a go.

Increasing the encoder & decoder sizes and the codebook dim will let it describe the tokens a bit better. Using just 64 dims per vector might be too low; I increased it so each codebook entry uses 128 x 3 dims.

When training the autoencoder I use a batch size of 64 (with grad_accum_every at 2-4 until it reaches 1.5 in loss); this promotes generalization about shapes. Since the transformer requires a bit more VRAM, I usually can only fit a batch size of 8, so I set grad_accum_every to 8 to compensate and 'create' an effective batch size of 64.

from meshgpt_pytorch import MeshAutoencoder

# deeper decoder: 3 layers at 128, 4 at 192, 23 at 256, then 3 at 384
num_layers = 23
decoder_dims_through_depth = (128,) * 3 + (192,) * 4 + (256,) * num_layers + (384,) * 3

autoencoder = MeshAutoencoder(
    encoder_dims_through_depth = (256, 384, 576, 768, 1152),
    decoder_dims_through_depth = decoder_dims_through_depth,
    dim_codebook = 192 * 2,
    codebook_size = 16384,
    dim_area_embed = 32 * 1,
).to("cuda")
adeerkhan commented 6 months ago

@MarcusLoppe Sorry for the late response, thanks, I'll give a go.

adeerkhan commented 6 months ago

@MarcusLoppe Hi, so I had a question: have you tried transfer learning for this? For example, if I train my model on Objaverse, can I then fine-tune the model on my specific set of data? This way we could gain the shape-generation capability from a large dataset.
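(A minimal sketch of how that fine-tuning could look, assuming a checkpoint saved with torch.save and the trainer parameters sketched earlier; the checkpoint path and dataset name are hypothetical:)

import torch
from meshgpt_pytorch import MeshTransformerTrainer

# load weights pretrained on the large dataset (hypothetical checkpoint path)
state_dict = torch.load('objaverse_pretrained_transformer.pt', map_location = 'cpu')
transformer.load_state_dict(state_dict, strict = False)  # strict=False in case conditioning heads differ

# continue training on the smaller, domain-specific dataset with a lower learning rate
finetune_trainer = MeshTransformerTrainer(
    transformer,
    dataset = my_specific_dataset,  # hypothetical: your own dataset
    batch_size = 8,
    grad_accum_every = 8,
    learning_rate = 1e-5,           # lower LR for fine-tuning
    num_train_steps = 5000
)
finetune_trainer()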