lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch

hyper-parameter suggestion #93

Closed (chinmay5 closed this 3 weeks ago)

chinmay5 commented 1 month ago

Thank you so much for the code. I am trying to work with a custom dataset of 116 meshes. I tried to run the model based on @MarcusLoppe's example notebook. Although the code runs, I do not see good reconstructions. Please note that at this point I am only trying to train the VAE, not the autoregressive model.

[screenshot]

Perhaps something is missing in the code, or I am making some fundamental mistake. It would be great if people who have managed to get things running could take a look at the code.

autoencoder = MeshAutoencoder(
        decoder_dims_through_depth=(128,) * 6 + (192,) * 12 + (256,) * 24 + (384,) * 6,
        # codebook_size = 2048 works for a 250-face dataset; a higher face count probably requires 16k (the default is 16384).
        dim_codebook=192,
        dim_area_embed=16,
        dim_coor_embed=16,
        dim_normal_embed=16,
        dim_angle_embed=8,
        attn_decoder_depth=4,
        attn_encoder_depth=2
    ).to("cuda")

The main training loop uses

def train(checkpoint):
    autoencoder = create_model(checkpoint)
    dataset = create_dataset()
    increase_dataset_size(dataset)
    batch_size = 16  # The batch size should be at most 64.
    grad_accum_every = 4
    # Set the largest batch size (max 64) that your VRAM can handle, then use grad_accum_every to reach an effective batch size of 64, e.g. 16 * 4 = 64.
    learning_rate = 1e-3  # Start with 1e-3; at stagnation around 0.35, lower it to 1e-4.

    autoencoder.commit_loss_weight = 0.1  # Set depending on the dataset size; on smaller datasets 0.1 is fine, otherwise try 0.25 to 0.4.
    autoencoder_trainer = MeshAutoencoderTrainer(model=autoencoder, warmup_steps=10, dataset=dataset,
                                                 num_train_steps=10000,
                                                 batch_size=batch_size,
                                                 grad_accum_every=grad_accum_every,
                                                 learning_rate=learning_rate,
                                                 checkpoint_every_epoch=100,
                                                 use_wandb_tracking=False,
                                                 checkpoint_folder=f'{PROJECT_ROOT_DIR}/mesh_on_vessels/checkpoints')
    autoencoder_trainer()

I observe quite a few issues that I am not able to handle.

  1. With lr=1e-3 the commit loss becomes negative and even exceeds -1. I read in other issues that this problem is automatically solved once the dataset size increases; however, I do not have more samples, although the function increase_dataset_size(dataset) increases the number of samples by a factor of 50 using the augmentation code from @MarcusLoppe (a sketch of that kind of augmentation is shown after this list).
  2. What is the appropriate number of steps for which the model should be trained?
  3. In this configuration and with lr = 1e-4, I trained the model for 10000 steps and the loss was stuck at 1.7, which perhaps explains why the reconstructions are so bad. Does that mean I should just allow the model to train for longer?
  4. Is the codebook size and decoder dims appropriate?
  5. I use open3d and reduce the mesh size using simplify_quadric_decimation(800). Is this step a potential problem in the application?
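
For context (and as referenced in point 1 above), here is a minimal sketch of the kind of augmentation meant by increase_dataset_size. This is not @MarcusLoppe's actual augmentation code; the helper names, the rotation-plus-uniform-scale scheme, and the scale range are assumptions for illustration only.

import math
import random

import torch

def augment_mesh(vertices: torch.Tensor, scale_range=(0.9, 1.1)) -> torch.Tensor:
    # hypothetical helper: random rotation about the vertical (y) axis plus a
    # uniform scale, applied to a (num_vertices, 3) tensor of coordinates
    angle = random.uniform(0.0, 2.0 * math.pi)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotation = torch.tensor([
        [cos_a, 0.0, sin_a],
        [0.0,   1.0, 0.0],
        [-sin_a, 0.0, cos_a],
    ], dtype=vertices.dtype)
    scale = random.uniform(*scale_range)
    return (vertices @ rotation.T) * scale

def increase_dataset_size(dataset, copies_per_mesh=50):
    # append `copies_per_mesh` augmented copies of every mesh; the face
    # connectivity stays the same, only the vertex positions change
    augmented = []
    for item in dataset.data:
        for _ in range(copies_per_mesh):
            augmented.append({
                'vertices': augment_mesh(item['vertices']),
                'faces': item['faces'],
            })
    dataset.data.extend(augmented)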

Again, thank you so much for your time and effort. Hope people can give me some pointers to solve this issue.

Thanks, Chinmay

MarcusLoppe commented 1 month ago

@chinmay5

1. The commit loss going negative isn't really a problem; it's more of an indication that the quantization is 'overfitted'. That is expected to happen with a small dataset. It's a more useful metric when dealing with 10k+ meshes.

2 & 3. Not sure about steps, but for a small dataset maybe 20-50 epochs to reach 0.5 recon loss, then 30 more to reach 0.35. If it's still above 1.0 recon loss after 50 epochs, then there is a problem.

4. Looks good; you can probably reduce the codebook size, but it should be fine.

I think the issue is how you simplified your meshes: if a highly detailed mesh has 10k triangles, you cannot reduce it to 800 triangles without breaking it. So if you'd like to preprocess the meshes, you'll need to check whether a mesh got destroyed, either with a metric such as the Hausdorff distance or by visual inspection.
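
For example, a rough sketch of such a check with Open3D (not from this thread; the 800-triangle target, sample count, and relative-error threshold are placeholder values to adjust for your data):

import numpy as np
import open3d as o3d

def decimate_and_check(path, target_triangles=800, num_samples=10000, max_rel_error=0.02):
    # decimate a mesh, then flag it if the simplified version drifts too far
    # from the original (a crude, sampled Hausdorff-style distance check)
    mesh = o3d.io.read_triangle_mesh(path)
    simplified = mesh.simplify_quadric_decimation(target_number_of_triangles=target_triangles)

    pts_orig = mesh.sample_points_uniformly(number_of_points=num_samples)
    pts_simp = simplified.sample_points_uniformly(number_of_points=num_samples)

    d1 = np.asarray(pts_orig.compute_point_cloud_distance(pts_simp))
    d2 = np.asarray(pts_simp.compute_point_cloud_distance(pts_orig))
    hausdorff = max(d1.max(), d2.max())

    # normalise by the bounding-box diagonal so the threshold is scale-independent
    diag = np.linalg.norm(mesh.get_max_bound() - mesh.get_min_bound())
    return simplified, (hausdorff / diag) < max_rel_error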

Here is some code that will let you render the dataset, so you can check whether the data it's training on is corrupted:

all_vertices = []
all_faces = []
vertex_offset = 0
translation_distance = 0.5

# write every mesh in the dataset into a single OBJ file, translated along x so the
# meshes sit side by side; open the result in e.g. Blender or MeshLab to inspect them
for r, item in enumerate(dataset.data):
    vertices_copy = item['vertices'].cpu().numpy()
    vertices_copy[:, 0] += translation_distance * (r / 0.2 - 1)

    for vertex in vertices_copy:
        all_vertices.append(f"v {float(vertex[0])} {float(vertex[1])} {float(vertex[2])}\n")
    for face in item['faces']:
        all_faces.append(f"f {face[0] + 1 + vertex_offset} {face[1] + 1 + vertex_offset} {face[2] + 1 + vertex_offset}\n")
    vertex_offset = len(all_vertices)  # OBJ face indices are global across the whole file

obj_file_content = "".join(all_vertices) + "".join(all_faces)

obj_file_path = "./3d_models_inspect.obj"
with open(obj_file_path, "w") as file:
    file.write(obj_file_content)

chinmay5 commented 1 month ago

Hi @MarcusLoppe ,

Thank you so much for the quick response. I double-checked the statistics of my dataset. Currently, 90% of the meshes contain around 4k faces at max. There are around 10 meshes that contain a large number of faces (~10k). Does it make sense to increase the number of faces to 1.5k for the simplify_quadric_decimation, or should I just remove these very large meshes?

Just to confirm: I use a batch size of 16. With 50 augmentations, the dataset is 116 * 50 samples, so a single epoch is 363 iterations and 50 epochs take around 18,000 iterations. I will restart the training with the same configuration for 20k iterations. If the reconstruction loss still remains around 1.6, then I have some bug in the code. I can imagine the decimation is a problem. Just to be safe, can you please confirm whether the hyper-parameters look suitable for the small dataset?

One last thing: you mentioned two rounds of execution. The first 50 epochs should be executed with lr=1e-3 and the next 30 epochs with lr=1e-4?

MarcusLoppe commented 1 month ago

> Hi @MarcusLoppe,
>
> Thank you so much for the quick response. I double-checked the statistics of my dataset. Currently, 90% of the meshes contain around 4k faces at max. There are around 10 meshes that contain a large number of faces (~10k). Does it make sense to increase the number of faces to 1.5k for the simplify_quadric_decimation, or should I just remove these very large meshes?

I've only tested the autoencoder & transformer on meshes with at most 2k triangles; longer than that would require quite a lot of compute and model size. The autoencoder can handle 2k+ meshes fine, but the issue is with the transformer for highly detailed meshes. So I'd recommend that you remove the meshes that cannot be decimated below 2k without major corruption.

> Just to confirm: I use a batch size of 16. With 50 augmentations, the dataset is 116 * 50 samples, so a single epoch is 363 iterations and 50 epochs take around 18,000 iterations. I will restart the training with the same configuration for 20k iterations. If the reconstruction loss still remains around 1.6, then I have some bug in the code. I can imagine the decimation is a problem. Just to be safe, can you please confirm whether the hyper-parameters look suitable for the small dataset?

The hyper-parameters look fine, but since you're overfitting the model on a small dataset, I think it would be fine with just a couple of augmentations (x5) or with no augmentations at all. The augmentations are meant to make the model more robust, but in your case they probably won't make a difference.

I think the issue is with the dataset, and it's currently "working" because it replicates the broken meshes. Could you run the code I posted above to inspect the dataset? Then you'll know whether the decimation is the problem.

> One last thing: you mentioned two rounds of execution. The first 50 epochs should be executed with lr=1e-3 and the next 30 epochs with lr=1e-4?

Correct, 1e-3 till 0.4 recon loss and then switch to 1e-4.
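
In code, that two-phase schedule can look roughly like the sketch below. It assumes the phase-1 weights are saved as a plain state_dict and reuses the autoencoder, dataset, and batch settings from the train() function above; checkpoints written by the trainer itself may bundle extra state and need to be unpacked differently.

import torch

# end of phase 1 (lr = 1e-3, trained until the recon loss stalls around ~0.4)
torch.save(autoencoder.state_dict(), 'checkpoints/autoencoder_phase1.pt')

# phase 2: reload those weights and continue with lr = 1e-4
autoencoder.load_state_dict(torch.load('checkpoints/autoencoder_phase1.pt'))

autoencoder_trainer = MeshAutoencoderTrainer(model=autoencoder,
                                             dataset=dataset,
                                             warmup_steps=10,
                                             num_train_steps=10000,
                                             batch_size=batch_size,
                                             grad_accum_every=grad_accum_every,
                                             learning_rate=1e-4,  # lowered for the second phase
                                             checkpoint_every_epoch=100,
                                             checkpoint_folder='checkpoints/phase2')
autoencoder_trainer()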

chinmay5 commented 1 month ago

Hi @MarcusLoppe. I have restarted the training, and now the loss is looking a bit better. Thanks for your inputs. These results are for the "first phase", where I am training with a learning rate of 1e-3. I will start the next phase once the training is complete.

[screenshot]

As you mentioned, the current dataset size is small (116 samples). In your opinion, what will be a good sample size? What is a good sample size beyond which the autoencoder will learn codes that can be generalized? Or is 116 samples with 50 augmentations already sufficient?

I am attaching an example of a mesh before and after the decimation operation.

[screenshot]

[screenshot]

Is this a good candidate to remove from the training?

I will also start a new training iteration after removing meshes that were decimated poorly. However, I will wait for the current training to complete. I will try to respond as soon as the current training is over.

Thanks again for your assistance and sharing your expertise.

chinmay5 commented 1 month ago

Hi @MarcusLoppe,

The first phase of training is complete, and I do see better results.

[screenshot]

[screenshot]

So, at least the autoencoder part seems to be working fine. I will now perform the second stage of training with lr=1e-4 to try and get to ~0.3 loss.

It would be great if you could comment on the questions from the previous post, especially about the dataset size and the decimation results.

Again, thanks so much for your help.

chinmay5 commented 1 month ago

Hi @MarcusLoppe and @lucidrains, I plan to invest some time in collecting more data. Since the two of you are much more experienced with model training, could you please answer the following:

As you mentioned, the current dataset size is small (116 samples). In your opinion, what will be a good sample size? What is a good sample size beyond which the autoencoder will learn codes that can be generalized? Or is 116 samples with 50 augmentations already sufficient?

Thanks again for all your help. Best, Chinmay

MarcusLoppe commented 1 month ago

> Hi @MarcusLoppe and @lucidrains, I plan to invest some time in collecting more data. Since the two of you are much more experienced with model training, could you please answer the following:
>
> As you mentioned, the current dataset size is small (116 samples). In your opinion, what will be a good sample size? What is a good sample size beyond which the autoencoder will learn codes that can be generalized? Or is 116 samples with 50 augmentations already sufficient?
>
> Thanks again for all your help. Best, Chinmay

Hi,

For the model to learn how to generate novel meshes it might require 100k+ samples. It's quite a hard task; you can look up when image generation with tokens became able to produce novel images, and how much data that required. The purpose of the augmentation is to make the model more robust against small deviations, in both the auto-encoder and the transformer.

Using a smaller dataset is fine, since in most use cases I've seen the goal is mainly to replicate 3D models, so instead of searching through 100k models you can just prompt for them.

The decimated mesh looks surprisingly good. The main issues to look for are holes, and cases where the decimation ends up just removing half the triangles so there are hundreds of holes (like in your first screenshots).

I've gotten to 0.338 using 100k 3D meshes with at most 1,000 triangles each, so it's capable of reaching those levels.

Although I'm a bit worried about your loss curve: the recon loss should only go down or stall, not go up. That is an indication of instability. Try increasing the commit loss weight to 0.3, and maybe lower the effective batch size by reducing grad accumulation to 2 (from 4). It can also be due to the small dataset; does the loss still go up if you remove all the extra augmentations?
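
In code, relative to the training setup posted earlier in the thread, those two tweaks are simply:

autoencoder.commit_loss_weight = 0.3  # up from 0.1, to stabilise the quantization
grad_accum_every = 2                  # effective batch size 16 * 2 = 32 instead of 64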

chinmay5 commented 1 month ago

Hi @MarcusLoppe, I am training with augmentations. Perhaps that is why the loss is not stable. I will change the hyper-parameters based on your suggestion. Just to confirm: the best I can hope for with approximately 500 meshes is to replicate the training data, and at this scale there is no chance of generating novel meshes? (The meshes need not be very complex or very different from the training samples.)

Best, Chinmay

MarcusLoppe commented 1 month ago

> Hi @MarcusLoppe, I am training with augmentations. Perhaps that is why the loss is not stable. I will change the hyper-parameters based on your suggestion.

Let me know how it goes!

> Just to confirm: the best I can hope for with approximately 500 meshes is to replicate the training data, and at this scale there is no chance of generating novel meshes? (The meshes need not be very complex or very different from the training samples.)
>
> Best, Chinmay

Very unlikely; the logic behind a mesh is much more complex than that of an image, due to the connectivity and other factors.

lucidrains commented 1 month ago

> The first phase of training is complete, and I do see better results.
>
> [screenshots]

is this 3d ct angiography?

chinmay5 commented 1 month ago

Hi @lucidrains , the dataset is built from MRA images (https://arxiv.org/pdf/2003.02920). Since these are medical images, the dataset size is much smaller compared to the vision counterpart.

Best, Chinmay

chinmay5 commented 1 month ago

Hi @MarcusLoppe, I trained the model with the suggested hyper-parameters. Here is a screenshot; the loss seems to be stable at around 0.35.

[screenshot]

I wanted to train the GPT as well. However, its loss decreases very slowly. Is that because of the small dataset size?

Here are the hyper-parameters I am using:

gpt_transformer = MeshTransformer(
        autoencoder,
        dim=768,
        coarse_pre_gateloop_depth=3,
        fine_pre_gateloop_depth=3,
        attn_depth=12,
        attn_heads=12,
        max_seq_len=max_seq,
        condition_on_text=False,
        gateloop_use_heinsen=False,
        dropout=0.0,
    )

The training loop uses:

batch_size = 8  # Max 64
grad_accum_every = 4

learning_rate = 1e-1  # Note: higher than the suggested schedule (start at 1e-2, then lower to 1e-3 at stagnation or around 0.5 loss).
gpt_trainer = MeshTransformerTrainer(model=gpt_transformer, warmup_steps=1000, num_train_steps=10000,
                                dataset=dataset,
                                grad_accum_every=grad_accum_every,
                                learning_rate=learning_rate,
                                batch_size=batch_size,
                                checkpoint_every=1000,
                                checkpoint_folder=f'{save_dir}/checkpoints'
                          )

Here is the loss after 400 iterations, so more than one epoch. I am using a larger learning rate than in your code. Is the slow rate of loss decrease expected?

[screenshot]
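
For reference, a minimal sketch of sampling from the transformer to inspect its outputs, assuming the generate API shown in the repository README (it returns padded face coordinates plus a face mask); the OBJ dump mirrors the inspection snippet earlier in the thread:

face_coords, face_mask = gpt_transformer.generate()  # (batch, faces, 3 vertices, 3 coords) + mask

coords = face_coords[0][face_mask[0]].cpu()  # keep only the valid faces of the first sample

lines, vertex_offset = [], 0
for face in coords:
    for vertex in face:
        lines.append(f"v {float(vertex[0])} {float(vertex[1])} {float(vertex[2])}\n")
    lines.append(f"f {vertex_offset + 1} {vertex_offset + 2} {vertex_offset + 3}\n")
    vertex_offset += 3

with open("generated_sample.obj", "w") as f:
    f.writelines(lines)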

lucidrains commented 1 month ago

> Hi @lucidrains, the dataset is built from MRA images (https://arxiv.org/pdf/2003.02920). Since these are medical images, the dataset size is much smaller compared to the vision counterpart.
>
> Best, Chinmay

thought they looked familiar :smile: also my worst nightmare

chinmay5 commented 1 month ago

@lucidrains Because of the small dataset size, or some other difficulty with medical images?

lucidrains commented 1 month ago

@chinmay5 aneurysms lol

lucidrains commented 1 month ago

but yea, medical data is a nightmare too

chinmay5 commented 1 month ago

@MarcusLoppe These are the logs after 500 epochs

[screenshot]

MarcusLoppe commented 1 month ago

> I wanted to train the GPT as well. However, its loss decreases very slowly. Is that because of the small dataset size?

Few things:

You might want to consider training the auto-encoder using 1 quantizer to get the sequence length down. I've been able to do 500-triangle meshes using GPT-small, but at 1000 triangles I had to switch to GPT-medium.

You can lower the learning rate to 1e-3. It's very unstable above 1e-2.

I haven't trained using a tiny dataset for a while, but getting below 2.0 loss should be pretty easy.

Conditioning the model using text goes faster, so you might want to consider that.

chinmay5 commented 1 month ago

Hi @MarcusLoppe,

The max_seq_len is 4800 (2 quantizers, 800 faces, and 3 vertices per face).

> You might want to consider training the auto-encoder using 1 quantizer to get the sequence length down. I've been able to do 500-triangle meshes using GPT-small, but at 1000 triangles I had to switch to GPT-medium.

So, would you suggest moving to GPT-medium? I can also try reducing the quantization level, but it might reduce the quality of the autoencoder.

> You can lower the learning rate to 1e-3. It's very unstable above 1e-2.

I have started another training run with a 1e-3 learning rate.

> I haven't trained using a tiny dataset for a while, but getting below 2.0 loss should be pretty easy.

Yes, I saw that the loss got close to approximately 1.1.

[screenshot]

> Conditioning the model using text goes faster, so you might want to consider that.

I do not have a direct textual conditioning candidate available for the dataset. However, I will try to explore a possible way.

Unfortunately, the GPT outputs at this loss value do not look very good.

[screenshot]

I will train a bit longer. Maybe things improve at a lower loss value.

Thanks.

MarcusLoppe commented 1 month ago

> Hi @MarcusLoppe,
>
> The max_seq_len is 4800 (2 quantizers, 800 faces, and 3 vertices per face).
>
> You might want to consider training the auto-encoder using 1 quantizer to get the sequence length down. I've been able to do 500-triangle meshes using GPT-small, but at 1000 triangles I had to switch to GPT-medium.
>
> So, would you suggest moving to GPT-medium? I can also try reducing the quantization level, but it might reduce the quality of the autoencoder.

Using 1 quantizer is fine; it may 1.5x the training time, but in your case that should be negligible. The benefit is that the token sequence goes from 4800 to 2400, which helps the fine decoder that processes the entire sequence. I've trained a GPT-small on 40-50k objects with a max triangle count of 500; I used 1 quantizer and will be releasing that model shortly. The output is near perfect, so I don't see any issues from using a small dataset.

I had to switch up to GPT-medium when I approached 80k objects along with 1000 faces, so I think you should be fine.

> You can lower the learning rate to 1e-3. It's very unstable above 1e-2. I have started another training run with a 1e-3 learning rate.
>
> I haven't trained using a tiny dataset for a while, but getting below 2.0 loss should be pretty easy.
>
> Yes, I saw that the loss got close to approximately 1.1.

Is the loss stuck around 1.0?

> Conditioning the model using text goes faster, so you might want to consider that.
>
> I do not have a direct textual conditioning candidate available for the dataset. However, I will try to explore a possible way.
>
> Unfortunately, the GPT outputs at this loss value do not look very good.
>
> I will train a bit longer. Maybe things improve at a lower loss value.
>
> Thanks.

My advice is to use 1 quantizer for the auto-encoder, or consider switching to GPT-medium. I recommend that you try 1 quantizer first.
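
For reference, a sketch of the 1-quantizer setup. This assumes the MeshAutoencoder constructor exposes num_quantizers (as in recent versions of the library); the other arguments mirror the configuration earlier in the thread. With 1 quantizer, an 800-face mesh gives 800 * 3 * 1 = 2400 tokens instead of 800 * 3 * 2 = 4800.

autoencoder = MeshAutoencoder(
        num_quantizers=1,  # assumed constructor argument; 1 instead of the default 2 halves the tokens per face
        dim_codebook=192,
        dim_area_embed=16,
        dim_coor_embed=16,
        dim_normal_embed=16,
        dim_angle_embed=8,
        attn_encoder_depth=2,
        attn_decoder_depth=4,
        decoder_dims_through_depth=(128,) * 6 + (192,) * 12 + (256,) * 24 + (384,) * 6,
    ).to("cuda")

max_faces = 800
max_seq = max_faces * 3 * 1  # 2400 tokens, down from 4800 with 2 quantizers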