Hi Jionghao and thanks for the kind words
Your results line up with some papers I've been reading recently. Could you try a pixelnorm, either in place of the groupnorm or in the direct main path, and see if it leads to comparable results to your layernorm run? Today is my last day open sourcing, but I can throw in this last change if you get the experiments to me in time
Sure, I will try it! But I will have to get back to you after I wake up in the morning, in no less than 7-8 hours... I highly suspect the overfitting results will be similar, at least on my humble small dataset. From my understanding, as long as the norm is taken over the 128-dim feature dimension and has nothing to do with the sequence-length dimension, the results should be fine. But let's wait for the results. btw, why do you favor pixelnorm over layernorm? I am not very familiar with pixelnorm's advantages.
@shanemankiw there's a trend in transformers to remove the mean centering in layernorms (rmsnorm), so it lines up with Tero Karras' usage of pixelnorm
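For context, here is a minimal sketch of the two norms being discussed (my own illustration, not necessarily the exact code that later landed in the repo): an RMSNorm-style layernorm that drops the mean centering, and a Karras-style pixelnorm that rescales each position's feature vector to unit RMS. Both normalize only over the feature dimension, so the statistics at a valid position never depend on how much padding the sequence has.

import torch
from torch import nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # layernorm without mean centering: rescale each feature vector by its RMS
    def __init__(self, dim, eps = 1e-8):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (..., dim)
        rms = x.pow(2).mean(dim = -1, keepdim = True).clamp(min = self.eps).sqrt()
        return x / rms * self.gamma

class PixelNorm(nn.Module):
    # Karras-style pixelnorm: unit-RMS normalize the feature vector at every position
    def __init__(self, dim = -1, eps = 1e-4):
        super().__init__()
        self.dim = dim
        self.eps = eps

    def forward(self, x):
        return F.normalize(x, dim = self.dim, eps = self.eps) * (x.shape[self.dim] ** 0.5)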
Thank you for the heads up about pixelnorm, very informative! The experiment results are indeed similar; the loss curves look almost the same (ignore the 100/280 difference in the run name...). And the mesh reconstructions are flawless as well.
@shanemankiw thank you for those results 🙏 i've made the change in 1.1.0
just in the nick of time! alright, time to get back to those emails. go make the holodeck happen 😉
@lucidrains @shanemankiw
At the start of the project I tested smaller group sizes and even used layernorm for a while during testing, but GroupNorm seemed better because it gave bigger loss improvements.
But I ran into the issue that 90% of the models reconstructed perfectly while the rest got massively screwed up. I thought it might be that some shapes were easier to generalize than others. But it might have been that some of the shapes were being normalized too much, so anything outside the 'norm' of the batch average got normalized until it was squished. For example, the dataset contains many similar-looking thick chairs and tables, but for models that are a bit out of the norm, like a one-legged chair or a super-thin glass table, it messed them up quite a bit.
I've been testing with layernorm now and it seems like that issue is gone! The 'catastrophic forgetting' is no longer a problem, and it seems like even a 2k codebook manages to store the shapes accurately! Using such a small codebook might even mean I can release a little demo, since the transformer won't take too much time to train. After that I'll start right away on the holodeck.
Here is an example of the 'catastrophic forgetting' (I think it had around 0.4 loss with a somewhat bigger parameter count (33M+), probably 12h+ of training):
vs. training using (almost) the same parameters as the paper (15M), which would never have worked in the past: 0.43 loss after 4hrs:
PS: Here are the parameters. I used smaller embedding dims since larger ones caused some problems (the defaults create a total embedding dimension of about 840 vs the paper's 192).
from meshgpt_pytorch import MeshAutoencoder

num_layers = 23

autoencoder = MeshAutoencoder(
    decoder_dims_through_depth = (128,) * 3 + (192,) * 4 + (256,) * num_layers + (384,) * 3,
    dim_codebook = 192,
    codebook_size = 2048,
    dim_area_embed = 16,
    dim_coor_embed = 16,
    dim_normal_embed = 16,
    dim_angle_embed = 8
)
@lucidrains Thanks for your efforts!
@MarcusLoppe Thanks for the experiments! The 'catastrophic forgetting' problem you talked about is precisely the problem that made me start debugging. You would think that a model this size could figure out a way to overfit on a few hundred meshes, but it always fails on around 10% of the cases. In the paper, MeshGPT achieves 98% accuracy even on the test set, so this is definitely not normal... The thing about the loss is that, even if you can achieve a low loss under GroupNorm at batchsize>1, the output would not be the same during evaluation at batchsize=1.
@MarcusLoppe awesome! thanks for the corroboration!
you should switch into the field.. i really think you have a lot of potential
even your name is initialed ML lol
PS: i'm not kidding about the holodeck. in a decade, mark my words
In the above I used 150 chairs and 150 tables and augmented each x50, so the dataset is 15,000 meshes. I ran a test over all 15,000 meshes and got these MAE results: avg 0.004, min 0.0028, max 0.016. I ran the code below to calculate the MAE; as you can see, it's not batched.
I'm pretty sure that it's possible to get great results as in the paper.
import torch
from tqdm import tqdm

for item in tqdm(dataset.data, desc="Processing samples"):
    # tokenize one mesh at a time (batch size 1, so no padding is involved)
    codes = autoencoder.tokenize(
        vertices = item['vertices'],
        faces = item['faces'],
        face_edges = item['face_edges']
    )
    codes = codes.flatten().unsqueeze(0)
    codes = codes[:, :codes.shape[-1] // 2 * 2]   # keep an even number of codes

    coords, mask = autoencoder.decode_from_codes_to_faces(codes)
    orgs = item['vertices'][item['faces']].unsqueeze(0)

    # mean absolute error between original and reconstructed coordinates
    abs_diff = torch.abs(orgs.view(-1, 3).cpu() - coords.view(-1, 3).cpu())
    mae = torch.mean(abs_diff)
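To get the avg/min/max figures quoted above, one way (my addition, not part of the original snippet) is to define maes = [] before the loop, append mae.item() to it each iteration, and then summarize:

print(f"Avg: {sum(maes) / len(maes):.4f}, Min: {min(maes):.4f}, Max: {max(maes):.4f}")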
While running I stored the worst and best results; in the image the first row is the best, the second row is the worst, and the rest are 40 random samples. In my book, that is a perfect result!
Here is the worst mesh; you can see some defects, but that's pretty good after a few hours of training!
I trained across 16 different categories with 50 models each (800 models total) and augmented them x100 (80k meshes). I let it run for about 10hrs and got 0.5 mse loss. The results usually get good at 0.4 loss, so some fragments are expected. I used a 2k codebook size to test whether the chairs and tables were just such simple shapes that they could be compressed into a small codebook, but it seems like even loads of different shapes can be compressed! One hint about the codebook size, though, is that the commit loss was high when I restarted the training run; it usually gets lower after training in the same session for a while.
I heard about ring attention on the Last Week in AI podcast; it seems like they used it together with sparse attention and a dozen other small things. I'm not quite sure it lives up to the hype. In the tests I've seen they ask it about one thing in the context window, but what if you ask an abstract question for which it needs to find 10-20 needles in the haystack/context window? 😕
Maybe, but I'm not an ML programmer and don't know how to debug a model the way @shanemankiw did; if I did, I might have been able to resolve this issue a long time ago :(
But I like using and training models in my software. For example, I used Mistral-7B to extract the requirements from job adverts and output them as JSON, pulling out information such as hard skills, soft skills, certifications, company culture, education and other qualifications. It's not perfect, but I got around 89k labels from about 4k job adverts in 12 hours. I then extracted the data from the JSON output and fine-tuned a reranker on the different labels, using unmatched body text or the other labels as negatives.
I then converted it to an ONNX model, used it in my ASP.NET backend, and made a nice little React front-end. This way you can quickly sift through many job ads and don't have to waste time reading the whole thing just to realize they want 7+ years of experience :)
Notice the 'job duties' it marked? :) It knows too much 😨
@MarcusLoppe Your results are great! Thank you so much for sharing. All of this with only a 2k codebook? This thing sure has a lot of potential. btw, I don't know if I am qualified to say this, but I concur with all the nice things @lucidrains said about you. My tests on this project could not have gone anywhere without your notebook demo! Besides, the way you design and present your experiments is fantastic, and your results in multiple issues have been extremely helpful.
Correct, only 2k. Here it is at 0.42 loss; there are some fragments, but there was still room for the loss to improve. During an 11hr run on Kaggle's free P100 I went from 0.45 @ 0.8 commit loss to 0.4235 @ 0.58 commit loss. I think this means it can still compress the meshes some more.
Thank you very much :) I appreciate your and @lucidrains' comments; not many people in real life care about this, so it's refreshing and heartwarming to get some compliments :)
https://file.io/Mpg7AoUYoBgC (the mse_rows(63) file contains the original models plus the reconstructions)
@shanemankiw I got some strange results... I was thinking about how AlphaGeometry managed to get results with a relatively small model; it has a vocab of 757 tokens and a 1024 context window. They talked about testing with a small vocab size to compress the information and reduce complexity. So I used 400 meshes (fewer than 250 faces each) from 16 categories and augmented them x50, resulting in about 20k meshes.
Then I tested with a 128 codebook size and had great success: 0 fragments, and it took about 2hrs to reach 0.44 loss. The commit loss was consistently low, so I guess reaching any sort of good result requires you to somehow estimate the right codebook size. That sort of explains the earlier bad results, and it means you need to adjust the model to the dataset.
You'll probably need a bigger codebook for more meshes, but when dealing with a smaller test dataset it's probably better to use a smaller codebook size.
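As a concrete illustration of that last point, the only change from the configuration posted earlier in this thread is the codebook_size argument; the other values below simply mirror that earlier snippet and are not a recommendation:

autoencoder = MeshAutoencoder(
    dim_codebook = 192,
    codebook_size = 128,   # much smaller codebook for a small or test dataset
    dim_area_embed = 16,
    dim_coor_embed = 16,
    dim_normal_embed = 16,
    dim_angle_embed = 8
)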
Hi,
Thanks for your code. Your implementation is an amazing starting point for further research based on MeshGPT. However, I could not make it overfit on a small dataset of around 200 triangle shapes (the shapes have varying numbers of faces) when using batchsize > 1. I strongly suspect it is because you used GroupNorm instead of LayerNorm in your decoder resblocks here: https://github.com/lucidrains/meshgpt-pytorch/blob/f8e30edf5e52b819034cc8e00e28451d1498c6ac/meshgpt_pytorch/meshgpt_pytorch.py#L262
I found during debugging that the outputs of self.norm(x)[mask] and self.norm(x[mask]) (not exactly the code, but you get the idea) are significantly different with GroupNorm. So models trained at batchsize>1 (when the mask matters) produce wrong meshes when evaluated at batchsize=1. So I rewrote it with LayerNorm:
class Block(Module):
    def __init__(self, dim, dim_out = None, groups = 8, dropout = 0.0):
        super().__init__()
        dim_out = default(dim_out, dim)
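The rest of the class is truncated above. As a rough sketch of the idea only (not the commenter's exact rewrite, which they note includes other personal changes, and assuming the block wraps a 1D convolution over (batch, channels, seq) tensors), a layernorm-based block that normalizes only the feature dimension per position could look like this:

import torch
from torch import nn

class LayerNormBlock(nn.Module):
    # hypothetical sketch: normalize only over channels per position,
    # so padded positions cannot shift the statistics of valid ones
    def __init__(self, dim, dim_out = None, dropout = 0.):
        super().__init__()
        dim_out = dim_out if dim_out is not None else dim
        self.proj = nn.Conv1d(dim, dim_out, 3, padding = 1)
        self.norm = nn.LayerNorm(dim_out)
        self.act = nn.SiLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask = None):
        # x: (batch, channels, seq); mask: (batch, 1, seq) bool, True at valid positions
        if mask is not None:
            x = x.masked_fill(~mask, 0.)
        x = self.proj(x)
        # layernorm expects the feature dim last, so transpose around the norm
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        x = self.act(x)
        return self.dropout(x)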
After switching to layernorm, it overfits fairly smoothly at large batch sizes and also works for bs=1. Note that I also made some other changes for my personal use, but I think this normalization choice is the key factor here.
Looking forward to your opinion on this.
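To make the batch-size mismatch concrete, here is a small self-contained check (my own illustration, not code from the repository) showing that zero padding along the sequence changes GroupNorm's output at the valid positions, while a per-position layernorm is unaffected:

import torch
from torch import nn

torch.manual_seed(0)

dim, groups = 128, 8
seq_len, pad = 100, 60            # valid length vs extra zero padding

x = torch.randn(1, dim, seq_len)                               # (batch, channels, seq)
x_padded = torch.cat([x, torch.zeros(1, dim, pad)], dim = -1)

gn = nn.GroupNorm(groups, dim)
ln = nn.LayerNorm(dim)

# GroupNorm reduces over channels within a group *and* over the sequence,
# so the zero padding shifts the statistics of the valid positions
gn_diff = (gn(x_padded)[..., :seq_len] - gn(x)).abs().max()

# LayerNorm applied per position over the feature dim never sees the padding
ln_diff = (ln(x_padded.transpose(1, 2))[:, :seq_len] - ln(x.transpose(1, 2))).abs().max()

print(f'groupnorm max diff: {gn_diff.item():.4f}')   # clearly nonzero
print(f'layernorm max diff: {ln_diff.item():.4f}')   # ~0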