fire opened this issue 11 months ago
yup sounds good! just put all the functions into one file, say augment.py, and if you want to go the distance, have ways to compose / chain any number of augmentations
@fire scale and rotation will go a long way
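Something like this could work as a starting point for composing/chaining them (a minimal sketch; the function names here are made up, not from the repo):

import numpy as np

def scale_augment(vertices, low=0.8, high=1.0):
    # Uniformly scale all vertices by a random factor
    return vertices * np.random.uniform(low, high)

def rotate_y_augment(vertices):
    # Rotate around the vertical (y) axis by a random angle
    angle = np.radians(np.random.uniform(-180, 180))
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return vertices @ rot.T

def compose(*augments):
    # Chain any number of augmentations into a single callable
    def apply(vertices):
        for aug in augments:
            vertices = aug(vertices)
        return vertices
    return apply

# augment = compose(scale_augment, rotate_y_augment)
# new_vertices = augment(vertices)   # vertices: (N, 3) numpy array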
Here's what my current augments do.
vs original
Edited:
There's a bias near the center D:
The bias is removed.
I have to go for now.
See def augment_mesh(self, base_mesh, augment_count, augment_idx):
Edited: removed seed
@lucidrains Can you post something for me to extract the resulting mesh from the autoencoder?
You mentioned the topic of overfitting as a first step.
I added the Blender monkey as a validation of mesh input through an autoencoder as an initial step.
I want to send another monkey to the autoencoder and get the same monkey out again. How do I do that?
I was able to train for 1 step; it outputs a garbage glb 🎉
I have been using Marcus's provided notebook file to try that, and I am also getting bad obj results. I am going to try the latest @lucidrains changes tomorrow in this notebook; maybe you can give it a try or a look, or maybe you are ahead of what I am using. 😆 Thanks! https://drive.google.com/file/d/1gpLjbnH1WUH6U50MJKrw-8BV6S_-3KH1/view?usp=sharing
I am getting bad mesh results too, but it's trying. The selected mesh is the output; the background is the base mesh.
Just for testing purposes, give it a go without the data augmentation. I think there need to be some more improvements to the model, plus it will take a long time to train with the data augmentation. In the paper they used 28 000 shapes and trained the encoder on 2x A100 for 2 days and the transformer on 4x A100 for 5 days. So it will need lots of training data and time.
When I have been successful, the encoder loss was below 0.200-0.250 and the loss for the transformer was around 0.00007. So if you can get the loss using the data augmentation down to those levels it will probably work, but that will require lots of training.
Here are some details from the paper: they only use scaling and jitter-shift. So remove translation & rotation and see if that helps.
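For reference, a minimal sketch of what scaling + jitter-shift only could look like (the parameter ranges here are guesses, not taken from the paper):

import numpy as np

def scale_and_jitter(vertices, scale_range=(0.75, 1.25), jitter_std=0.01):
    # vertices: (N, 3) array. Random uniform scale, then a small random shift (jitter).
    scale = np.random.uniform(*scale_range)
    jitter = np.random.normal(0.0, jitter_std, size=(1, 3))
    out = vertices * scale + jitter
    # Keep coordinates inside the unit cube the discretizer expects (assumption)
    return np.clip(out, -1.0, 1.0)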
I am currently at:
loss: 1.255
loss: 1.500
loss: 1.786
loss: 1.596
loss: 1.941
loss: 1.583
loss: 1.895
loss: 1.904
So maybe I can dream about 0.200 - 0.250 loss.
How many steps is that at? I require about 2000 steps, since 200 x 10 epochs = 2000. Also implement tqdm, since printing can slow things down quite a lot.
Try only doing scaling and see; it will probably go better.
You can give it a go with my forked version @ https://github.com/MarcusLoppe/meshgpt-pytorch/tree/main
The data MeshDataset expects is an array of:
obj_data = {"texts": "chair", "vertices": vertices, "faces": faces}
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

class MeshDataset(Dataset):
    def __init__(self, obj_data):
        self.obj_data = obj_data
        print(f"Got {len(obj_data)} data")

    def __len__(self):
        return len(self.obj_data)

    def __getitem__(self, idx):
        return self.obj_data[idx]
from meshgpt_pytorch import (
    MeshTransformerTrainer,
    MeshAutoencoderTrainer
)

autoencoder_trainer = MeshAutoencoderTrainer(
    model = autoencoder,
    learning_rate = 1e-3,
    warmup_steps = 10,
    dataset = dataset,
    batch_size = 4,
    grad_accum_every = 1,
    num_train_steps = 1
)
autoencoder_trainer.train(10, True)
max_length = max(len(d["faces"]) for d in dataset if "faces" in d)
max_seq = max_length * 6
print(max_length)
print(max_seq)
transformer = MeshTransformer(
    autoencoder,
    dim = 16,
    max_seq_len = max_seq,
    # condition_on_text = True
)

trainer = MeshTransformerTrainer(
    model = transformer,
    warmup_steps = 10,
    dataset = dataset,
    learning_rate = 1e-2,
    batch_size = 2,
    grad_accum_every = 1,
    num_train_steps = 1
)
trainer.train(10)
These are my current settings, which give 200 steps. The outlined mesh is the output. You can see my code in the pull request.
run = wandb.init(
    project="meshgpt-pytorch",
    config={
        "learning_rate": 1e-2,
        "architecture": "MeshGPT",
        "dataset": dataset_directory,
        "num_train_steps": 200,
        "warmup_steps": 1,
        "batch_size": 4,
        "grad_accum_every": 1,
        "checkpoint_every": 20,
        "device": str(device),
        "autoencoder": {
            "dim": 512,
            "encoder_depth": 6,
            "decoder_depth": 6,
            "num_discrete_coors": 128,
        },
        "dataset_size": dataset.__len__(),
    }
)
You are right that I should ensure that we're within unit square distance and do fewer augmentations though.
I think that generating two objects is causing some issues; try using a single box.
I tried your s_bed_full.glb file and the result was pretty good, though it's not very smooth. Probably a better result with data augmentation. The right side is the generated one.
https://imgsli.com/ is very good for image comparisons.
Writing down an idea. It should be possible to go over the 10 million 3d item set and find a small set of items in a small set of classes similar to the paper and label them manually (like via path name).
Training on 10 million might be overkill, and going over 28 000 shapes might cost a bit too much $$$. ShapeNet has 50k 3D models, with almost a paragraph of description text each.
Renting an A100 at $0.79 per hour:
Training the encoder on 2x A100 for 2 days: $75.84
Training the transformer on 4x A100 for 5 days: $379
However, the H100 promises good performance, but at around $2-3 an hour.
Seems pretty good, but probably not for 3D models
I can't use ShapeNet, but I'm sure we can find 10 classes of 100 models, like ShapeNet, in that 10 million dataset.
I think it's fine, there are many free sources; the trouble might be finding a dataset with descriptions. But that is in the future, and I think someone can get access from ShapeNet. The bigger issue is the GPU bill, but Phil/lucidrains might be able to improve the models so much that the training time goes down dramatically.
But after the model is trained, inference will be a big issue for users; if it's going to generate complex 3D models, it might not work on consumer hardware. But the recent performance boost is a good sign that the performance and efficiency are on the right track.
https://github.com/timzhang642/3D-Machine-Learning#3d_models
I want to mention that getting the indices in the right order, and making sure the meshes fit in the box and are not inside out, are problems too.
If you're interested in training the head, it's in the dataset. I can't get the autoencoder below 0.5 loss.
How many examples/steps of the same 3D mesh did you train it on? I trained for 10-20 epochs @ 2000 examples and got 0.19 loss. I think you are training on too few examples; the model needs massive amounts of data. And if you do data augmentation you'll need even more data, maybe 30-40 epochs or more.
I was able to generate a pretty good 3D mesh; it's not as smooth, but a very good result for such a small amount of training data. The transformer & encoder aren't good at generalizing with little training data, but that will resolve itself when training with much more data.
3D mesh: https://file.io/6JIueypFnRyT
I was using the wrong strategy. You were using many same copies of the mesh and then some augments. I was doing the opposite.
I might have worded that badly, but no, I'm using the same model without any augmentations. But train for 10/20 epochs @ 2000 items per dataset and let me know. Kaggle has some awesome free GPUs.
Here's how I interpreted it:
1. model * multiple
2. model * multiple * augments
You were doing 2000 (same) x 1 x 1.
I was trying 1 x 2000 (augmented) x 1.
Thanks for telling me! I'm trying your suggestion.
No problem. I posted this in another issue but I think this might help you: according to the paper they sort the vertices in z-y-x order, then sort the faces by their lowest vertex index.
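A rough sketch of that ordering (my reading of the paper; treat the exact axis convention as an assumption): sort vertices by z, then y, then x, remap the face indices, then sort faces by their lowest vertex index.

import numpy as np

def sort_mesh(vertices, faces):
    # vertices: (N, 3) floats, faces: (M, 3) integer indices
    # np.lexsort uses the last key as primary, so this is z, then y, then x
    order = np.lexsort((vertices[:, 0], vertices[:, 1], vertices[:, 2]))
    vertices = vertices[order]
    # Remap old vertex indices -> new indices after the sort
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))
    faces = remap[faces]
    # Sort faces by their lowest vertex index
    faces = faces[np.argsort(faces.min(axis=1))]
    return vertices, faces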
Also, I'm currently training on about 6 3D mesh chairs. Each chair has 3 augmentation versions and each variation is duplicated 500 times, so each 3D mesh file has a total of 500 x 3 = 1500 examples.
The total is 12 000 examples.
To give you some idea of why you need to train for 2 days on two A100s, watch how slow the progress is (33 minutes in):
Epoch 1/20: 100%|██████████| 1125/1125 [03:29<00:00, 5.38it/s, loss=0.296]
Epoch 1 average loss: 0.7889469708336724
Epoch 2/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.52it/s, loss=0.307]
Epoch 2 average loss: 0.29623086002137927
Epoch 3/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.54it/s, loss=0.28]
Epoch 3 average loss: 0.2731376721594069
Epoch 4/20: 100%|██████████| 1125/1125 [03:22<00:00, 5.54it/s, loss=0.248]
Epoch 4 average loss: 0.25995001827345954
Epoch 5/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.54it/s, loss=0.239]
Epoch 5 average loss: 0.251056260228157
Epoch 6/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.53it/s, loss=0.217]
Epoch 6 average loss: 0.24529405222998726
Epoch 7/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.54it/s, loss=0.227]
Epoch 7 average loss: 0.24055371418264176
Epoch 8/20: 100%|██████████| 1125/1125 [03:22<00:00, 5.54it/s, loss=0.221]
Epoch 8 average loss: 0.23791699058479732
Epoch 9/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.54it/s, loss=0.245]
Epoch 9 average loss: 0.23742892943488228
Epoch 10/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.54it/s, loss=0.208]
Epoch 10 average loss: 0.23614923742082383
Epoch 11/20: 100%|██████████| 1125/1125 [03:23<00:00, 5.53it/s, loss=0.219]
Epoch 11 average loss: 0.23556399891111585
https://github.com/lucidrains/meshgpt-pytorch/issues/11#issuecomment-1856353929 was the verification of z-y-x order and sorting the faces by their lowest vertex index. Note that I am using a convention that gives me that result, like Y-Z-X, but it follows their requirement of being sorted vertically.
Oh, great :) I'm currently testing whether it helps to keep 50% of the 3D mesh examples full and give the rest a stepped number of faces, e.g. from 0 to max(faces). My idea is that when generating the 3D mesh, the embedder might freak out since it has never seen an input graph that is not full. I'll let you know how it goes.
One other tip might be to normalize the size and set everything on the ground. If I'm correct, the code below scales the vertices so the maximum absolute value is 1, then sets everything on the ground (minimum y = 0).
I'm limiting the size since I'm currently training on a few different chairs, and some of the chairs were huge like a building while others were "normal" size.
max_abs = np.max(np.abs(vertices))
vertices = vertices / max_abs
min_y = np.min(vertices[:, 1])
vertices[:, 1] -= min_y
@MarcusLoppe on your branch, can you add a feature where the first quit saves and the second quit actually quits? Then we can restart from a checkpoint.
I don't understand, can you clarify?
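For reference, a minimal sketch of that idea (first Ctrl-C saves a checkpoint, second Ctrl-C exits); trainer.save() here is an assumption, adjust to whatever checkpointing the trainers actually expose:

import signal, sys

quit_count = {"n": 0}

def on_interrupt(signum, frame):
    quit_count["n"] += 1
    if quit_count["n"] == 1:
        # First quit: save a checkpoint and keep training
        # `trainer` is the MeshAutoencoderTrainer / MeshTransformerTrainer instance from above
        trainer.save("checkpoint.pt")   # assumed save method
        print("Checkpoint saved, press Ctrl-C again to quit")
    else:
        # Second quit: actually exit; restart later from checkpoint.pt
        sys.exit(0)

signal.signal(signal.SIGINT, on_interrupt)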
This is my current result.
I'll retype the last message in a bit.
output.log See also https://wandb.ai/ernest-lee/meshgpt-pytorch/runs/2fkwahjc/overview
I see that the dataset size is 10. For training efficiency I just duplicate the one model 2000 times, since I think it trains faster when dealing with bigger loads. Since you are using a 3090 you can probably up the batch size to 8 or 16. The only reason I had the batch size at 1 or 4 was VRAM constraints, but the encoder & transformer are now much more memory efficient.
The learning rate seems a bit high; for the encoder I used 1e-3 (0.001) and for the transformer I used 1e-2 (0.01). When the loss becomes quite low for the transformer you can try using a lower learning rate such as 1e-3.
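The duplication is literally just repeating the same dict before wrapping it in the dataset, e.g. (a minimal sketch using the MeshDataset and obj_data from earlier):

# obj_data = {"texts": "chair", "vertices": vertices, "faces": faces}
dataset = MeshDataset([obj_data] * 2000)   # 2000 identical copies of the single mesh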
https://wandb.ai/ernest-lee/meshgpt-pytorch/runs/9b8k9mfc/overview?workspace=user-ernest-lee
I have some bugs, but this is really promising.
I had to recode my face-index ascending-order regularization strategy.
The one with clipped ears is the meshgpt output.
I see that the dataset size is 10. For training efficiency I just duplicate the one model 2000 times, since I think it trains faster when dealing with bigger loads.
Instead of duplicating the model, I multiply the epoch by n, but according to the graph the training flattens so I stop early.
I broke the counter-clockwise triangle order, but it's invisible in this shot.
That seems very good. I see that you increased num_discrete_coors to 256, did that help? Seems like that would smooth out the errors / give it a higher error margin, so even if it's wrong it looks smoother.
What kind of augmentation are you doing? Are you applying all the augmentations including the rotation? I'm a bit unsure about the rotation one since neither MeshGPT nor PolyGen mentions it, only scaling & jitter.
Is there any reason why you are adding 2 extra tokens as padding?
seq_len = dataset.get_max_face_count() * 3
seq_len = ((seq_len + 2) // 3) * 3
Here are my current augmentations. It's in the git:
def augment_mesh(self, base_mesh, augment_count, augment_idx):
    # Set the random seed for reproducibility
    random.seed(self.seed + augment_count * augment_idx + augment_idx)
    # Generate a random scale factor
    scale = random.uniform(0.8, 1)
    vertices = base_mesh[0]
    # Calculate the centroid of the object
    centroid = [
        sum(vertex[i] for vertex in vertices) / len(vertices) for i in range(3)
    ]
    # Translate the vertices so that the centroid is at the origin
    translated_vertices = [[v[i] - centroid[i] for i in range(3)] for v in vertices]
    # Scale the translated vertices
    scaled_vertices = [
        [v[i] * scale for i in range(3)] for v in translated_vertices
    ]
    # Generate a random rotation matrix
    rotation = R.from_euler("y", random.uniform(-180, 180), degrees=True)
    # Apply the transformations to each vertex of the object
    new_vertices = [
        (np.dot(rotation.as_matrix(), np.array(v))).tolist()
        for v in scaled_vertices
    ]
    # Translate the vertices back so that the centroid is at its original position
    final_vertices = [[v[i] + centroid[i] for i in range(3)] for v in new_vertices]
    # Normalize uniformly to fill [-1, 1]
    min_vals = np.min(final_vertices, axis=0)
    max_vals = np.max(final_vertices, axis=0)
    # Calculate the maximum absolute value among all vertices
    max_abs_val = max(np.max(np.abs(min_vals)), np.max(np.abs(max_vals)))
    # Calculate the scale factor as the reciprocal of the maximum absolute value
    scale_factor = 1 / max_abs_val if max_abs_val != 0 else 1
    # Apply the normalization
    final_vertices = [
        [(component - c) * scale_factor for component, c in zip(v, centroid)]
        for v in final_vertices
    ]
    return (
        torch.from_numpy(np.array(final_vertices, dtype=np.float32)),
        base_mesh[1],
    )
Is there any reason why you are adding 2 extra tokens as padding?
The generated tokens length needs to be a multiple of 3.
I see that you increased the num_discrete_coors to 256
To be honest, I think this only affects the quantization loss from the discretization of the mesh vertex positions.
I don't think it matters, but I haven't tested it.
Is there any reason why you are adding 2 extra tokens as padding?
The generated tokens length needs to be a multiple of 3.
It should be 6 since 1 face = 6 tokens.
I see that you increased the num_discrete_coors to 256
To be honest, I think this only affects the quantization loss from the discretization of the mesh vertex positions.
I don't think it matters, but I haven't tested it.
It should make it smoother, since if it guesses the wrong class, with 128 vs 256 classes the step values might be 0.20 vs 0.10, and the 0.10 error will be less visible.
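If the 6-tokens-per-face reading is right (3 vertices x 2 codes each), the seq_len snippet above would presumably become something like this (same get_max_face_count() helper as before; just a sketch):

# 6 tokens per face instead of 3, rounded up to a multiple of 6
seq_len = dataset.get_max_face_count() * 6   # already a multiple of 6
seq_len = ((seq_len + 5) // 6) * 6           # generic round-up, kept for safety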
https://wandb.ai/ernest-lee/meshgpt-pytorch/runs/dn4mqfoj/overview?workspace=user-ernest-lee [Edited]
Training a single mesh seems to be going pretty well / is solved; have you tried using the texts & multiple meshes? Try with just 2-3 meshes and see how it goes; it's very slow to train the transformer with more than one mesh.
I'm guessing that you resolved the issue with the mesh getting cut off? I just scale it to fit -0.95 to +0.95; it seems like there are some issues when the mesh goes above 1.0.
Also, I was granted access to the ShapeNet v2 dataset on Hugging Face; you can probably get access as well.
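A minimal sketch of that -0.95 to +0.95 scaling (assumes the mesh is already roughly centered around the origin):

import numpy as np

def fit_to_range(vertices, limit=0.95):
    # Scale so the largest coordinate magnitude is at most `limit`
    max_abs = np.max(np.abs(vertices))
    return vertices * (limit / max_abs) if max_abs > 0 else vertices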
I was able to train the transformer to use 1172 faces.
mesh_transforms_humanoid_avatar.zip
I respect the MIT, Apache-2 and CC-BY licenses and so have a reason not to use ShapeNet.
https://github.com/V-Sekai-fire/meshgpt-pytorch/commit/a416837e5d9fadd4092f2f491886ee4019d31001
https://wandb.ai/ernest-lee/meshgpt-pytorch/runs/rp8nbw7w?workspace=user-ernest-lee Some logs.
Duration: 1h 13m 26s
upbeat-waterfall-437-618dbfb6d54f78d191f293a55a0c9e7a41147541.json
Training a single mesh seems to be going pretty well / is solved; have you tried using the texts & multiple meshes? Try with just 2-3 meshes and see how it goes; it's very slow to train the transformer with more than one mesh.
I want to do that after a break. Any suggestions? I was thinking of having one human in multiple poses, but different objects are doable too.
I think it's fine to train while testing, since it's not for any commercial purpose, just pure testing that won't be touched by anyone else.
One benefit of using ShapeNet is they have nice labels and not just categories like "chair". Examples: "name": "easy chair,lounge chair,overstuffed chair", "name": "water faucet,water tap,tap,hydrant", "name": "ladder-back,ladder-back chair"
I want to do that after a break. Any suggestions? I was thinking of having one human in multiple poses, but different objects are doable too.
Yes, use very low-face meshes, since text conditioning makes the training much harder.
Using a dataset of 2 chairs with 5000 examples (2 meshes, 5 augmentations x 500) I got the encoder to 0.2 loss after 2 epochs, but the transformer is at 0.001695 loss after 40 epochs and has taken 2 hours.
@MarcusLoppe I'm pretty sure you can use blip to categorize photos of the mesh so that's not a blocker. https://replicate.com/gfodor/instructblip
Someone wanted me to try https://www.kenney.nl/assets/castle-kit. So I'll need to generate labels for them, but it should work.
@MarcusLoppe I'm pretty sure you can use blip to categorize photos of the mesh so that's not a blocker. https://replicate.com/gfodor/instructblip
Well, the downside is that you'll use Blender to take a screenshot with a default camera, and since models vary in orientation/vertical axis you might take a snapshot of the back/bottom of the object. Why complicate it? :)
Someone wanted me to try https://www.kenney.nl/assets/castle-kit. So I'll need to generate labels for them, but it should work.
Try walking before running :) I've been trying to tell you that you need a massive amount of data and training time to actually create a good enough model for that. Currently you have been overfitting a model with a very small sample of data. The harder part is when you want to create a general model that can generate general items.
I've been successful at overfitting it using text + 1 single model for around 40 epochs at 2000 examples per epoch. If I use two models that are the same type of object, e.g. chair, it fails massively.
If you want to give it a go, use only the models with fewer than 500-600 faces, then create 10-20 augmentations per model and duplicate each variation 200 times. If you want to train it using 40 objects, that's 10 x 200 x 40 = 80 000 examples per dataset.
Then train on this for a day or two and then try to generate using the texts.
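A rough sketch of building that kind of dataset (augment_mesh here is a hypothetical one-argument augment function and the numbers are the ones above; the point is augment first, then duplicate each variation):

# meshes: list of (label, vertices, faces) tuples, each mesh under ~500-600 faces
dataset_items = []
for label, vertices, faces in meshes:            # e.g. 40 objects
    for aug_idx in range(10):                    # 10 augmentations per model
        aug_vertices = augment_mesh(vertices)    # hypothetical augment function
        item = {"texts": label, "vertices": aug_vertices, "faces": faces}
        dataset_items.extend([item] * 200)       # duplicate each variation 200 times

dataset = MeshDataset(dataset_items)             # 40 * 10 * 200 = 80 000 examples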
In the PolyGen and MeshGPT papers they stress that they didn't have enough training data and used only 28 000 mesh models. They needed to augment those with, let's say, 20 augments, which means that they trained on 560 000 mesh models. Since they only did autocomplete, it makes the generation much easier than using texts.
In the paper they used 28 000 3D models. Let's say they generate 10 augmentations per model and then use 10 duplicates, since it's more effective to train a model with a big batch size of 64; when you are using a small number of models per dataset it will not train effectively and you will waste the parallelism of the GPUs. This means: 10 x 10 = 100 examples per model, and 100 x 28 000 = 2 800 000 examples.
I want to stress this: overfitting a model = super easy. Training a model to be general enough for many different models = hard.
From @MarcusLoppe
But since it seems like you are not using the texts, you can try to feed the transformer a prompt of 10-30 connected faces of a model and see what happens (like in the paper); it should act as an autocomplete.
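A very rough sketch of what that prompt-style autocomplete might look like; both the tokenize call and the prompt keyword on generate are assumptions about the API, so double-check the actual method names and signatures in the repo:

import torch

# Tokenize a partial mesh with the trained autoencoder, then let the
# transformer continue it (both calls are assumptions, not confirmed API).
with torch.no_grad():
    prompt_codes = autoencoder.tokenize(
        vertices = vertices,     # (1, nv, 3) tensor of the partial mesh
        faces = faces[:, :30]    # keep only the first ~30 faces as the prompt
    )
    completed = transformer.generate(prompt = prompt_codes)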
In https://github.com/lucidrains/meshgpt-pytorch/pull/6
For each mesh I generate augments_per_item (like 200), then I use it to index into the dataset.
Using a seed, I augment with this strategy.
What do you think?
The goal is for a chair item to be rotated, moved or scaled, but upright.
Edited:
The idea is to have a chair be displaced but under gravity so it keeps its lowest vertex position.
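A minimal sketch of that indexing scheme (dataset length = meshes x augments_per_item, and the seed makes augmentation i of mesh j deterministic); the class name and the simple scale-plus-ground augment are placeholders:

import random
import numpy as np
import torch
from torch.utils.data import Dataset

class AugmentedMeshDataset(Dataset):
    def __init__(self, base_meshes, augments_per_item=200, seed=42):
        self.base_meshes = base_meshes              # list of (vertices, faces) pairs
        self.augments_per_item = augments_per_item
        self.seed = seed

    def __len__(self):
        return len(self.base_meshes) * self.augments_per_item

    def __getitem__(self, idx):
        mesh_idx, augment_idx = divmod(idx, self.augments_per_item)
        vertices, faces = self.base_meshes[mesh_idx]
        rng = random.Random(self.seed + idx)        # same idx -> same augmentation
        # placeholder augment: random uniform scale, kept upright
        v = np.asarray(vertices, dtype=np.float32) * rng.uniform(0.8, 1.0)
        v[:, 1] -= v[:, 1].min()                    # keep it resting on the ground
        return torch.from_numpy(v), faces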