lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch

Proper dataset that reduces VRAM usage and provides higher performance. #37

Open MarcusLoppe opened 10 months ago

MarcusLoppe commented 10 months ago

I've created a dataset class which will hopefully help beginners.

Features:

MarcusLoppe commented 10 months ago

@fire Hey, I think you can use this information. I don't think you have implemented this.

fire commented 10 months ago

Thanks! Happy holidays. I am still sleeping on how to do chunking as an autocomplete, because I don't have 10x the GPU RAM.

fire commented 10 months ago

I think it needs the concept of mesh-to-mesh and the idea that you can localize the input.

MarcusLoppe commented 10 months ago

> Thanks! Happy holidays. I am still sleeping on how to do chunking as an autocomplete, because I don't have 10x the GPU RAM.

Happy holidays 🎉 But VRAM usage should be a lot lower if you at least preprocess the face edges. I see that you are calling derive_face_edges_from_faces in getitem but not storing the results, so you need to derive the face edges at each step, which might cost a couple of GBs of VRAM.

Do you mean chunking this? I think it's the T5 encoder that embeds these; the VRAM usage shouldn't be too high.


    def embed_texts(self, transformer: MeshTransformer):
        # Embed each unique text prompt only once, then map the
        # embeddings back onto every item that shares a prompt.
        unique_texts = list(set(item['texts'] for item in self.data))

        text_embeddings = transformer.embed_texts(unique_texts)
        print(f"[MeshDataset] Generated {len(text_embeddings)} text_embeddings")
        # the list() above fixes the iteration order, so zip pairs correctly
        text_embedding_dict = dict(zip(unique_texts, text_embeddings))

        for item in self.data:
            text_value = item['texts']
            item['text_embeds'] = text_embedding_dict.get(text_value, None)
            del item['texts']  # the raw text is no longer needed
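
The face edges mentioned above can be precomputed the same way. A minimal sketch; derive_face_edges_from_faces is the repo's function, while the import path, method name and storage layout here are my assumption:

    from meshgpt_pytorch.data import derive_face_edges_from_faces

    def generate_face_edges(self):
        # One-time pass: derive and store the face edges per item so
        # __getitem__ never has to recompute them during training.
        for item in self.data:
            if 'face_edges' in item:
                continue
            faces = item['faces']  # (num_faces, 3) long tensor
            item['face_edges'] = derive_face_edges_from_faces(faces)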

> I think it needs the concept of mesh-to-mesh and the idea that you can localize the input.

Could you clarify? Are you talking about the augmentations/data cleaning?

fire commented 10 months ago

The problem being solved is that I have a 70,000-triangle mesh and I can't process all of it at once.

We can process only about 10% of that, like the 7k-triangle portion of the mesh for the named subway car or the named feminine character.

fire commented 10 months ago

Here's a rewording of the problem.

Problem Context

The problem at hand involves processing a large triangle mesh with limited GPU RAM. This can be challenging as the size of the mesh may exceed the available memory, causing performance issues or even failure to process the mesh. The issue becomes more pronounced when dealing with 10x the input size.

Current Approach and its Limitations

Currently, I cache the derive_face_edges_from_faces function in getitem. While this approach works, it's not efficient because it still derives the face edges at each step, which can be computationally expensive and time-consuming. Moreover, the savings from caching might not be sufficient for larger inputs.

I am clarifying that the T5 embedding is not causing a problem.

Alternative Approach

Given the limitations of the current approach, an alternative could be to divide the mesh into smaller chunks and process each chunk separately. This way, you can handle larger meshes without exceeding your GPU RAM capacity.

This approach allows for the processing of only a portion of the triangle mesh at a time, effectively managing the use of GPU RAM. It should provide a scalable solution for handling larger inputs. However, we don't have a mesh to mesh workflow yet.
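
As a rough sketch of the chunking idea (names here are hypothetical; this naive split ignores connectivity, which is exactly the missing mesh-to-mesh part):

    def chunk_faces(faces, max_faces_per_chunk=1365):
        # Naive contiguous split: faces sharing vertices across a chunk
        # boundary lose their neighbours, so this is only a starting point.
        return [faces[i:i + max_faces_per_chunk]
                for i in range(0, len(faces), max_faces_per_chunk)]

    # a 70,000-triangle mesh -> ceil(70000 / 1365) = 52 chunks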

MarcusLoppe commented 10 months ago

> The problem being solved is that I have a 70,000-triangle mesh and I can't process all of it at once.
>
> We can process only about 10% of that, like the 7k-triangle portion of the mesh for the named subway car or the named feminine character.

Hmm, have you tested using the CPU only? Instead of keeping the faces on the GPU, you can move them onto the CPU, since the computer might have more RAM than the GPU. If the computer's RAM isn't enough, it will use the virtual RAM, which might be quite slow though.

Does this issue only occur when you generate the face edges? Or are you running out of memory when training the autoencoder too? If so, you might want to check out replacing the autoencoder with MeshDiscretizer, since it only discretizes and involves no machine-learning model.

fire commented 10 months ago

Hi Marcus,

You're correct in your understanding of the problem. Due to the size of the triangle mesh, we can only process about 10% of it at a time, such as the 7k portion of the subway car or the feminine character.

I have indeed tested using only the CPU for processing. While this approach works because I have close to 200GB of CPU RAM, it significantly slows down the transformer stage. As you mentioned, if the computer RAM isn't enough, it will use the virtual RAM which is quite slow.

The issue primarily occurs when training the mesh transformer. The autoencoder stage doesn't seem to require very large inputs, but rather a variety of inputs. Therefore, replacing the autoencoder with MeshDiscretizer might not be necessary in this case.

Thank you for your suggestions and insights. They are greatly appreciated as we continue to work on optimizing this process.

MarcusLoppe commented 10 months ago

> Hi Marcus,
>
> You're correct in your understanding of the problem. Due to the size of the triangle mesh, we can only process about 10% of it at a time, such as the 7k portion of the subway car or the feminine character.

I see your problem; derive_face_edges_from_faces is optimized for speed since it doesn't loop through the edges but processes the whole dimension at once. The current method can't really chunk the process, since it needs to check all faces for matches, and what if a connected face is in another chunk? It should be able to run with a lower amount of memory; I can try and give it a shot using dicts, but it will be slower since there's no GPU with parallel processing. Or maybe by splitting the mesh into octrees or traversing the mesh.
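
For reference, a minimal CPU-only, dict-based sketch (my own naming; here two faces count as connected when they share an edge, i.e. two vertices):

    from collections import defaultdict

    def face_edges_cpu(faces):
        # faces: iterable of (v0, v1, v2) vertex index triples.
        # Map every undirected vertex-pair edge to the faces using it.
        edge_to_faces = defaultdict(list)
        for fi, (a, b, c) in enumerate(faces):
            for u, v in ((a, b), (b, c), (c, a)):
                edge_to_faces[tuple(sorted((u, v)))].append(fi)

        # Faces sharing an edge are neighbours; no (F x F) tensor needed.
        face_edges = set()
        for face_ids in edge_to_faces.values():
            for i in range(len(face_ids)):
                for j in range(i + 1, len(face_ids)):
                    face_edges.add((face_ids[i], face_ids[j]))
        return sorted(face_edges)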

> I have indeed tested using only the CPU for processing. While this approach works because I have close to 200GB of CPU RAM, it significantly slows down the transformer stage. As you mentioned, if the computer RAM isn't enough, it will use the virtual RAM which is quite slow.

Hmm, well, since you can store the face edges on disk, this is something you'll only need to do once, and only once per 3D model, since the augmented versions still use the same faces.

> The issue primarily occurs when training the mesh transformer. The autoencoder stage doesn't seem to require very large inputs, but rather a variety of inputs. Therefore, replacing the autoencoder with MeshDiscretizer might not be necessary in this case.

> Thank you for your suggestions and insights. They are greatly appreciated as we continue to work on optimizing this process.

I'm not 100% sure what happens, but the transformer has a max token length; I'm not sure what happens when a mesh exceeds 8192 tokens, e.g. more than 1365 triangles (8192 / 6 tokens per triangle).

Have you seen any difference when training with meshes of more than 1365 triangles vs meshes with fewer than 1300 triangles?
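
The arithmetic behind those numbers:

    max_seq_len = 8192                  # transformer token budget
    tokens_per_triangle = 6             # per this thread's assumption
    max_triangles = max_seq_len // tokens_per_triangle
    print(max_triangles)                # -> 1365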

MarcusLoppe commented 10 months ago

> Here's a rewording of the problem.
>
> Problem Context
>
> The problem at hand involves processing a large triangle mesh with limited GPU RAM. This can be challenging as the size of the mesh may exceed the available memory, causing performance issues or even failure to process the mesh. The issue becomes more pronounced when dealing with 10x the input size.
>
> Current Approach and its Limitations
>
> Currently, I cache the derive_face_edges_from_faces function in getitem. While this approach works, it's not efficient because it still derives the face edges at each step, which can be computationally expensive and time-consuming. Moreover, the savings from caching might not be sufficient for larger inputs.

Have you tried not using the cache function? I'm not familiar with lru_cache, and I'm not sure whether it copies or references the data, or how it behaves when the data is on another device (e.g. CPU/GPU).

Best practice would be to generate them beforehand: if you take a 3D model and augment it 100 times, the face edges won't change after augmentation, so you only need to store them in VRAM once and reference them, instead of storing 100 copies in the cache.

Also, if you are using a batch size of more than 1, you are calling derive_face_edges_from_faces with not just one mesh; it will process the entire batch at once, e.g. 16 x 7000 x 3 = a lot of data.
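
A sketch of that layout (augment is a hypothetical jitter/scale function; the point is that all 100 copies reference one face_edges tensor):

    from meshgpt_pytorch.data import derive_face_edges_from_faces  # assumed path

    face_edges = derive_face_edges_from_faces(faces)  # computed once per model

    data = []
    for _ in range(100):
        data.append({
            'vertices': augment(vertices),  # hypothetical augmentation fn
            'faces': faces,                 # connectivity never changes
            'face_edges': face_edges,       # same tensor referenced 100x
        })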

fire commented 10 months ago

I am trying a different approach.

I added a commit that uses KDTree from the scipy.spatial module to improve the efficiency of nearest neighbor search in the MeshDataset class. The KDTree is used to extract a subset of faces based on their proximity to a randomly generated point within the bounding box of the mesh. This subset is then used to create new vertices and faces for augmentation. Additionally, the maximum number of faces allowed in a mesh has been set to 500.

I have now set it to 1365
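
Roughly, the extraction looks like this (a sketch of the idea, not the exact commit; scipy.spatial.KDTree is the real class, the helper below is illustrative):

    import numpy as np
    from scipy.spatial import KDTree

    def sample_face_subset(vertices, faces, max_faces=1365):
        # Build a KDTree over face centroids, then keep the faces
        # closest to a random point inside the mesh's bounding box.
        centroids = vertices[faces].mean(axis=1)            # (F, 3)
        tree = KDTree(centroids)
        point = np.random.uniform(vertices.min(axis=0), vertices.max(axis=0))
        _, idx = tree.query(point, k=min(max_faces, len(faces)))
        subset = faces[np.sort(idx)]
        # Reindex vertices so the subset mesh is self-contained.
        used, inverse = np.unique(subset, return_inverse=True)
        return vertices[used], inverse.reshape(subset.shape)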

Later today I can try implementing the way you suggested.

MarcusLoppe commented 10 months ago

> I am trying a different approach.
>
> I added a commit that uses KDTree from the scipy.spatial module to improve the efficiency of nearest neighbor search in the MeshDataset class. The KDTree is used to extract a subset of faces based on their proximity to a randomly generated point within the bounding box of the mesh. This subset is then used to create new vertices and faces for augmentation. Additionally, the maximum number of faces allowed in a mesh has been set to 500.
>
> I have now set it to 1365

Hmm, that seems maybe unnecessary.

I did some tests with a 6206-face model. If I created the face edges using generate_face_edges(), this consumed 2.2GB of VRAM. I then saved the dataset and restarted the session; after loading the dataset from disk, the VRAM usage was just 592 MB. I ran torch.cuda.empty_cache() after generate_face_edges to clear some garbage, but it was still at that usage.
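
The flow, as a sketch (generate_face_edges is the dataset method mentioned above; torch.save/torch.load persistence is my assumption of how it was done):

    import torch

    dataset.generate_face_edges()           # one-time cost (~2.2 GB VRAM here)
    torch.save(dataset.data, 'dataset.pt')  # persist the processed items

    # new session: load from disk instead of re-deriving (~592 MB here)
    dataset.data = torch.load('dataset.pt')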

> Later today I can try implementing the way you suggested.

Good. I think it's much more efficient if you get the face edges and store one array per model, instead of creating and storing hundreds of copies of the same array due to augmentation.

MarcusLoppe commented 10 months ago

Hi @lucidrains

I think it's time for a dataset class, since people are not training properly. Currently I see misconceptions about how face edges, codes and text embeddings are generated during training.

If you don't preprocess the dataset: each time it generates the data and moves on to the next batch, the data will be deleted, since it's stored on a temporary object and will never be stored or used again. This is because the dataloader doesn't expose the real data but a copy of it.

This forces the model to generate face_edges, codes and text embeddings at each step, which increases VRAM usage and the time required to generate this data. This isn't a small amount but rather a big one; for example, face_edges can require 4-96GB of VRAM.

Multiple people don't quite understand this, and they don't understand why they have such high VRAM usage.

Since the VRAM usage is linear, it uses about 12 MB per face at a batch size of 64. With a face count of 6000, it will use 75GB to generate the face_edges and tokenize the data, all of which can be pre-generated without any significant VRAM usage.
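
A minimal sketch of the intended pattern (hypothetical class; it assumes all expensive fields were precomputed as described above):

    from torch.utils.data import Dataset

    class PrecomputedMeshDataset(Dataset):
        def __init__(self, data):
            # each item already holds faces, face_edges and text_embeds
            self.data = data

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            # no per-step derivation: just return the stored tensors,
            # so training adds no extra VRAM for preprocessing
            return self.data[idx]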


lucidrains commented 10 months ago

yup, will look into this Sunday!