salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

How to use embedding model for semantic code search #137

Closed vardhan26 closed 11 months ago

vardhan26 commented 11 months ago

I am trying to use the CodeT5+ 110M embedding model for semantic code search, and I am new to both Hugging Face and PyTorch. While trying to generate embeddings for my code dataset, I am getting a CUDA out of memory error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 11.94 GiB already allocated; 0 bytes free; 11.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am using a Lenovo Legion 5 Pro laptop with an NVIDIA RTX 3060 6GB GPU. This is the code I'm using to generate the embeddings:

```python
test_codet5p_embeddings = []
max_length = 360
for text in test_func_embeddings:
    text_input = tokenizer(text, padding='max_length', truncation=True,
                           max_length=max_length, return_tensors="pt").to(device)
    embed = model(text_input.input_ids, attention_mask=text_input.attention_mask)
    test_codet5p_embeddings.append(embed)
```

yuewang-cuhk commented 11 months ago

Hi there, this is likely because the GPU memory is too small to accommodate the model. You can try using a smaller batch size, or find a GPU with more memory.
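A side note on the loop in the question: it stores the raw model output on the GPU each iteration, so the outputs (and, outside of `torch.no_grad()`, their autograd graphs) accumulate in GPU memory as the loop runs. A minimal, framework-agnostic sketch of the batched pattern is below; `embed_in_batches` and `embed_fn` are hypothetical names for illustration, and `embed_fn` stands in for the real tokenizer + model call, which in PyTorch should be wrapped in `torch.no_grad()` and moved off the GPU with `.cpu()` before being stored:

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_fn: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 8) -> List[List[float]]:
    """Embed texts in fixed-size batches so peak GPU memory stays bounded.

    embed_fn is a placeholder for the real call, roughly:
        with torch.no_grad():                      # don't build autograd graphs
            inputs = tokenizer(batch, padding=True, truncation=True,
                               return_tensors="pt").to(device)
            out = model(inputs.input_ids,
                        attention_mask=inputs.attention_mask)
        return out.cpu().tolist()                  # move results off the GPU
    """
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))         # only CPU copies are kept
    return embeddings

# Toy stand-in embedder: each "embedding" is just [len(text)]
dummy = lambda batch: [[float(len(t))] for t in batch]
vectors = embed_in_batches(["def f(): pass", "x = 1", "print(x)"],
                           dummy, batch_size=2)
```

Shrinking `batch_size` (or processing one text at a time, as in the question) only helps if each batch's output is detached and moved to the CPU before the next batch runs.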

vardhan26 commented 10 months ago

Thanks. It worked with a smaller batch size.