the-crypt-keeper / can-ai-code

Self-evaluating interview for AI coders
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
MIT License

Evaluate ajibawa-2023/Code-290k-13B #162

Closed: ajinkya123-robo closed this issue 7 months ago

ajinkya123-robo commented 8 months ago

Large Language Models (LLMs) are good at code generation, but they sometimes make mistakes. What if they could also give a detailed explanation along with the code? That is what I have tried here. The base Llama-2 model was used for training. It was trained on around 290,000 code sets, each set containing 2 conversations. Code in Python, Java, JavaScript, Go, C++, Rust, Ruby, SQL, MySQL, R, Julia, Haskell, etc., together with detailed explanations, was used for training. The entire dataset was trained on 4 x A100 80GB GPUs; training for 3 epochs took 165 hours. This is a full fine-tune of Meta's Llama-2, and the conversations are in Vicuna/ShareGPT format. Link: https://huggingface.co/ajibawa-2023/Code-290k-13B Thank you very much for evaluating my earlier models.
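For reference, a minimal sketch of what one training sample in Vicuna/ShareGPT format might look like. The field names follow the common ShareGPT convention and the sample content is hypothetical; the actual dataset schema may differ.

```python
# Hypothetical sketch of a single code-with-explanation training sample
# in ShareGPT-style format (one "human" turn, one "gpt" turn).
import json

sample = {
    "id": "example-0001",  # hypothetical identifier
    "conversations": [
        {
            "from": "human",
            "value": "Write a Python function that checks whether a string is a palindrome."
        },
        {
            "from": "gpt",
            "value": (
                "def is_palindrome(s: str) -> bool:\n"
                "    s = s.lower()\n"
                "    return s == s[::-1]\n\n"
                "Explanation: the string is lowercased, then compared with its reverse; "
                "if the two match, it reads the same forwards and backwards."
            )
        }
    ]
}

print(json.dumps(sample, indent=2))
```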

the-crypt-keeper commented 7 months ago

@ajinkya123-robo Solid performance on junior-v2, but it wasn't able to complete the complex tasks in the new senior test.

ajinkya123-robo commented 7 months ago

Hello @the-crypt-keeper, thank you very much for quickly evaluating my model. I am surprised that it did poorly on the senior test. I hope the difference in prompt template (Vicuna-Code vs. Vicuna-1p3-v2a/b) is not the reason. Thank you once again!

the-crypt-keeper commented 7 months ago

@ajinkya123-robo I tried the v2a/v2b prompts as well as loosening the sampling, but it only made things worse. I don't think it's your finetune; it's just that Llama-2 isn't a great base for advanced coding tasks. Have you considered trying this dataset on top of deepseek-6.7b?
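For context, "loosening the sampling" refers to generating with a higher temperature and top-p so the model explores more candidate tokens. A rough illustrative sketch using the Hugging Face transformers API is below; the parameter values and prompt are assumptions, not the exact settings used in the evaluation above.

```python
# Illustrative only: compare tighter vs. loosened sampling settings when
# generating a code completion from the model discussed in this thread.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ajibawa-2023/Code-290k-13B"  # model under discussion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "USER: Write a Python function that reverses a linked list.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# (temperature, top_p): first pair is close to greedy, second is "loosened".
for temperature, top_p in [(0.2, 0.9), (0.8, 0.95)]:
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=256,
    )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    print(f"--- temperature={temperature}, top_p={top_p} ---")
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```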

ajinkya123-robo commented 7 months ago

OK, I will try fine-tuning with deepseek-6.7b. Thanks for your time and effort.