the-crypt-keeper / can-ai-code

Self-evaluating interview for AI coders
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
MIT License

Investigate problems with ctranslate2 #75

Open SebastianBodza opened 1 year ago

SebastianBodza commented 1 year ago

Is it possible to investigate the problems with ctranslate2 in more detail? The library is one of the fastest and supports token streaming. Unfortunately, token streaming is not possible with beam search, and the performance there is quite bad :/

Is there any way to run the interview locally?

P.S. In the README, cformers2 should be ctranslate2.

SebastianBodza commented 1 year ago

Shouldn't the parameter for the beam size in def generate(self, prompt, params) be beam_size instead of num_hypotheses? See https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html#ctranslate2.Generator.generate_batch

the-crypt-keeper commented 1 year ago

@SebastianBodza I really need to do a better job on the README. Yes, you can run everything locally.

Create the prompts with prepare.py --template prompts/Wizard-Coder.txt, which should print Expanded 28 Wizard-Coder prompts to results/prepare_junior-v2_python-javascript_Wizard-Coder.ndjson

Now you can run the ctranslate2 (why does my brain refuse to remember this correctly ugh) interview with:

./interview_cuda.py --runtime ctranslate2 --model_name michaelfeil/ct2fast-WizardCoder-15B-V1.0 --params params/wizardcoder.json --input results/prepare_junior-v2_python-javascript_Wizard-Coder.ndjson

This will download the model from HF if it's not already cached. My initial observation when implementing this runtime in #62 was that if you try params/precise.json instead of params/wizardcoder.json, the results are very different from what every other runtime produces with those settings (and not very good).
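For reference, the ctranslate2 path boils down to tokenizing the prompt into token strings and calling Generator.generate_batch. Here is a minimal sketch of that flow; the function name and parameter defaults are illustrative, not the exact interview_cuda.py code (imports are kept inside the function so the sketch reads standalone):

```python
def ct2_generate(model_dir, prompt, max_length=512, temperature=0.7, top_k=40):
    """Illustrative sketch of a CTranslate2 generation call, not the actual
    interview_cuda.py implementation. Requires: pip install ctranslate2
    transformers, plus a converted model directory (e.g. a ct2fast repo)."""
    import ctranslate2
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
    generator = ctranslate2.Generator(model_dir, device="cuda")

    # CTranslate2 takes token *strings*, not token ids.
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

    results = generator.generate_batch(
        [start_tokens],
        max_length=max_length,
        sampling_temperature=temperature,
        sampling_topk=top_k,
        include_prompt_in_result=False,
    )
    return tokenizer.decode(results[0].sequences_ids[0])
```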

As to your second point: that's an interesting thought. There should be two parameters to beam search: one for the number of beams to consider and another for the size or length of those beams. When I first went through the docs I left with the impression that beam_size is the beam length and num_hypotheses is the number of beams, but now I'm not so sure; it's possible I got them backwards. Like you mentioned, beam search is slow (each beam is effectively its own inference stream), so I tend to stick to simpler sampling for evaluations because of resource constraints.
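For what it's worth, my reading of the CTranslate2 docs is that beam_size is the number of beams explored during search (the beam width) and num_hypotheses is how many finished hypotheses are returned, which should not exceed beam_size. A small pure-Python sketch of mapping eval params onto generate_batch kwargs (the keys on the params side are hypothetical, not this repo's exact schema):

```python
def beam_kwargs(params):
    """Map hypothetical eval params to CTranslate2 generate_batch kwargs.
    Per the CTranslate2 docs: beam_size = number of beams explored (width),
    num_hypotheses = number of hypotheses to return (<= beam_size)."""
    beam_size = params.get("num_beams", 1)
    num_hypotheses = params.get("num_return", 1)
    if num_hypotheses > beam_size:
        raise ValueError("num_hypotheses cannot exceed beam_size in beam search")
    return {"beam_size": beam_size, "num_hypotheses": num_hypotheses}

print(beam_kwargs({"num_beams": 4, "num_return": 2}))
# → {'beam_size': 4, 'num_hypotheses': 2}
```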

SebastianBodza commented 1 year ago

Thanks for the clarification! I ran some tests locally, and I think the problem is related to the repetition penalty. Without repetition_penalty and repeat_last_n I get:

Python Passed 85 of 91
JavaScript Passed 75 of 91

However, it also seems to be a bit unstable. Another run with the same settings:

Python Passed 88 of 91
JavaScript Passed 82 of 91

Regarding beam_size, I think you are right; num_hypotheses should be correct.

the-crypt-keeper commented 1 year ago

@SebastianBodza Yes, something seems to be wrong with the implementation of the repeat penalty in this runtime, but I haven't yet dived into the code to see what's up. This isn't normally a complex operation.

If you want to test the repeat penalty on something that should otherwise be stable, that's the goal of params/greedy.json.

the-crypt-keeper commented 1 year ago

I've implemented batching and basic stop-seq support for this runtime, but batching seems to only make the instability problems here worse :/

I wonder if upstream issue #1425 is related and we have some unstable-sort-related issues happening here.

guillaumekln commented 1 year ago

Hi,

The issue related to the callback in batch mode should be fixed in ctranslate2>=3.19.0. The returned batch_ids were mixed up.

However, I'm not sure what the issue with the repetition penalty is. For now, I suggest forcing this value to 1 for CTranslate2 if that works for you. In general, a repetition penalty should not be needed when using a random sampler.
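That workaround could be sketched as a small parameter sanitizer applied before calling the CTranslate2 runtime (this helper is hypothetical, not part of the repo):

```python
def sanitize_for_ct2(params):
    """Hypothetical helper implementing the suggested workaround: neutralize
    the repetition penalty for the CTranslate2 runtime (1.0 = no penalty)."""
    cleaned = dict(params)
    cleaned["repetition_penalty"] = 1.0  # force off until the upstream behavior is understood
    cleaned.pop("repeat_last_n", None)   # meaningless once the penalty is disabled
    return cleaned

print(sanitize_for_ct2({"temperature": 0.7, "repetition_penalty": 1.1}))
# → {'temperature': 0.7, 'repetition_penalty': 1.0}
```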

the-crypt-keeper commented 9 months ago

@guillaumekln I am having trouble with this runtime after upgrading my container to CUDA 12.1; it complains: RuntimeError: Library libcublas.so.11 is not found or cannot be loaded

Does ct2 only support CUDA 11 at this time?
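One quick way to see which cuBLAS major versions the system can actually load, independent of ct2 itself (CUDA 11 toolkits ship libcublas.so.11, CUDA 12 ships libcublas.so.12):

```python
import ctypes

# Diagnostic sketch: probe which libcublas major versions the dynamic
# linker can resolve in the current container.
status = {}
for name in ("libcublas.so.11", "libcublas.so.12"):
    try:
        ctypes.CDLL(name)
        status[name] = "loadable"
    except OSError:
        status[name] = "not found"

for name, state in status.items():
    print(name, state)
```

If only libcublas.so.12 is loadable, the installed ctranslate2 wheel was built against CUDA 11 and needs either a CUDA 11 base image or a build that targets CUDA 12.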