Open · olegklimov opened this issue 1 year ago
/bounty $2000
$2,000 bounty created by olegklimov
If you start working on this, comment /attempt #77 to notify everyone
To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
Before proceeding, please make sure you can receive payouts in your country
Payment arrives in your account 2-5 days after the bounty is rewarded
You keep 100% of the bounty award
Thank you for contributing to smallcloudai/refact!
Attempt | Started (GMT+0) | Solution
---|---|---
@Akshay-Patel-dev | Aug 25, 2023, 11:44:51 PM | WIP
@shobhit9957 | Aug 26, 2023, 10:38:57 AM | WIP
@benxh1995 | Sep 4, 2023, 11:51:23 PM | WIP
@ds5t5 | Sep 25, 2023, 1:52:54 AM | #122
/attempt #77
/attempt #77 Hey @olegklimov, I would like to contribute. Can you please provide some more description of this project? I'm a beginner here...
Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.
> I'm a beginner here...
You can start by installing it and trying it out.
But unless you are already familiar with CPU inference libraries and LLMs in general, it might take you quite a long time to research.
I forked the project and performed the steps in the contributing.md file, but I'm getting errors and am unable to run it locally.
Because of the error I encountered, I added install_requires=[ "triton>=12 0.0.3", ] to setup.py. Do you think adding this to the main branch is necessary?
CPU project names: ggml, ctransformers
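To make the ctransformers route concrete, a minimal sketch of CPU inference with it is below; the model file and model_type are placeholders (at this point in the thread no GGML/GGUF conversion of Refact-1.6B existed, so a StarCoder-family file stands in).

```python
# Minimal CPU-inference sketch with ctransformers; the model file below is a
# placeholder -- any GGML/GGUF StarCoder-family file can stand in, since no
# Refact-1.6B conversion existed yet at this point.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./starcoder-1b-q8_0.bin",   # hypothetical local GGML file
    model_type="starcoder",      # tells the ggml backend which architecture to use
)

# Stream tokens as they are generated.
for token in llm("def hello_world():\n", max_new_tokens=50, stream=True):
    print(token, end="", flush=True)
```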
/attempt #77
I've got a preliminary version working with ctransformers. Inference on my M1 Mac for Starcoder is almost impossibly slow. The Refact-1.6b model still doesn't have GGUF or GGML versions available, and my attempts to make my own quants with the official quantization scripts have failed.
I can have a codellama FIM 7B demo up and running soon.
Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.
An interesting link: https://github.com/ggerganov/llama.cpp/discussions/2948 -- how to convert HuggingFace model to GGUF format
Example of GGUFs of all sizes: https://huggingface.co/TheBloke/Llama-2-7B-GGUF
@olegklimov
If this is still open, I might try it out.
Would the bounty claim still count for model conversion to GGUF format?
I understand it's first come, first served. I'm just wondering whether you're looking for a conversion script or general CPU support.
Quantization is a bit different from CPU inference, and I'm just looking for clarity on the scope.
If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open.
Hi @teleprint-me
Someone is already doing the heavy lifting here: https://github.com/ggerganov/llama.cpp/issues/3061
@olegklimov
Yes, I saw that. That's why I'm asking.
I know that in order to do it, one would need to use the GGUF library to convert the tensors.
It would require a custom script, like the others that already exist in the llama.cpp repository.
Your original request was in reference to the inference_hf.py script, which is why I was asking for clarification.
@teleprint-me We are moving away from server-side scratchpads in favor of client-side scratchpads. The plugins that can do this should land next week or the week after. There still has to be a script that takes the tasks to do, using completions_wait_batch() (in inference_worker.py), and streams the results, but soon only simple left-to-right completion will be required.
In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".
Script to test:
curl http://127.0.0.1:8008/v1/completions -k \
-H 'Content-Type: application/json' \
-d '{
"model": "smallcloudai/Refact-1_6b-fim",
"prompt": "def hello_world():\n \"\"\"\n This function prints \"Hello World!!!\" and brews coffee.\n \"\"\"",
"stream": true,
"echo": false,
"stop": ["\n\n"],
"temperature": 0.8,
"max_tokens": 50
}'
Both streaming and non-streaming should work, and the CPU output should match the current GPU output -- that sounds like a well-defined criterion.
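The same test can be driven from Python instead of curl; the sketch below assumes the server streams OpenAI-style `data: {...}` lines (server-sent events), which may differ from the actual wire format.

```python
# Rough Python equivalent of the curl test above; assumes OpenAI-style
# "data: {...}" server-sent-event lines, which may not match the real format.
import json
import requests

payload = {
    "model": "smallcloudai/Refact-1_6b-fim",
    "prompt": 'def hello_world():\n    """\n    This function prints "Hello World!!!" and brews coffee.\n    """',
    "stream": True,
    "echo": False,
    "stop": ["\n\n"],
    "temperature": 0.8,
    "max_tokens": 50,
}

with requests.post("http://127.0.0.1:8008/v1/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)
```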
@olegklimov
That's exactly what I was looking for, thank you for the update.
I'll be reviewing the other open bounties in the coming days as well.
Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.
If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.
Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.
/attempt https://github.com/smallcloudai/refact/issues/77
@ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward.
@ds5t5: To receive payouts, sign up on Algora, link your GitHub account and connect with Stripe on your dashboard.
Testing this:
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python" --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0
I see speed:
Xeon 5315Y | Threads (-t N) | Speed (tokens/s)
---|---|---
 | -t 2 | 6
 | -t 4 | 11
 | -t 8 | 11
 | -t 16 | 4
On the M1, speed doesn't depend on the number of threads.
Time to first token, with a 551-token prompt:
I'd say that's the main problem for adoption. A 551-token prompt isn't even that big; normally we have about 1950 tokens.
I tried Starcoder 1b, converted by TabbyML:
https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml
"-m", "starcoder-1b-q8_0.gguf",
897.71 ms / 557 tokens ( 1.61 ms per token, 620.47 tokens per second)
1334.68 ms / 49 runs ( 27.24 ms per token, 36.71 tokens per second)
"-m", "./starcoder-1b-f16.gguf",
841.99 ms / 557 tokens ( 1.51 ms per token, 661.53 tokens per second)
243.18 ms / 49 runs ( 45.78 ms per token, 21.84 tokens per second)
"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
175.27 ms / 557 tokens ( 2.11 ms per token, 473.93 tokens per second)
962.51 ms / 49 runs ( 60.46 ms per token, 16.54 tokens per second)
@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.
@olegklimov
- MacBook Air M1
Try the 4-bit model; you should see a performance boost compared to the 16-bit model.
4-bit
llama_print_timings: load time = 45.88 ms
llama_print_timings: sample time = 3.91 ms / 300 runs ( 0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time = 56.82 ms / 9 tokens ( 6.31 ms per token, 158.38 tokens per second)
llama_print_timings: eval time = 6762.85 ms / 299 runs ( 22.62 ms per token, 44.21 tokens per second)
llama_print_timings: total time = 6933.22 ms
8-bit
llama_print_timings: load time = 71.79 ms
llama_print_timings: sample time = 3.72 ms / 300 runs ( 0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time = 54.23 ms / 9 tokens ( 6.03 ms per token, 165.94 tokens per second)
llama_print_timings: eval time = 11387.12 ms / 299 runs ( 38.08 ms per token, 26.26 tokens per second)
llama_print_timings: total time = 11553.91 ms
16-bit
llama_print_timings: load time = 5828.46 ms
llama_print_timings: sample time = 4.17 ms / 300 runs ( 0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time = 72.36 ms / 9 tokens ( 8.04 ms per token, 124.38 tokens per second)
llama_print_timings: eval time = 20573.06 ms / 299 runs ( 68.81 ms per token, 14.53 tokens per second)
llama_print_timings: total time = 20760.76 ms
The 16-bit and 32-bit converted tensor formats perform about the same on lower-end hardware.
Also, llama.cpp is still working on its FIM implementation.
In case you aren't too familiar with the library or quant types: quants range from 2-bit to 16-bit, and k-quant variants are supported.
OK it works nicely! So all the credit goes to @ds5t5, right?
@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)
@ds5t5 Hi there!
We are going to slightly change the modelling code and the weights on HF. The changes will include merging:
- attn.k and attn.v into attn.kv
- mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj
Guess we need to update https://github.com/ggerganov/llama.cpp/pull/3329 as well
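At the state-dict level the fusion amounts to concatenating the split projections; a hypothetical sketch is below (parameter names, shapes, and layer count are illustrative, not taken from the actual refact modelling code).

```python
# Hypothetical sketch of the weight fusion described above; parameter names,
# shapes and the layer count are illustrative, not from the refact codebase.
import torch

def fuse_refact_weights(sd: dict, n_layers: int = 32) -> dict:
    fused = dict(sd)
    for i in range(n_layers):
        k, v = f"transformer.h.{i}.attn.k.weight", f"transformer.h.{i}.attn.v.weight"
        if k in fused and v in fused:
            # attn.k + attn.v -> attn.kv, concatenated along the output dimension
            fused[f"transformer.h.{i}.attn.kv.weight"] = torch.cat([fused.pop(k), fused.pop(v)], dim=0)
        g, u = f"transformer.h.{i}.mlp.linear_1.weight", f"transformer.h.{i}.mlp.linear_3.weight"
        if g in fused and u in fused:
            # mlp.linear_1 + mlp.linear_3 -> mlp.gate_up_proj
            fused[f"transformer.h.{i}.mlp.gate_up_proj.weight"] = torch.cat([fused.pop(g), fused.pop(u)], dim=0)
    return fused
```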
Thanks. Let me know when the model weights are ready; I will rebase my llama.cpp PR onto the latest branch of llama.cpp.
@JegernOUTT Can I ask why we decided to make the weight change? It seems not quite aligned with other popular models: they (Falcon, LLaMA) usually keep mlp.linear_1 and mlp.linear_3 separate, while for attention it is usually a fused qkv or separate q/k/v. Only the original GPT-2 model fuses kv into one tensor.
@ds5t5 We've updated the weights
We are using different inference backends in refact, and when we train LoRA models we struggle with modelling differences. So we've decided to make these changes to the model and synchronize the implementation everywhere, rather than keep some "hacks".
@JegernOUTT it seems like the latest push breaks tokenizer = AutoTokenizer.from_pretrained("smallcloudai/Refact-1_6B-fim")
@ds5t5 what problem do you have? I've just checked it and found no issues
Never mind, I removed my cache and it works.
I'm working on a mod to get the HF Refact model to run on CPU, since I don't have a working GPU backend at the moment. There aren't too many changes either; I just need to get the server running.
I'm also working on a Refact template for llama-cpp-python for inference in refact, so it would just be plug and play. This won't work until @ds5t5's downstream changes make it into llama-cpp-python, though.
Hopefully I'll have it done by the end of this weekend.
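As a rough illustration of what such a template could look like (not the actual implementation), the sketch below assumes a local GGUF conversion of Refact-1.6B and StarCoder-style FIM special tokens; both the path and the token names are assumptions.

```python
# Rough FIM sketch with llama-cpp-python; the GGUF path and the StarCoder-style
# FIM tokens (<fim_prefix>/<fim_suffix>/<fim_middle>) are assumptions.
from llama_cpp import Llama

llm = Llama(model_path="./Refact-1_6B-fim/ggml-model-q4_0.gguf", n_ctx=2048)

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

out = llm(prompt, max_tokens=32, temperature=0.2, stop=["<|endoftext|>"])
print(out["choices"][0]["text"])
```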
@teleprint-me We were thinking more along the lines of bundling llama.cpp with our Rust binary, linked together. The Rust binary ships with our next-gen plugins, such as the one for VS Code. This might allow for a much lower cost of installation for the end user: no Docker, nothing to install, no strange packages in local Python, nothing to run separately or care about.
The largest problem is prompt prefill, about 4 seconds for 2048 tokens, on Apple M1. That's a bit too long for interactive use.
So I asked in llama.cpp what people think about an architecture more suitable for CPU or M1, here: https://github.com/ggerganov/llama.cpp/discussions/3395 . We can train a new model so it prefills the prompt faster; we have the data and the GPUs!
Or maybe the M2 will fix the speed :joy: (I haven't tried it yet).
@olegklimov
Alright, no worries! After reviewing the code and attempting to come up with a minimalistic solution, this sounds like a better path forward, if I'm being honest. You should probably mark this as solved. @ds5t5 definitely got this one.
I have updated the converter in the llama.cpp PR based on the latest revision on the Hugging Face hub. It looks like the llama.cpp community wants to wait for a few PRs to be merged before the Refact PR is officially merged. I see another 5-10% performance boost after rebasing my change onto the latest commit of llama.cpp. @olegklimov
@ds5t5: Your claim has been rewarded! We'll notify you once it is processed.
@ds5t5 has been awarded $2,000!
The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?
> The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?

Any updates?
Yes, we'll release bring-your-own-key in a few days
> Yes, we'll release bring-your-own-key in a few days
Bring your own key is there, but the docker container still doesn't work on an M1.
You are right, it doesn't. Other servers do work, though; you can help us if you test them!
There are several projects aiming to make inference on CPU efficient.
The first part is research: inference_hf.py does it (needs a callback that streams output and allows to stop). Please finish the first part, get a "go-ahead" for the second part.
The second part is implementation: inference_hf.py,
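The issue text breaks off above. As one possible reading of the "callback that streams output and allows to stop" requirement, here is a hypothetical interface sketch (names are made up, not from the repo):

```python
# Hypothetical shape for the "callback that streams output and allows to stop"
# requirement; names are made up and do not come from the refact codebase.
from typing import Callable, Iterable

# The callback receives each new text fragment and returns False to stop early.
StreamCallback = Callable[[str], bool]

def generate_with_callback(token_stream: Iterable[str], on_token: StreamCallback) -> str:
    produced = []
    for fragment in token_stream:
        produced.append(fragment)
        if not on_token(fragment):  # e.g. client disconnected or stop sequence hit
            break
    return "".join(produced)

# Usage sketch: stop once a blank line appears, like the "stop": ["\n\n"] field above.
text = generate_with_callback(
    iter(["def hello():", "\n", "    pass", "\n\n", "unused"]),
    lambda fragment: "\n\n" not in fragment,
)
```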