paolorechia / learn-langchain


Running code generation on your own API Docs #29

Open mikolodz opened 1 year ago

mikolodz commented 1 year ago

@paolorechia Did you consider using something ready-made, like this dataset, to fine-tune the models: https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca

I'm currently working on a similar project (like the one you described in the closed issue), where an LLM generates Python code using my API. The code is evaluated and executed on the server that issued the request. It works with GPT-3.5-turbo, and now I'm solving the puzzle of making it work with an open model. Biggest problem? 2k tokens are not enough to fit the API docs within the prompt. LangChain, on the other hand, should help here: if the right embedding method is used, the right snippets of the documentation can be passed within the prompt. But there is still quite a chance that the context won't be fed the right chunks from the DB but something random instead. I already ran into that recently, and I'm not sure how much it depends on the structure of the docs.
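
For concreteness, here's roughly what I mean by LangChain-style retrieval, as a minimal sketch (the file path, embedding model, and chunk sizes are just placeholder assumptions):

```python
# Minimal sketch: embed the API docs into a vector store and pull only the
# most relevant chunks into the prompt. Everything concrete here (path,
# embedding model, chunk sizes) is a placeholder assumption.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

with open("api_docs.md") as f:  # hypothetical dump of the API docs
    docs_text = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(docs_text)

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = FAISS.from_texts(chunks, embeddings)

# Retrieve the chunks most similar to the request; whether these are the
# *right* chunks is exactly the failure mode described above.
task = "Write Python code that calls the /users endpoint"  # hypothetical
relevant = store.similarity_search(task, k=3)
context = "\n\n".join(doc.page_content for doc in relevant)
prompt = f"API documentation:\n{context}\n\nTask: {task}\n"
```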

I'm considering fine-tuning, just as you described, but it would take a couple of weeks to prepare the dataset properly on your own data. Also, whenever I extend the API I'd have to train again, which takes another couple of hours, and the results are not predictable every single time.

So I believe the best approach would be to fine-tune some simple model like Wizard 7B or wizard-vicuna-13b (which gets me the best results so far) to write Python code properly (using an external, ready-made dataset), and then pass the right chunks of the API docs in the context (LangChain style).

What do you think about it?

paolorechia commented 1 year ago

Very interesting, thanks for sharing the dataset.

I’ve been generating my own data and did the first fine-tuning today, though it was quite dirty, so the results are still low quality. I’ll take a look at the dataset you shared though.

About your use case: have you looked into StarCoder or its instruct-tuned version? https://huggingface.co/HuggingFaceH4/starchat-alpha

StarCoder has a context window of 8k tokens, so presumably the instruct version does too. Would that be enough for you? The downside is that it’s 16B parameters, BUT there’s a GPTQ fork to quantize it. It’s probably one of the next things I’d try on my end too, since it also has an Apache 2.0 license, so I could use it commercially :)
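
If you want to give it a quick spin, the model card shows loading it with a plain transformers pipeline, along the lines of this sketch (the dialogue template is the one the card describes; the sampling settings are my assumptions):

```python
# Sketch of trying starchat-alpha via transformers, roughly following the
# model card. Needs a GPU big enough for the unquantized ~16B weights.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-alpha",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The <|user|>/<|assistant|> template below is taken from the model card.
prompt = (
    "<|system|>\n<|end|>\n"
    "<|user|>\nWrite a Python function that reverses a string.<|end|>\n"
    "<|assistant|>"
)
out = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2)
print(out[0]["generated_text"])
```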

mikolodz commented 1 year ago

Interesting! Thank you! Do you mean something specific when referring to the instruction-following version? I tried StarCoder (4-bit, mayank31398_starcoder-GPTQ-4bit-128g) and it doesn't want to follow instructions.

[Screenshot from 2023-05-13 20-55-59]

paolorechia commented 1 year ago

@mikolodz Yeah, I meant starchat-alpha, which is StarCoder fine-tuned with an instruction dataset (I shared a link to the HF page in my previous comment). Not sure if it’s easy to find it quantized already. Another option is giving StarCoder a big prompt for in-context learning, though I found it painfully slow. See the prompt from HuggingFace: https://huggingface.co/datasets/bigcode/ta-prompt
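
The mechanics are simple: prepend the whole ta-prompt to every request and add a fresh Human/Assistant turn, something like this sketch (the exact file name in the dataset repo is an assumption, so double-check it):

```python
# Sketch of the in-context approach: load the long tech-assistant prompt
# from the dataset repo and append a new Human/Assistant turn.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bigcode/ta-prompt",
    filename="TA_prompt_v1.txt",  # assumed file name -- verify in the repo
    repo_type="dataset",
)
with open(path) as f:
    ta_prompt = f.read()

task = "Write me a python script which fetches the website content."
prompt = f"{ta_prompt}\nHuman: {task}\n\nAssistant:"
# Feed `prompt` to StarCoder and stop generating at the next "-----"
# separator, so the model doesn't invent the following Human turn.
```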

mikolodz commented 1 year ago

@paolorechia Thanks for the link. I had used this prompt before I posted my previous comment (although I copied it from the paper and may have messed something up). I tried again to make sure I wrote the prompt correctly, and it doesn't seem to perform any better than other models. I tried the same task a couple of times, with a different result each time. None of it worked, of course, and only 2 out of 4 tries generated any code.

Starchat, on the other hand, looks much more promising, but there's no 4-bit version available yet. Definitely worth trying when it's out! :)

Example 1:

# here is the prompt above
Human: Write me a python script which fetches the website content (including all sub-pages). Clean text is parsed from the content and stored inside output.txt file. Url is given as an argument to the script.

Assistant:  Sure thing. Please note that cleaning HTML tags is outside scope of our services currently. However, you could definitely build something similar based off of https://github.com/mozilla/bleach.

We recommend reading the documentation before starting coding though :)

-----
"""

Example 2:

# here is the prompt above
Human: Write me a python script which fetches the website content (including all sub-pages). Clean text is parsed from the content and stored inside output.txt file. Url is given as an argument to the script.

Assistant: Sounds good. If you look closely enough, you'll notice that the url ends with “/”, indicating that it points to root directory of a web site. Your job is simple – fetch everything underneath this page including images etc., parse clean text from html files and save it somewhere locally. Then send email containing contents of local txt file alongside link to original webpage URL.

Therefore, our task becomes simply getting raw HTML source code over HTTP GET request, parsing plain text from response body, saving cleaned text to disk, attaching the saved file to outgoing mail message and sending it off.

To keep things simple, I'd recommend starting with urllib library provided standard lib.

BTW. On my setup It's bit slow, but not that much. It doesn't load properly with monkey_patch, so it's about 2-3x slower than other 13B models concerning the huge starting prompt @ ~10 tokens/sec.

paolorechia commented 1 year ago

If you have the hardware and the time, you might be able to quantize it yourself using this: https://github.com/mayank31398/GPTQ-for-SantaCoder

Not sure though!
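
Alternatively, if someone has already pushed a quantized checkpoint (like the one you mentioned), loading it with AutoGPTQ might be the path of least resistance. Untested sketch, assuming AutoGPTQ supports StarCoder's gpt_bigcode architecture and that the repo id matches the folder name you posted:

```python
# Untested sketch: load an already-quantized StarCoder checkpoint with
# AutoGPTQ. Both the repo id (guessed from the folder name mentioned above)
# and gpt_bigcode support in AutoGPTQ are assumptions worth verifying.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "mayank31398/starcoder-GPTQ-4bit-128g"  # guessed repo id
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")  # gated repo
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```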

Another model I've never tried out: https://github.com/BlinkDL/RWKV-LM

Apparently it can handle longer context windows since it's actually an RNN! The README is a bit confusing, but it seems to mention an example with an 8k-token context window.

With this one you would probably need fine-tuning though.