sourcegraph / cody

Type less, code more: Cody is an AI code assistant that uses advanced search and codebase context to help you write and fix code.
https://cody.dev

bug: Ollama yields tiny context and pure hallucination. #5611


enn-nafnlaus commented 1 month ago

Version

v1.34.3

Describe the bug

Cody seems to be working fine normally. However, I recently started ollama to run local models, in particular deepseek coder v2 lite (q4), and with it the context window is ridiculously short and the output is pure hallucination.

1) It only lets me include the first 172 lines of one file. A word counter says that's 597 words long; even accounting for some long "words" (like a couple of regex statements), that's clearly no more than a couple thousand tokens. Deepseek Coder v2 Lite supports a 128k-token context. Are you not setting the context size in the API, and just using the default? (A sketch of the request option I mean is below, after the example output.)

2) Unlike with e.g. Claude, what I get back has absolutely nothing to do with the provided context. Presented with the first 100 lines of a Python mapping program and an instruction "Add debug", it writes:

```
deepseek coder v2:latest <|begin▁of▁sentence|># 1. Problem [the Chinese portions below are translated]
Given a binary tree, find its minimum depth. The minimum depth is the number of nodes along the shortest path from the root node down to the nearest leaf node. Note: A leaf is a node with no children.

Example:

Given binary tree [3,9,20,null,null,15,7],
    3
   / \
  9  20
    /  \
   15   7
return its minimum depth = 2.

1. Solution
2.1 Approach
This problem asks for the minimum depth of a binary tree. We can solve it with recursion: for each node, check the depth of its left and right subtrees and take the smaller of the two plus 1 (the current node also counts as a level). If a node has no left or right subtree, only consider the side that has a subtree.

2.2 Code implementation

# Definition for a binary tree node.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def minDepth(root: TreeNode) -> int:
    if not root:
        return 0

    # If the node is a leaf, its depth is 1
    if not root.left and not root.right:
        return 1

    # Initialize the minimum depth to a large number
    min_depth = float('inf')
```

It should go without saying that this has absolutely nothing to do with my code (which is not about binary trees, and the reply even came back partly in Chinese), nor anything to do with the instruction to add debug.
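To be clear, by "setting the context size in the API" I mean something like Ollama's `num_ctx` option. If a request leaves it out, Ollama falls back to its small default window (a couple thousand tokens), which would match what I'm seeing. A rough sketch of what I mean, with the model name and prompt as placeholders for whatever Cody actually sends (POSTed to `/api/generate` on the local Ollama server, e.g. `http://localhost:11434`; 131072 stands in for the model's 128k limit):

```json
{
  "model": "deepseek-coder-v2:latest",
  "prompt": "...",
  "stream": false,
  "options": {
    "num_ctx": 131072
  }
}
```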

Expected behavior

It should allow use of the full 128k context window and the outputs should at least remotely pertain to the task.

Additional context

No response

PriNova commented 1 month ago

Hey @enn-nafnlaus

Do you use your custom model for autocomplete or in chat? For the latter, configure your Ollama dev model as follows:

"cody.dev.models": [
    {
        "provider": "ollama",
        "model": "your_model",
        "inputTokens": 65000,
        <additional_configs like apiEndpoint etc.>
    }
]

The important part here is setting `inputTokens` correctly.
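For example, a complete entry might look like this (the model name and `apiEndpoint` are placeholders; adjust them to whatever your local Ollama instance serves, 11434 being Ollama's default port):

```json
"cody.dev.models": [
    {
        "provider": "ollama",
        "model": "deepseek-coder-v2:latest",
        "inputTokens": 65000,
        "apiEndpoint": "http://localhost:11434"
    }
]
```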

enn-nafnlaus commented 1 month ago

Thanks for the response, and sorry for the delay (had to deal with taxes). :) Your suggestion fixed the context length (after a restart), but it's still pure hallucination. E.g.:

image

Just to demonstrate that ollama and the model works locally:

image

(I'd also add that it's frustrating that ollama is the only available local server, given that the APIs of the different local servers aren't radically different. Ollama gives you so little control over how the model is run compared to, say, llama.cpp: I can't control cross-card memory allocation (it's not even spanning multiple GPUs), batch sizes, speculative decoding, and so on.)

createchange commented 4 weeks ago

I will add that I experience the hallucination issue with this family of models as well. However, using other local models, I have no such issue.

deepseek-coder-v2:lite:

image

deepseek-coder:6.7b (I killed the generation here; it would otherwise run on indefinitely):

image

qwen2.5-coder:7b works just fine:

image

Seems like some models only work well for code completion, while others are better at analyzing code. I haven't found a model to run locally that can serve as both.

I am a layperson, but maybe this will also be useful info...

If I run `ollama serve` (as opposed to running it in the background), I notice this output appears for the model in question:

check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
[GIN] 2024/10/12 - 05:51:01 | 200 |  547.416542ms |       127.0.0.1 | POST     "/api/generate"
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
[GIN] 2024/10/12 - 05:51:01 | 200 |  568.794375ms |       127.0.0.1 | POST     "/api/generate"
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
[GIN] 2024/10/12 - 05:51:01 | 200 |  448.858958ms |       127.0.0.1 | POST     "/api/generate"

I see no such thing for other models that behave well.