kantord / SeaGOAT

local-first semantic code search engine
https://kantord.github.io/SeaGOAT/
MIT License
970 stars, 62 forks

KeyError on line 212 in engine.py on request. #226

Closed benbot closed 9 months ago

benbot commented 1 year ago

Just installed it on my work laptop (running macOS).

The server usually crashes once I make a request, on line 212 in engine.py, complaining about a KeyError on one of the files.

I had it working one time, on my third try starting the server. Not sure I did anything different, though.

I can't post the log here unfortunately :(

kantord commented 1 year ago

Can you please help me reproduce this error by sharing a little bit more information?

Also, I'm curious whether it only crashes on one specific repository, or if it crashes for everything.

BreakTheBeta commented 1 year ago

I'm getting the same issue.

M2 Max, seagoat installed via pipx, version 0.28.0, tinygrad/tinygrad repo.

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/user/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 79, in _worker_function
    self._handle_task(context, task)
  File "/Users/user/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 66, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 83, in handle_query
    results = context["seagoat_engine"].get_results(kwargs["limit_clue"])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 209, in get_results
    sorted(
  File "/Users/user/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 213, in <lambda>
    + 0.3 * normalize_file_position(top_files[x.path])
                                    ~~~~~~~~~^^^^^^^^
KeyError: 'state.py'

Printing out top_files: {'tensor.py': -0.6955769938501167}

kantord commented 1 year ago

Regarding this, just out of curiosity: is the file state.py gitignored? Or perhaps it's a new file that has not been committed yet?

Just trying to figure out why it would not be included in top_files, as that is generated based on git history.
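
For illustration only, here is a minimal sketch of how a map derived purely from git history can lack a path that a search result refers to. The helper names `count_touched_files` and `git_touched_files` are hypothetical and are not SeaGOAT's actual implementation; the only external command assumed is `git log --name-only --pretty=format:`.

```python
import subprocess
from pathlib import Path


def count_touched_files(log_output: str) -> dict[str, int]:
    """Count how many times each path appears in `git log --name-only` output."""
    counts: dict[str, int] = {}
    for line in log_output.splitlines():
        path = line.strip()
        if path:  # blank lines separate commits
            counts[path] = counts.get(path, 0) + 1
    return counts


def git_touched_files(repo: Path) -> dict[str, int]:
    """Hypothetical helper: build the map from a repository's commit history."""
    completed = subprocess.run(
        ["git", "-C", str(repo), "log", "--name-only", "--pretty=format:"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_touched_files(completed.stdout)


# A file that was never committed (or was only just created) has no entry,
# so indexing the map directly with its path would raise KeyError.
history = count_touched_files("tensor.py\nops.py\n\ntensor.py\n")
print("state.py" in history)
```

A map built this way only knows about paths that appear in at least one commit, which is why an uncommitted file is the first thing to rule out.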

BreakTheBeta commented 1 year ago

Repo I'm using: https://github.com/tinygrad/tinygrad

Running the server in the ..../tinygrad folder.

state.py is not gitignored.

benbot commented 1 year ago

Just had the crash happen again in https://github.com/Oneirocom/Magick/

This time the server hadn't finished processing all the chunks (60K), but it was the same error as on the other project, where everything had finished processing.

Magick is a large JS project and the other was a medium-sized Java project.

Also, this time I'm on Arch Linux. So this is happening on at least Arch and macOS.

  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/benbot/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 79, in _worker_function
    self._handle_task(context, task)
  File "/home/benbot/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 66, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benbot/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 83, in handle_query
    results = context["seagoat_engine"].get_results(kwargs["limit_clue"])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benbot/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 208, in get_results
    sorted(
  File "/home/benbot/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 212, in <lambda>
    + 0.3 * normalize_file_position(top_files[x.path])
                                    ~~~~~~~~~^^^^^^^^
KeyError: 'packages/@types/rete-connection-reroute-plugin.d.ts'

benbot commented 1 year ago

that file isn't in the .gitignore either

janeshchhabra commented 1 year ago

Hitting this on Mac as well, on a file which is not in gitignore.

I am running it one level into the folder, not from the root, so there is that.

yubrshen commented 1 year ago

I might also have hit the KeyError; here is the trace:

Analyzing source code: 0it [00:00, ?it/s]
2023-09-22 08:57:07,014 Analyzed the minimum number of chunks needed to operate.
2023-09-22 08:57:07,014 Analyzed all chunks!
2023-09-22 08:57:07,014 Handling task: query
/home/yshen/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████████████████████████████| 79.3M/79.3M [00:07<00:00, 10.6MiB/s]
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/yshen/miniconda3/envs/seagoat-python311/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/home/yshen/miniconda3/envs/seagoat-python311/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 79, in _worker_function
    self._handle_task(context, task)
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 66, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 83, in handle_query
    results = context["seagoat_engine"].get_results(kwargs["limit_clue"])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 208, in get_results
    sorted(
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 212, in

The file named as the key in the KeyError is indeed a file in the code base.

I had started SeaGOAT just a few minutes before, then I typed: > gt "sourcetypes" and got the above error and trace.

Might I need to wait longer, even after the server finishes scanning the code base?

I'm running Ubuntu 24.4 in WSL2 on Windows 11. The file the KeyError complained about is not tracked by git, but in the same repo the same error also happened with a file that is tracked by git and not ignored:

  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 212, in <lambda>
    + 0.3 * normalize_file_position(top_files[x.path])
                                    ~~~~~~~~~^^^^^^^^

I'll try a different repo.

kantord commented 1 year ago

Might I need to wait longer, even after the server finishes scanning the code base?

No, that should not be necessary at all!

yubrshen commented 1 year ago

What are the requirements for a repo to work with gt?

kantord commented 1 year ago

What are the requirements for a repo to work with gt?

  • Must be a git repository
  • All files must be checked in, not ignored, and committed?

It just needs to be a git repository. Even if there are no files that are actually committed, it should still work. Actually, by design it even works with files that you have just recently created.

kantord commented 1 year ago

I have a suspicion that

KeyError on line 212 in engine.py on request.

has to do with 2 competing versions of the file existing somehow, or maybe the file no longer has the line that it was last analyzed with. I think that this would be solved by grouping the results by SHA1 hash and using git to retrieve the correct version of the file.

I suspect this is a different error. I have only one theory for it: maybe a result appears through ripgrep but is not anywhere in git history. Maybe there is a bug where files that have not been committed yet are not included in top_files, but that would only be possible if the file is not in any previous commit :thinking:

    + 0.3 * normalize_file_position(top_files[x.path])
                                    ~~~~~~~~~^^^^^^^^
KeyError: 'packages/@types/rete-connection-reroute-plugin.d.ts'

elephanter commented 1 year ago

I found out that the error occurs because x.path is lowercase but the key inside top_files has an uppercase character. I think that comes from the repository class, where the files of each commit are processed, at this line:

if not (self.path / filename).exists():
    continue

Perhaps I renamed that file from uppercase. I haven't checked, but people say that .exists() on Mac is case-insensitive. So I took the method from here and used it to replace .exists(): https://stackoverflow.com/questions/6710511/case-sensitive-path-comparison-in-python

Now I get the same error. The uppercase file is no longer in the top_files hash, but the current lowercase file is not in there either; it is still in the results, so it fails here again.
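
A case-sensitive existence check along these lines can be sketched by comparing each path component against the actual directory listing, which keeps the on-disk casing even on a case-insensitive filesystem. This is only an illustration: `exists_case_sensitive` is a hypothetical helper, not the code from the linked answer or from SeaGOAT, and it assumes `relative` is a normalized path below `root` (no `..` components).

```python
import os
from pathlib import Path


def exists_case_sensitive(root: Path, relative: str) -> bool:
    """Return True only if every component of `relative` matches on-disk casing."""
    current = root
    for part in Path(relative).parts:
        try:
            names = os.listdir(current)
        except OSError:
            # `current` is missing or is not a directory.
            return False
        if part not in names:  # exact, case-sensitive string comparison
            return False
        current = current / part
    return True
```

For example, if the directory contains `State.py`, then `exists_case_sensitive(root, "state.py")` returns False regardless of how the filesystem itself treats case.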

elephanter commented 1 year ago

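Temporarily fixed that error by changing the return statement to:

# Needs `from pathlib import Path` at the top of the module.
return list(
    sorted(
        results_to_sort,
        key=lambda x: (
            0.7 * normalize_score(x.get_best_score(self.query_string))
            # Fall back to 0 when the path has no entry in top_files.
            + 0.3 * normalize_file_position(top_files.get(Path(x.path).as_posix(), 0))
        ),
    )
)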
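
The workaround above boils down to a dictionary lookup with a default. Below is a self-contained sketch of the same idea using stand-ins: `SearchResult`, `rank_results`, and the `default_position` parameter are hypothetical, and the normalization helpers from engine.py are left out because their definitions do not appear in this thread.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchResult:
    """Hypothetical stand-in for a search result; a lower score sorts first."""
    path: str
    score: float


def rank_results(results, top_files, default_position=0.0):
    """Sort results, tolerating paths that are missing from top_files."""
    def sort_key(result):
        # .get() falls back to default_position instead of raising KeyError.
        position = top_files.get(Path(result.path).as_posix(), default_position)
        return 0.7 * result.score + 0.3 * position

    return sorted(results, key=sort_key)


top_files = {"tensor.py": -0.6955769938501167}  # value printed earlier in the thread
results = [SearchResult("state.py", 0.2), SearchResult("tensor.py", 0.2)]
ranked = rank_results(results, top_files)
print([r.path for r in ranked])
```

The default decides where unknown files land in the ranking; 0 is simply the value used in the workaround, not a recommendation.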

kantord commented 1 year ago

I noticed that one way this error can happen is if a file is found by ripgrep before the repo has been analyzed. This can happen if you create a file while the server is analyzing files and then make a query before all files are analyzed. That is because the server does not look for more files to analyze while there are still files in the queue.

But I'm curious if the same error can happen in other circumstances as well :thinking:
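
One way to check for the situation described above (a file picked up by ripgrep before it has been analyzed) is to compare the paths in the search results against the keys of top_files before sorting. A small diagnostic sketch follows; `paths_missing_from_snapshot` is a hypothetical helper, not part of SeaGOAT.

```python
def paths_missing_from_snapshot(result_paths, top_files):
    """Return the result paths that have no entry in the top_files mapping."""
    return sorted(set(result_paths).difference(top_files))


# Snapshot taken while analysis was still running.
top_files = {"tensor.py": -0.6955769938501167}
# Paths returned by a later query, including a file created after the snapshot.
result_paths = ["tensor.py", "state.py"]
print(paths_missing_from_snapshot(result_paths, top_files))
```

Logging this list when it is non-empty would show whether the crash always involves files that were created or renamed after analysis started.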

kantord commented 1 year ago

Reopening because only the error regarding files not being found was fixed; the error regarding lines not being found probably still persists.