supermaven-inc / supermaven-nvim

The official Neovim plugin for Supermaven
https://supermaven.com/
MIT License
543 stars 27 forks

supermaven-nvim sends the entire buffer to the server even when ignore_filetypes is configured to skip the file #85

Open cfal opened 3 weeks ago

cfal commented 3 weeks ago

supermaven-nvim adds a TextChanged autocmd here which calls binary:on_update https://github.com/supermaven-inc/supermaven-nvim/blob/d71257f431e190d9236d7f30da4c2d659389e91f/lua/supermaven-nvim/document_listener.lua#L20

BinaryLifecycle:on_update sends everything to stdin which i assume ends up writing to the server (it's a closed source binary that is fetched so i can't easily check): https://github.com/supermaven-inc/supermaven-nvim/blob/d71257f431e190d9236d7f30da4c2d659389e91f/lua/supermaven-nvim/binary/binary_handler.lua#L77

this code path never seems to hit poll_once which is the only place where ignore_filetypes seems to be checked: https://github.com/supermaven-inc/supermaven-nvim/blob/d71257f431e190d9236d7f30da4c2d659389e91f/lua/supermaven-nvim/binary/binary_handler.lua#L293

it seems misleading that ignore_filetypes doesn't actually ignore files of that filetype and instead will send everything in every buffer backed by a file.

cfal commented 3 weeks ago

seems like this is a dupe of https://github.com/supermaven-inc/supermaven-nvim/pull/35/files from 3 months ago which isn't even merged. what an incredible lack of urgency for a huge privacy issue.

sm-victorw commented 3 weeks ago

https://github.com/supermaven-inc/supermaven-nvim/pull/35/files does not address the issue being raised here, as the sm-agent binary automatically includes files in the git repo as part of the context, even if they are not opened.

If a file contains sensitive information it should be included in .gitignore, as Supermaven does not send .gitignore files to the server even if they are opened. Alternatively, you could include a .supermavenignore and globs specified in that file will also not be sent to the server.

This isn't clear from documentation, so I think we could change that...

cfal commented 3 weeks ago

https://github.com/supermaven-inc/supermaven-nvim/pull/35/files does not address the issue being raised here, as the sm-agent binary automatically includes files in the git repo as part of the context, even if they are not opened.

this also needs to be clearly documented.

If a file contains sensitive information it should be included in .gitignore, as Supermaven does not send .gitignore files to the server even if they are opened. Alternatively, you could include a .supermavenignore and globs specified in that file will also not be sent to the server.

this is untenable for large or internal repos. imo there should be a way (allowlist and blocklist) to configure which repos to enable.

ahmedelgabri commented 3 weeks ago

this is untenable for large or internal repos. imo there should be a way (allowlist and blocklist) to configure which repos to enable.

I think this https://github.com/supermaven-inc/supermaven-nvim/pull/58 could solve it in a programmable way. Check the path of the file, and disable supermaven when needed. Because ignore_filetypes is very limited. But again, was open for 2 months and not merged yet.

sm-victorw commented 3 weeks ago

I've merged both PRs mentioned, as they are useful in their own rights, and could seemingly help address some of the privacy concerns here, though as I mentioned earlier these don't address the underlying issue involving sm-agent, which .supermavenignore was intended to solve

GitMurf commented 3 weeks ago

...though as I mentioned earlier these don't address the underlying issue involving sm-agent, which .supermavenignore was intended to solve

Thank you for confirming because I was wondering the same thing. I believe your point is that the full context of the repository is sent at startup via the sm binary which has nothing to do with the neovim plugin / config? And the only way to prevent things is either in the .gitignore or .supermavenignore as the binary respects those by default out of the box (regardless of anything in the neovim plugin).

Do I have this correct @sm-victorw ?

GitMurf commented 3 weeks ago

@sm-victorw this brings up two further questions I have been wondering about:

  1. is there any command we can run to see exactly what supermaven is using as context and has sent to the servers?

    • If not, this would be a wonderful command to add to the neovim plugin to be able to log out all the files sent / in context
    • This would give users confidence / peace of mind on what is actually being sent
    • And to help them test their configuration to make sure it is doing what they want.
  2. what if I am in a github repo (cwd) in neovim but open up a buffer with a file from outside the repo? See example:

Thanks in advance for clearing these things up!

sm-victorw commented 3 weeks ago

Thank you for confirming because I was wondering the same thing. I believe your point is that the full context of the repository is sent at startup via the sm binary which has nothing to do with the neovim plugin / config? And the only way to prevent things is either in the .gitignore or .supermavenignore as the binary respects those by default out of the box (regardless of anything in the neovim plugin).

Yes this is roughly what is happening, though depending on how large the repository is, the context might not include everything. Also note that the context is kept on the server for up to 7 days, as mentioned in the code policy (https://supermaven.com/code-policy)

is there any command we can run to see exactly what supermaven is using as context and has sent to the servers?

There isn't currently any way to see exactly what is being included in the context. If you are interested in which files are eligible to be included, the sm-agent binary, typically located at $HOME/.supermaven/binary/[version]/[platform]-[arch]/sm-agent, can be run with the list-files command to see what isn't being ignored, e.g. ./sm-agent list-files /path/to/repo

If you are interested in whether or not a file is being ignored, ./sm-agent check-ignore /path/to/file can be used as well
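Since the version and platform directories differ per machine, a small helper can locate the binary before running the two commands above. This is only a sketch: `find_sm_agent` is a hypothetical helper name, and it assumes the directory layout described in this comment. It picks the last match in lexical order, which is not necessarily the newest version.

```shell
#!/bin/sh
# Hypothetical helper: print the path of an sm-agent binary found under the
# given root (default: $HOME/.supermaven/binary).
# Assumption: layout is <root>/<version>/<platform>-<arch>/sm-agent.
# If several versions exist, the last one in lexical order is printed.
find_sm_agent() {
    root="${1:-$HOME/.supermaven/binary}"
    find "$root" -type f -name sm-agent 2>/dev/null | sort | tail -n 1
}

# Usage (the subcommands are the ones quoted in the comment above):
#   agent="$(find_sm_agent)"
#   "$agent" list-files /path/to/repo
#   "$agent" check-ignore /path/to/file
```

If nothing is found the function prints nothing, so it is worth checking that `$agent` is non-empty before calling it.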

what if I am in a github repo (cwd) in neovim but open up a buffer with a file from outside the repo?

Whenever a file is sent to the binary, the only .gitignore/.supermavenignore considered are the ones inside the repository of the file in question. If you have multiple buffers open they could potentially be following different .gitignore rules. The .env in your scenario would be uploaded if it isn't part of a git repository. In general files which are not part of a git repository are uploaded when they are edited, with no additional context included.
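To see which of the two cases a given file falls into, one rough check is to walk up from the file looking for a `.git` entry. The sketch below does that in plain shell. `repo_root_of` is a hypothetical name, and how sm-agent itself detects repositories is not described in this thread, so treat the result as an approximation.

```shell
#!/bin/sh
# Hypothetical helper: print the nearest ancestor directory of a file that
# contains a .git entry, or return non-zero if there is none.
# Assumption: "part of a git repository" means "has a .git entry in some
# ancestor directory"; sm-agent's real detection logic may differ.
repo_root_of() {
    # Resolve the file's directory to an absolute, symlink-free path.
    dir="$(cd "$(dirname "$1")" 2>/dev/null && pwd -P)" || return 1
    while :; do
        if [ -e "$dir/.git" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        # Reached the filesystem root without finding a repository.
        [ "$dir" = "/" ] && return 1
        dir="$(dirname "$dir")"
    done
}

# Usage:
#   if root="$(repo_root_of "$HOME/notes/todo.md")"; then
#       echo "inside repo: $root (its ignore files apply)"
#   else
#       echo "not in a repo: no ignore files would apply"
#   fi
```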

The lack of control for non-git files is unfortunate, and should have a robust solution. ignore_filetypes was not intended for this use, and until now wasn't meant to be a privacy related feature. Ideally we will have an allow/blocklist of some kind that does not make this sort of determination based on file type.

GitMurf commented 3 weeks ago

Thank you for answering all my questions. Exactly what I needed.

I think the biggest "risk" are the files outside of the git repo. Personal markdown notes, internal docs etc.

Is this something handled in the nvim plugin? If so I wonder if for the time being a super conservative approach of just prompting the user in nvim for any file outside of the git repository asking if they want it uploaded? Since typically these will just be one off files opened up ad-hoc.

Another option that would be nice is a config option to just blanket disable uploading any files outside the git repo (if that's possible).

ahmedelgabri commented 3 weeks ago

I think the biggest "risk" are the files outside of the git repo. Personal markdown notes, internal docs etc.

Wouldn't a single .supermavenignore in $HOME solve this?

GitMurf commented 3 weeks ago

@ahmedelgabri thanks for the response.

  1. The .env (is just a common example) or any other sensitive info is not always going to be from the same repo root that I am currently cwd at. Often times I am flipping between repos and have to open up common files that would not be under that particular git repo. Based on the response above, it is only covered if the files in the .gitignore are actually within that repo.

  2. On windows it is not common to have your files (like notes etc.) under your "HOME" (I put in quotes because we don't really have a HOME ;) .... it is usually something like USERPROFILE) ... documents / notes are often not under that "HOME" path. But even if they were, I don't know that supermaven is looking that far up the tree looking for a supermaven ignore?

Is there any official documentation on using supermavenignore?

leet0rz commented 3 weeks ago

Is there a way for supermaven to just not do this in the first place out of the box or does it have to have this behavior? No one wants their personal information leaked.

sm-victorw commented 3 weeks ago

@leet0rz Could you specify which behavior you are referring to? The uploading of non-repository files? Or the repository based indexing that the binary performs?

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started, or something similar to this. I'm not sure if that's what you're proposing

leet0rz commented 3 weeks ago

@leet0rz Could you specify which behavior you are referring to? The uploading of non-repository files? Or the repository based indexing that the binary performs?

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started, or something similar to this. I'm not sure if that's what you're proposing

I mean not entirely sure how this works but this does seem like a major privacy concern, as stated before obviously people will run this in all sorts of notes and would never want their personal information uploaded or leaked in any way and supermaven should not be uploading this sort of information in any way to anything ever. What I heard is that it uploads the entire buffer and I guess sources or creates information or "AI responses" or inputs that we can accept from that? If that is the case, is it possible to do this locally instead of uploading it (which is the privacy concern).

I hope I am doing an ok job explaining this and have actually understood what's going on?

GitMurf commented 3 weeks ago

@leet0rz the power comes from uploading. Most laptops are not powerful enough to do the type of processing it does and even if it could our laptops would be burning up high cpu/gpu/ram resources constantly. Also to be clear, this is how most of these AI code tools work including GitHub copilot. The difference is Supermaven is more powerful sending your entire repository to its models (more context). None of those things are the main problem. The main problem really is files that are not in your git repository but that you open in a buffer because those also are being sent up to the servers.

GitMurf commented 3 weeks ago

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started...

@sm-victorw I think this would be great as step 1. But I think the other important thing should be changing the default of any files that are not part of your current opened git repository should be opt-in instead of opt-out. By default files outside git repo are not sent to servers unless you white list them... preferably a glob / glob array, or even better a callback function we can configure to return true if we want a file sent to servers (with the file path as an input parameter to the cb function).

Thoughts?

leet0rz commented 3 weeks ago

@leet0rz the power comes from uploading. Most laptops are not powerful enough to do the type of processing it does and even if it could our laptops would be burning up high cpu/gpu/ram resources constantly. Also to be clear, this is how most of these AI code tools work including GitHub copilot. The difference is Supermaven is more powerful sending your entire repository to its models (more context). None of those things are the main problem. The main problem really is files that are not in your git repository but that you open in a buffer because those also are being sent up to the servers.

What about usage outside of github when you just use neovim to open personal files, which a lot of us do. Will that still not upload the entire buffer and cause a privacy concern? I mean I use neovim to open any file I want to edit outside of github related things too and if a file with sensitive information I open out of some text document and with supermaven being enabled by default will that not cause said privacy concern?

GitMurf commented 3 weeks ago

@leet0rz yes that is the concern we have been discussing in this thread. It is definitely a concern. I was just explaining why the idea of doing anything local just on your machine is not an option.

sm-victorw commented 3 weeks ago

@leet0rz Yes, both the pull requests mentioned earlier in this issue can help mitigate this issue, but as I mentioned earlier we are going to want a robust and clear approach for letting users specify which files they would like to exclude

leet0rz commented 3 weeks ago

@GitMurf @sm-victorw Cool thanks guys.

dzirtusss commented 2 weeks ago

Another side of the problem is that if I create some temporary file, I first have to update .gitignore and only then can start doing something.

I mean, normally it is the opposite: I work in the project's local directory, which is "safe", and only when committing do I think about what should be committed, what should be gitignored, and what should be deleted.

I mean now, if I create any temporary and/or scratch file with some probable secret inside the repo folder, even when nvim runs in a different window, e.g. as a script output (I usually do some script > 1.txt), it will be uploaded to supermaven. And supermaven will "like" that file because it is fresh.

Which is an even worse problem, because many tools "expect" to run from the project folder to pick up configuration.

Atm, I think I might do:

    # .supermavenignore
    *
    !*.js
    !*.jsx
    ...

This at least might prevent some surprises.
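One caveat with the allowlist idea, assuming .supermavenignore follows gitignore pattern semantics (the thread only says it takes globs): under gitignore rules a bare `*` also excludes directories, and a file cannot be re-included once its parent directory is excluded, so `!*.js` alone would only re-include files at the top level. The usual idiom adds `!*/` to keep directories traversable. The sketch below generates such a file. `write_allowlist_ignore` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write an allowlist-style ignore file.
# Usage: write_allowlist_ignore <target-file> <ext>...
# Assumption: the ignore file is interpreted with gitignore semantics.
write_allowlist_ignore() {
    target="$1"
    shift
    {
        printf '%s\n' '*'     # ignore everything by default
        printf '%s\n' '!*/'   # but do not exclude directories themselves
        for ext in "$@"; do
            printf '!*.%s\n' "$ext"   # re-include each allowed extension
        done
    } > "$target"
}

# Usage:
#   write_allowlist_ignore /path/to/repo/.supermavenignore js jsx
```

The result can be checked with the `./sm-agent check-ignore /path/to/file` command mentioned earlier in the thread, which is the only reliable way to confirm how the binary actually interprets the patterns.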

dzirtusss commented 2 weeks ago

As well what might be useful - a GLOBAL IGNORE, somewhere in ~/.supermaven. Which will be a system-wide set of rules followed by a binary despite if a file is in a git or not in a git repo. Maybe local supermavenignores should override it, maybe not.

dreson4 commented 18 hours ago

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

leet0rz commented 14 hours ago

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

For me the issue is having to add files to ignore, I don't want to do that. I want non-code files to be ignored by default. I don't want to keep track of and ignoring every file except for my code files, that should be default behavior if it's not.

sm-victorw commented 9 hours ago

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

Can you elaborate on what you mean it 'skips'? As in you get completions on files which are included in .gitignore? The intellij and neovim plugins are not responsible for deciding what is or isn't sent to the server, this is determined by the binary sm-agent which makes that determination based on the file path and any .gitignore it finds. Until somewhat recently all of these plugins used the same binary so the behavior shouldn't have been different

dzirtusss commented 9 hours ago

There is a way to guarantee that the binary only uses permitted files on macOS, via sandboxing. This is a native OS feature, thus highly secure, and only a couple of text files are needed.

How to do:

  1. Create a wrapper for the agent somewhere, e.g.:

    #!/bin/sh
    sandbox-exec -f /.../supermaven.sb /.../.supermaven/binary/v15/macosx-aarch64/sm-agent "$@"

  2. Create a policy:

    (version 1)
    (allow default)

    (deny file-read*)
    (allow file-read* (literal "/"))
    (allow file-read* (subpath "/System/Volumes/Preboot/Cryptexes/OS"))
    (allow file-read* (subpath "/dev"))
    (allow file-read* (subpath "/Library/Preferences"))
    (allow file-read* (subpath "/usr/share/icu"))
    (allow file-read* (subpath "/private/var/db/timezone"))
    (allow file-read* (subpath "/var"))

    (allow file-read* (subpath "/Users/sergey/.supermaven"))

    (allow file-read-metadata (subpath "/Users/sergey/projects"))

    (allow file-read* (regex #"/.git/"))
    (allow file-read* (regex #"/.gitignore$"))
    (allow file-read* (regex #"/.supermavenignore$"))

    (allow file-read* (regex #".rb"))
    (allow file-read* (regex #".lua"))

    Here the first group is needed to start the binary correctly (including all shared system libs); the next rules let it read its own folder, then read the ignore files, and finally restrict it to Ruby/Lua files.

  3. Fork the plugin and replace the binary with the wrapper (or, if you don't want to fork, use other ways, e.g. links).

---

This ^^^ is a fully working template, which I wanted to improve, but I don't have time atm. Thus I decided to post it AS IS so that somebody may pick it up. When/if I have more time to work on this, I will post an updated version.

The beauty of this approach is that compliance is guaranteed by OS sandboxing (at least for the binary); the plugin is another story, as it may send whatever directly.

The system libs restrictions should definitely be fine-tuned more, but overall I don't care that much about that part, as this is the "normal binary way" of doing things and doesn't relate much to personal sensitive info.