gpt-engineer-org / gpt-engineer

Platform to experiment with the AI Software Engineer. Terminal based. NOTE: Very different from https://gptengineer.app

Privacy concern: user data is being sneakily collected #415

Closed: homedirectory closed this issue 1 year ago

homedirectory commented 1 year ago

GPT Engineer collects user data, namely user prompts, among other metadata. This is not mentioned in the README, nor is it mentioned that you can opt out by setting the COLLECT_LEARNINGS_OPT_OUT environment variable. The ToS link in the README is broken. Therefore, the only way for users to become aware that their data is being sneakily collected is to read the code.

I consider this a violation of users' privacy, and propose that one of the following be implemented as soon as possible:

https://github.com/AntonOsika/gpt-engineer/blob/0596b07a39c2c99c46509c17660f5c8aef4b2114/gpt_engineer/collect.py#L25

barshag commented 1 year ago

What is the easy way to disable it (besides commenting out those lines)?

homedirectory commented 1 year ago

@barshag

you can opt out by setting the COLLECT_LEARNINGS_OPT_OUT environment variable
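
For context, a minimal sketch of the kind of opt-out check the linked collect.py performs, assuming any non-empty value of the variable counts as an opt-out; the exact values honoured may differ between versions, so verify against the source you are actually running:

```python
import os

# Hedged sketch, not the project's exact code: treat any non-empty value of
# COLLECT_LEARNINGS_OPT_OUT as an opt-out. Check your version of
# gpt_engineer/collect.py for the values it actually honours.
def collection_opted_out() -> bool:
    return bool(os.environ.get("COLLECT_LEARNINGS_OPT_OUT"))
```

In practice, you would set the variable in your shell before launching the tool, e.g. `COLLECT_LEARNINGS_OPT_OUT=true`.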

Gamekiller48 commented 1 year ago

Created a pull request: https://github.com/AntonOsika/gpt-engineer/pull/423

AntonOsika commented 1 year ago

Thanks for raising this concern!

Especially the broken link to terms of use. Fixing asap.

The terms are very short, so they should be easy to read.

As for prompts being recorded: this was discussed publicly in the Discord #general channel over the weekend. Since OpenAI is doing this already, we did not see an issue.

Could anyone steelman the argument here for why a 20-page OpenAI ToS stating that data is collected is fine, but the terms of use here are not explicit enough?

Gamekiller48 commented 1 year ago

  1. OpenAI is a service; gpt-engineer is a piece of software. That services collect data is somewhat expected: people know their data will be collected when they use a service, even without reading the ToS. The same does not apply to software. People assume that software (especially open-source software) does not collect data unless it is immediately necessary for the software's functionality.
  2. Some people will not use OpenAI with this repo, but will connect it to offline models, for example running in the oobabooga webui via its OpenedAI API endpoint. If you set the project up like this, the reasonable expectation is a privacy-first offline instance. Having this setup then secretly send prompts to some feedback collector would be unexpected behaviour.
  3. "[...] why a 20 page openai ToS stating that data is collected is fine but the terms of use here are not explicit enough?" -> there are no ToS for this repo, or they aren't linked in the README.
  4. GDPR compliance: not a lawyer, but for EU users you have to keep in mind the legally binding principle of data avoidance. If the data isn't necessary to run the service, you're not allowed to collect it without getting people to agree to it. Opt-out isn't enough; it must be explicitly opt-in, as with website cookies, for example. OpenAI can argue that their data collection is necessary for their service, and they also explicitly receive permission from people setting up accounts on their site. The same does not apply to this repo.

cheekybastard commented 1 year ago

Stealth addition of MITM spyware would constitute an issue under the code of conduct, and in any professional setting where the data handled is likely commercial-in-confidence: trade secrets, IP, internal processes, etc.

https://github.com/AntonOsika/gpt-engineer/blob/main/.github/CODE_OF_CONDUCT.md

"Examples of unacceptable behavior include: ... Other conduct which could reasonably be considered inappropriate in a professional setting"

AntonOsika commented 1 year ago

I appreciate the open discussion about this.

For the bulk of the issue, I'm definitely to blame here:

We wrote terms of use explaining the data collection, and linked to them from the README when performing the telemetry update.

I rushed getting this out, and did not add the terms of use to version control.

We do not have any automated tests for broken links, as we have for code, and it was merged before I caught it.
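
As an illustration only (a hedged sketch, not something the repo ships), a small script along these lines, run in CI, would catch broken README links before merge:

```python
import re
import sys
import urllib.request

# Matches http(s) URLs, stopping at whitespace and common punctuation.
URL_RE = re.compile(r"https?://[^\s)\"'>\]]+")

def broken_links(path: str = "README.md") -> list[str]:
    """Return the URLs in `path` that fail to load."""
    with open(path, encoding="utf-8") as f:
        urls = sorted(set(URL_RE.findall(f.read())))
    broken = []
    for url in urls:
        req = urllib.request.Request(url, headers={"User-Agent": "link-check"})
        try:
            with urllib.request.urlopen(req, timeout=10):
                pass  # 2xx/3xx: the link resolves
        except Exception as exc:  # HTTPError (404 etc.) or network failure
            broken.append(f"{url} ({exc})")
    return broken

if __name__ == "__main__":
    bad = broken_links()
    for entry in bad:
        print(f"BROKEN: {entry}")
    sys.exit(1 if bad else 0)
```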

Much of the negative reaction, apart from my huge blunder with the broken link (the reaction to that is warranted), stands out to me as overly polarised against what is pretty standard product analytics. Two main reasons:

  1. A core part of the application is that the user is explicitly prompted to share whether the code worked or not, for "learning" purposes. Given that this is the user flow, it should be expected that what is shared is stored.

  2. A lot of care is put into not sending or storing any private data (IP, user agent, or similar). Very happy to have this fact audited.

I am committed to doing what is best for the community here. That means striking the right balance between not invading privacy and building a useful tool. Without feedback, by default, on how well the tool works for users, it is very difficult to do a good job of improving it.

My experience before this issue was created was that very few people are protective of sharing their prompts with external services (consider all the GPT Chrome extensions out there, where I'm sure many also share IP, fingerprint, etc.).

Conclusions

Appreciate your contribution @Gamekiller48 and everyone. I know everyone here wants what is best for the users.

I will merge your PR @Gamekiller48 (the opt-out -> opt-in PR).

If someone could, in addition, make a PR so that the "CLI review flow" asks "is it OK to send data?", that would be great. (I'm super busy at the moment and not able to write code; I would look into it myself otherwise.)
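
A hedged sketch of what that consent step could look like; the function name and the call site are illustrative, not the project's actual API:

```python
def ask_collection_consent() -> bool:
    """Ask once, in the CLI review flow, before any learnings are sent."""
    answer = input(
        "Is it OK to send your prompts and review feedback "
        "to help improve gpt-engineer? [y/N] "
    )
    return answer.strip().lower() in ("y", "yes")

# The review step would then only invoke the collector when consent is given,
# e.g.: if ask_collection_consent(): collect_learnings(...)
```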

Furthermore, as a follow-up here, I will ensure there is further review of what the right policy is for an application like this, and that an informed decision is made with input from experts and from those with opinions on different sides.

I will post in this issue again with the final conclusions, including whether we decide to change our stance on data collection. That way, everyone subscribed to this issue will be notified.

Personal note: Some comments, such as "stealth addition" (the data collection updates were announced) and "MITM spyware", weigh on me and take much of the joy out of trying to measure the project's capability and continuously improve it.

I ask the community for support in building something useful and open source, and for constructive contributions, such as PRs that address concerns and improve the tool for everyone (some in this thread are role models here).

cmsimike commented 1 year ago

When using the OpenAI models through the API (as this project does), OpenAI explicitly does not collect user data to further train their models: https://openai.com/policies/api-data-usage-policies

OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose.

Though they do retain data for 30 days in case of abuse and misuse, which is completely different from retaining data to improve the service:

Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

zerofill commented 1 year ago

I don't know if I would call it sneaky. I mean, this repo has had so many changes in the past 30 days that it is easy for a broken link to get overlooked.

AntonOsika commented 1 year ago

#471 is merged, to explicitly ask for consent.