Open mbutterick opened 2 years ago
(I’d also be willing to consider a GitHub alternative like GitLab, though they also seem to be heading down the AI rabbit hole.)
I know nothing about SourceHut, but it seems quite popular as an alternative to Git{Hub/Lab}.
• For those who have firsthand experience with running a self-hosted Git server — pros? Cons? Is there a solution for CI?
If you have a server that is reliably up, then users probably
won't notice much difference at the git level. We would give
up Github-specific features, but I try not to use those anyway.
I cannot comment on the CI issue with any generality.
• For everyone — what would be the negative impact on you if pollen (and all my other software) were moved to this hypothetical new server? I would not be changing the license or anything else. Just the server where the canonical source is hosted.
There would be no negative impact on me.
• If that happened, I would be open to leaving pollen-users here at GitHub. Though I don’t have a strong feeling either way.
I don’t have a strong feeling, either. I do like having a mailing
list.
I completely understand your concerns. As a university prof, I
am not at all happy with Copilot as a free service available to
students. At least now, pre-Copilot, students have to do the work
of finding someone else's code to copy and modify.
---- Eugene
As for why. (Not that it matters.) I was willing to reserve judgment after the Microsoft acquisition of GitHub. Since then I have found myself holding my nose at most of the so-called improvements.
This week I tried Copilot, which is the most putrid yet — you install a keylogger Visual Studio plugin on your machine and get terrible code in return. It seems inevitable that in the same way social-media sites rapidly evolved into funnels for personal data to be sold to advertisers, the main business of GitHub will be collecting code for their AI training and other collateral purposes.
I also think that Copilot is a massive violation of the open-source licenses I use. I further question whether I can even meaningfully comply with the open-source licenses of underlying software I incorporate while hosting code on GitHub, because I’m feeding that code into the maw of something that will violate the license. (No, I don’t literally expect to be sued, but the handwaving around these issues is far from encouraging.)
I second the SourceHut recommendation, though I don't (yet) use it myself. It's run by Drew Devault, who agrees with you about Github and GitLab, and it includes CI.
“Drew Devault, who agrees with you about Github and GitLab”
I see that he wrote this week about Copilot as a form of “open source laundering”. I generally agree with his argument, though I think his suggested solutions are unworkable:
“Allow GitHub users and repositories to opt-out of being incorporated into the model. Better, allow them to opt-in. Do not tie this flag into unrelated projects like Software Heritage and the Internet Archive.”
This idea is similar to the GDPR’s “right to be forgotten”. But it’s impossible to retroactively remove code that has already been incorporated into the model without retraining it from scratch. Also, I expect there would be a negative-selection effect where owners of better code would be more likely to opt out, thereby making the model dumber. (Though it wouldn’t surprise me if private enterprise repos have been exempted from the model so far, and opting out will be a service sold to them later on.)
“Track the software licenses which are incorporated into the model and inform users of their obligations with respect to those licenses.”
Impossible. First, the material emitted by the model comes from different places and there’s no guarantee that the licenses are legally compatible. Second, this would put Microsoft in the position of giving legal advice to zillions of users. They are already passing that buck (about which more below).
“Remove copyleft code from the model entirely, unless you want to make the model and its support code free software as well.”
Impossible. Without copyleft code, the model would starve to death.
“Consider compensating the copyright owners of free software projects incorporated into the model with a margin from the Copilot usage fees, in exchange for a license permitting this use.”
I’m sure Microsoft’s view is that the owners of the projects are being compensated already with all the goodies on GitHub that they don’t pay market rates for. The “license permitting this use” is already baked into the GitHub terms of service.
Even assuming that training an AI with certain software code counts as fair use under US copyright (as GitHub’s former CEO has claimed), that’s a long way from claiming that every output of that system also qualifies as fair use. Microsoft has not made this claim — and will not, because they can’t guarantee the behavior of a probabilistic system — so they explicitly pass this risk onto Copilot users:
“We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn't write yourself. These precautions include rigorous testing, IP scanning …”
Therefore, the good news (?) — I expect that Copilot will be banned in most companies due to the possibility of some junior engineer nonchalantly embedding IP violations in the enterprise codebase.
In the meantime, Microsoft’s fair-use argument creates a bigger problem. Devault suggests a nuclear option: “don’t use GitHub and your code will not make it into the model”. That’s also what I’m proposing here. But if AI training qualifies as fair use for code that appears on GitHub, it qualifies as fair use for code that appears anywhere. Just as Google indexes all the web pages, GitHub could train its AI on the code displayed on GitLab, Gogs, Gitea, etc. The cynical endpoint of this line of thinking is that one might as well leave code on GitHub because it’s going to be absorbed anyhow.
Ironically, the other likely outcome of Copilot is a surge of Copilot-generated code being released by the world’s laziest programmers. This tsunami of idiocy will crash again on GitHub’s shores, where it will be reabsorbed into the model, creating a process heretofore unknown in computer science: recursive stupidity.
What an accomplishment.
[I am not anyone’s lawyer and no one should take this comment as legal advice.]
Bradley Kuhn of Software Freedom Conservancy on the ramifications of AI for open-source software. Bradley reaches several of the same points, though with more factual & legal detail.
I cannot speak knowledgeably about point 1, but for 2 and 3:
I use sourcehut for personal projects and have been quite happy with it. It is unobtrusive and seems to have all the features I need without any chaff. It offers a mailing list service too, which works nicely from email, though it is kinda minimal as a web forum.
Thank you for the suggestions. I have cloned Pollen to Sourcehut and changed the canonical repo on the Racket package server:
https://git.sr.ht/~mbutterick/pollen
Apparently this would become the new mailing list:
https://lists.sr.ht/~mbutterick/pollen-discuss
I invite Sourcehut fans to inspect this repo & either flag any mistakes or make suggestions for improvement before the switch is thrown.
After that, I suppose the right move is to put the GitHub Pollen repo into “archived” mode.
My further thoughts on the legality of Copilot and the (perhaps) futility of avoiding its maw, though ethics count too.
Sourcehut uses Git over HTTP. Racket added support for these URLs in version 8.1 with a private git+https prefix. Regardless of the wisdom of this workaround, because Pollen (and my other Racket packages) support versions of Racket before 8.1, AFAICT I need a source-hosting service that supports traditional .git URLs. As it stands, users of versions of Racket before 8.1 will not be able to install Pollen.
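To make the compatibility problem concrete, here is a rough Python sketch of the rule as described above. It is only an illustration under stated assumptions, not Racket's actual package-source logic: the function name `recognized_as_git` and the `supports_git_plus_prefix` flag are hypothetical, and the `example.com` URL is a placeholder.

```python
def recognized_as_git(url: str, supports_git_plus_prefix: bool) -> bool:
    """Simplified model (an assumption, not Racket's real code) of whether
    a package source URL would be treated as a Git repository.

    supports_git_plus_prefix stands in for "Racket 8.1 or later",
    per the description in this thread.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        return False  # not a URL with a scheme

    # The git+http / git+https prefix only helps on versions that know it.
    if scheme in ("git+http", "git+https"):
        return supports_git_plus_prefix

    # Traditional form: a plain HTTP(S) URL whose path ends in ".git".
    if scheme in ("http", "https"):
        path = rest.split("#", 1)[0].split("?", 1)[0]
        return path.endswith(".git")

    return False


# A Sourcehut-style URL has no ".git" suffix, so an older Racket
# would not see it as a Git repo in this model.
print(recognized_as_git("https://git.sr.ht/~mbutterick/pollen", False))
print(recognized_as_git("git+https://git.sr.ht/~mbutterick/pollen", True))
print(recognized_as_git("https://example.com/some/repo.git", False))
```

In this model only the last form works on both old and new versions, which is why a host that serves traditional .git URLs matters here.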
In the meantime I have reverted the package server to use the GitHub repo (see #132)
I have cloned pollen to Codeberg and changed the canonical repo on the Racket package server:
Swift and Racket (among others) use Discourse. I’m thinking of putting up a self-hosted instance as a replacement for pollen-users. Pros/cons from those who have fiddled with it?
(So far Codeberg seems to be cooperating with the Racket package server, so I plan to stick with it. But it seems wise to permanently divorce the talking-about-software functionality from the Git hosting.)
2 questions:
or
Does Codeberg provide a feature similar to the one we use here on GitHub for pollen-users?
Yes, Codeberg also has an “issues” feature. So in principle we could make a pollen-users over there. (But all the current messages would be left behind.) I suppose increasingly I lean toward separating the two tasks. It’s easier to relocate a Git repo than a discussion system. (Pollen originally had a mailing list hosted by Google, which was shut down abruptly, which is how we ended up here.)
Why not use the mailing list at https://lists.sr.ht/~mbutterick/pollen-discuss?
I considered that. If I’m going to have the discussion list hosted elsewhere, I’d rather a) host it myself using b) an open-source system with a track record, and c) do it in a way that allows me to consolidate other discussions (related to Quad, Beautiful Racket, etc.) because all those projects will be leaving GitHub too.
I’ve put up a Discourse server at https://forums.matthewbutterick.com with an area for Pollen discussion. I invite members of pollen-users to inspect this server. Absent any objections or unforeseen wrinkles, I will put pollen-users into read-only mode by the end of July 2022 and we will move the party to the new server.
Codeberg is just a hosted Gitea instance. So I thought: why not just put up my own Gitea server, if I could get it working in 30 min or less. I could and I did.
https://git.matthewbutterick.com/mbutterick/pollen
It would be possible to migrate pollen-users to this server, sort of. The thread messages would be migrated to a new repo. But they wouldn’t be attributed to users on the new server. Still, because pollen-users was always something of an off-label use of GitHub, there isn’t much reason to persist with that idiosyncrasy, now that there is a Discourse server.