stop committing binary assets (images) to this repository

slimsag commented 5 years ago

We are committing images to this repository:

https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:png

We will one day regret this (maybe not tomorrow, but in a few years) and have to rewrite our git history.

A better and still simple option would be hosting these all somewhere else, like a GCS bucket

keegancsmith commented 5 years ago

images are rather small. Right now our .git is 187M. Saying that large assets can do a lot of damage, but png/jpg are normally fine.

keegancsmith commented 5 years ago

Here are our biggest objects in git:

8717b79ae0e7  799KiB vendor/golang.org/x/text/encoding/japanese/tables.go
d909e38e5e05  801KiB vendor/golang.org/x/text/encoding/traditionalchinese/tables.go
415f52a11160  858KiB vendor/golang.org/x/text/encoding/simplifiedchinese/tables.go
7b218e660082  871KiB doc/user/code_intelligence/img/CodeReview.gif
cfbaa1529c19  1.1MiB cmd/frontend/internal/app/assets/img/Homepage/homepage-mobile.svg
d2375cdaa857  1.5MiB vendor/github.com/ericchiang/k8s/apis/core/v1/generated.pb.go
86e02b492d53  4.4MiB doc/integration/img/BrowserExtension.gif
fa3194e86b69  4.9MiB doc/integration/img/SearchShortcut.gif
3f381f42e428   42MiB cmd/frontend/internal/app/assets/distassets_vfsdata.go

Looks like someone once accidentally committed the output of go-bindata to master and pushed.

sqs commented 5 years ago

The fix here (I think) is to add a script that checks for large assets and fails CI if any are found.

slimsag commented 5 years ago

I do not think the problem is "large assets" in specific, but rather "large assets that change often".

The Go files (excluding generated vfsdata ones) are not going to change often, I would expect maybe once a year, maybe less.

But I would not be surprised if we update these gifs/pngs every few months, and we're going to pay the cost of storing that diff forever.

sqs commented 5 years ago

Cool. Please propose a solution. Like the following?

Remove all current PNG and similar files, move them to object storage somewhere
Add a check in CI to prevent people from readding any such files

BTW most(/all?) of the vfsdata files actually should not be committed. I have a PR that removes a big one at https://github.com/sourcegraph/sourcegraph/pull/2237/commits/0c46930bcfaed4d56dd16c6aa49222dbb83e6225.

slimsag commented 5 years ago

Yes, that is exactly my proposal (I should've reiterated that, my apologies)

beyang commented 5 years ago

FYI we'd like to start adding more images into the frontend (e.g., https://app.zeplin.io/project/5bfee4b96dc22f705c06d0a7/screen/5c5b5202d9f763354701d84b).

keegancsmith commented 5 years ago

I agree large assets are a problem. But things that are at most a few megs (some images) surely isn't that bad given the convenience of tracking it in source control? As it stands, they can't be too big or it will result in a poor user experience for loading a page. So is this issue about large assets (ie we set some upper limit that requires whitelisting to allow merging), or just binary assets in general?

slimsag commented 5 years ago

But things that are at most a few megs (some images) surely isn't that bad given the convenience of tracking it in source control?

My worry is our repo ends up being several GB in a few years and someone says "you know what? we screwed up back then and should filter all of those our of our git history (and now we can't really git bisect, oh well)."

Avoiding this by using a storage bucket online seems like a quick and easy way to avoid this potential outcome in the next few years.

I am fine with any option which prevents large binary files that change often (usually our image assets):

Forbid anything with an image extension.
Forbid any file > 16kb.
- This solution has the negative side effect of forbidding one-off large files which do not change often, so I prefer it less.

Just from some back-of-the-napkin math:

Every time we add/update/change an image in our repository we're roughly committing the equivalent of 25-120 Go/TS code files.
If we change 10 images a month on average we're talking about 2-10MB/month, or 12-120MB a year. (of course, predicting the actual number of images is hard because we now have docs, browser extension, and other code in this repo which are likely to require shipping more images regularly alongside our code.)

francisschmaltz commented 5 years ago

To chime in: I try to rely on SVGs for images so I can add “images”as dom elements but I’m some cases that’s not possible. The onboarding image is about 60KB. Not nothing, but something that might be beneficial to store in git as it doesn’t change.

Marketing images are a different story. As we move some of the marketing and docs pages into sourcegraph it might be a good idea to just link to those from about.sourcegraph.com for the short term.

tsenart commented 5 years ago

Anyone tried https://git-lfs.github.com/ before?

slimsag commented 5 years ago

I have, it's good when you have a substantial number of large assets to track. I am unsure if it would be better than a simple script that downloads assets from gcs in our case. It does require installing the git extension and running a Git LFS server somewhere, which is a large burden for small cases like ours I think.

On Fri, Feb 22, 2019, 2:02 AM Tomás Senart notifications@github.com wrote:

Anyone tried https://git-lfs.github.com/ before?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/sourcegraph/sourcegraph/issues/1952#issuecomment-466325056, or mute the thread https://github.com/notifications/unsubscribe-auth/ADBrOINfYIKJ77IODDnVBcBfS4cp20VLks5vP7IggaJpZM4aHQsr .

tsenart commented 5 years ago

and running a Git LFS server somewhere, which is a large burden for small cases like ours I think.

I thought that Github would provide this for us.

sqs commented 5 years ago

@ryan-blunden You committed a lot of large GIFs in acf095236377ece4c0092847ce891c3f5dace56a and other commits to doc/integration/img. We want to avoid doing this for frequently changing large binary assets because it will make our repository very large.

Images under 100 KB are OK per the doc guide (https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/doc/dev/documentation/style_guide.md#images). Also see that link for instructions on how to upload large assets to Google Cloud Storage to avoid committing them.

ryan-blunden commented 5 years ago

I've removed those gifs in #3076. Should we rewrite git history so they are removed from the repo altogether?

ryan-blunden commented 5 years ago

@slimsag Given @sqs' doc styleguide addition for large images, can this now be closed?

slimsag commented 5 years ago

No.

It's awesome that we are now enforcing it culturally :) But we need something to enforce it like a git push hook in the long haul

slimsag commented 5 years ago

(also, yeah, rewriting git history is bad so no on that)

mrnugget commented 4 years ago

I've just been pointed to this ticket here, because I've added images to the docs in https://github.com/sourcegraph/sourcegraph/pull/9961.

Since the link to "doc styleguide addition for large images" doesn't work anymore and I cannot find anything about images in our current style guide I did the following:

Create a docs/images/ folder in our sourcegraph-assets bucket.
Move the images from #9961 there and link to them from markdown.

Before that I've also ran the images through ImageOptim.app.

Now, here are a few thoughts on this topic:

We have 48 images (*.jpg, *.png, *.gif, *.svg, without *.svg it's 28) in our doc folder. That's not much. But I do think that we should probably add more in the future, to improve our docs.
Together they take up 7.3MB (find doc -iname "*.png" -or -iname ".jpeg" -or -iname "*.gif" -or -iname "*.svg" | xargs du -ch).
They don't change that often.
Committing them to the repository has the benefit of having the docs right next to the images that are included. In my experience, that reduces the risk of having dead links etc.
The other benefit is that by committing them it's super easy to package the correct images with the correct version of the docs. With a public bucket we need to do versioning ourselves, probably by adding a <major version>.<minor version>.<patch version> to the path so we don't run into cache problems.

I do get why we want to avoid committing binary files, but considering points 1-3 I don't think we're even close to "lots of large, often-changing binary files" territory and when looking at the "free benefits" we get from committing the assets (points 4-5) I'm not sure the tradeoff is worht it (yet).

keegancsmith commented 4 years ago

Together they take up 7.3MB (find doc -iname "*.png" -or -iname ".jpeg" -or -iname "*.gif" -or -iname "*.svg" | xargs du -ch).

This is acceptable. The issue comes in when people start committing large files often. We don't really have anything to prevent that other than people knowing they shouldn't do it. This won't scale as our organisation grows.

I agree though, it is far more convenient to have everything together. But maybe there is some very simple tooling that exists so we can use externally managed assets easily? If it is very low friction that seems reasonable.

tsenart commented 4 years ago

Can’t we use Github’s LFS feature for this?

Sent via Superhuman iOS ( https://sprh.mn/?vip=tomas@sourcegraph.com )

On Mon, Apr 20 2020 at 10:04 AM, Keegan Carruthers-Smith < notifications@github.com > wrote:

Together they take up 7.3MB (find doc -iname "*.png" -or -iname ".jpeg" -or -iname "*.gif" -or -iname "*.svg" | xargs du -ch).

This is acceptable. The issue comes in when people start committing large files often. We don't really have anything to prevent that other than people knowing they shouldn't do it. This won't scale as our organisation grows.

I agree though, it is far more convenient to have everything together. But maybe there is some very simple tooling that exists so we can use externally managed assets easily? If it is very low friction that seems reasonable.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub ( https://github.com/sourcegraph/sourcegraph/issues/1952#issuecomment-616381140 ) , or unsubscribe ( https://github.com/notifications/unsubscribe-auth/AAAQPD5FNOZPNHFVGYHILQTRNP6ZBANCNFSM4GQ5BMVQ ).

unknwon commented 4 years ago

5. The other benefit is that by committing them it's super easy to package the correct images with the correct version of the docs. With a public bucket we need to do versioning ourselves, probably by adding a <major version>.<minor version>.<patch version> to the path so we don't run into cache problems.

Totally agree with this point, and also vote for Git LFS, we could start using GitHub's LFS infrastructure and it's not a lock-in as LFS backend can be arbitrary HTTP server that speaks LFS protocol. Migration also possible as one-off effort AFAIK, without changing the Git history, though never actually did (but I recently implemented a LFS server).

The one little drawback I can think of is, our docsite needs to be updated to work with cloning Git repository, downloading ZIP archive will no longer work (how it works today).

sourcegraph / sourcegraph-public-snapshot

stop committing binary assets (images) to this repository #1952