skishore / makemeahanzi

Free, open-source Chinese character data
https://www.skishore.me/makemeahanzi/

Remove svgs from source control and create releases instead #50

Open T-vK opened 5 years ago

T-vK commented 5 years ago

This Git repository is enormous because all the SVG files are stored in it, so a simple git clone takes a very long time. Committing binaries and similar generated files is generally considered bad practice. My suggestion: remove the SVG files from Git and upload them to the releases section instead.

skishore commented 5 years ago

I'm a bit confused about that release section. The releases on that page are just tags on certain commits on the git repository, so it sounds like that would still involve checking in the released values. Where would the releases go instead?

Also, I think the clone performance issue is not due to the released code but due to my checking in full database dumps on the tool branch. Commits on the master branch have large absolute file sizes but very small diffs - for instance, graphics.txt has one line per character, so the deltas between different commits should be very small. The database binaries on the tool branch don't have this nice property.
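
To make the delta argument concrete, here is a minimal sketch in JavaScript (Node). The sample records are made up and only imitate a one-record-per-line layout, and `changedLines` is a hypothetical helper, not something in the repository.

```javascript
// Return the lines of newText that do not appear anywhere in oldText.
// With one record per line, this approximates what a delta between the
// two versions has to carry.
function changedLines(oldText, newText) {
  const oldLines = new Set(oldText.split('\n'));
  return newText.split('\n').filter((line) => !oldLines.has(line));
}

// Made-up sample data: three records, one per line.
const before = [
  '{"character":"一","strokes":["M 0 0 L 10 0"]}',
  '{"character":"二","strokes":["M 0 0 L 10 0","M 0 5 L 10 5"]}',
  '{"character":"三","strokes":["M 0 0 L 10 0"]}',
].join('\n');

// Fix a single record; every other line stays byte-for-byte identical.
const after = before.replace('"M 0 5 L 10 5"', '"M 0 6 L 10 6"');

// Print how many lines differ between the two versions.
console.log(changedLines(before, after).length);
```

A binary dump has no such line structure, so a small logical change can alter large parts of the file.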

There's another step to this piece, which is to automate the creation of the release from the tool branch. Right now, here's the process I use:

  1. Backup and commit the character data: Meteor.call('backup');
  2. Create the dictionary.txt and graphics.txt files: Meteor.call('export');
  3. Create the SVG files: Meteor.call('exportSVGs');
  4. Copy the files over to the master branch.
  5. (With your still graphics:) regenerate the still SVGs.
  6. Make a commit on master.
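
The steps above could be wrapped in a small runner along these lines. This is only a sketch: the three Meteor method names come from the list, but how they get invoked from a script, where the files live, and how the still SVGs are regenerated are all unknown here, so those actions are injected by the caller. `runRelease` and the action names are hypothetical.

```javascript
// Run the release steps in order. An exception from any step aborts the
// run, so later steps never execute after a failure.
// `actions` is supplied by the caller (hypothetical interface).
function runRelease(actions, log = console.log) {
  const steps = [
    ['back up character data', () => actions.callMeteor('backup')],
    ['export dictionary.txt and graphics.txt', () => actions.callMeteor('export')],
    ['export SVG files', () => actions.callMeteor('exportSVGs')],
    ['copy files to master', () => actions.copyToMaster()],
    ['regenerate still SVGs', () => actions.regenerateStills()],
    ['commit on master', () => actions.commit()],
  ];
  const done = [];
  for (const [name, run] of steps) {
    log(`release: ${name}`);
    run();
    done.push(name);
  }
  return done;
}

// Dry run: each action only records what it would have done.
const planned = [];
const dryRun = {
  callMeteor: (method) => planned.push(`Meteor.call('${method}')`),
  copyToMaster: () => planned.push('copy files to master'),
  regenerateStills: () => planned.push('regenerate still SVGs'),
  commit: () => planned.push('git commit on master'),
};
runRelease(dryRun, () => {});
console.log(planned.join('\n'));
```

Keeping the actions injectable means the same runner works for a dry run and for the real release once each action is filled in.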

T-vK commented 5 years ago

You can upload binaries when creating a release. Take a look at this article: https://blog.github.com/2013-07-02-release-your-software/

My advice: Only put source code in the Git repo, set up a CI service like Travis or GitLab CI, let it automatically run the scripts that generate the SVGs and other outputs, and then let it create a new release that includes the generated files.

Storing SVGs might not be as bad as storing actual binary files, but every time these SVGs are modified, the repo grows substantially: instead of committing one change to a script, we commit over 9000 changes. This will get worse over time, so the sooner we stop, the better. Why do you store database dumps in the tool branch anyway? If it's really necessary, we should store them in a non-binary format. Or we could store them completely outside of the repo and write a script that downloads them when they're needed.

I also don't see why you should have different branches for the tools and the master. That seems like a very unusual use case. Branches are typically used for code that is not yet ready to go into the master branch, and for GitHub Pages.

If you're interested in setting up a CI service, I can help you out with that. I have a lot of experience with GitLab CI.

skishore commented 5 years ago

Alright, this situation is definitely getting worse as I make more fixes. Something will have to be done. The repo is now around 1.6 GB on my computer, mainly due to the checked-in database dumps...

On the other hand, I do think it's worth keeping the database state in the repository. After all, it's the main output of the project - I've spent much longer actually using the tool than writing it. If the data is formatted correctly it also diffs very cleanly, which makes it easy to see each data change.

What do you think of burning "glyphs.bson" out of the repo history and replacing it with a "glyphs.txt" file containing the database rows, with one line per row? That will diff nicely and so it won't take up much space, either - just one copy of the main file plus a lot of small deltas.
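
A minimal sketch of that export, assuming the database rows have already been loaded into plain JSON-compatible objects (decoding the BSON dump is out of scope here). The key field `character` and both helper names are assumptions for illustration.

```javascript
// Serialize a JSON-compatible value with object keys in sorted order, so
// the same row always produces exactly the same line.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const parts = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${parts.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Emit one line per row, ordered by a key field, so an edit to one row
// touches one line and rows never move around between exports.
function rowsToLines(rows, keyField) {
  const sorted = [...rows].sort((a, b) => {
    if (a[keyField] < b[keyField]) return -1;
    if (a[keyField] > b[keyField]) return 1;
    return 0;
  });
  return sorted.map(stableStringify).join('\n') + '\n';
}

// Made-up rows with differing key order and row order.
const sampleRows = [
  { strokes: [1], character: '二' },
  { character: '一', strokes: [] },
];
process.stdout.write(rowsToLines(sampleRows, 'character'));
```

Sorting both the keys and the rows matters: without it, two exports of identical data could differ on every line and the diffs would stop being small.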

Regarding CI, thank you for your offer, but I want to avoid spending time building out tooling that won't end up saving that much time. I think a minimal release automation script might provide 90% of the benefits for much less work. (I am not at all sure about that, because I don't know what setting up custom CI entails - everywhere I've ever worked had existing CI that worked well.)

T-vK commented 5 years ago

> What do you think of burning "glyphs.bson" out of the repo history and replacing it with a "glyphs.txt" file containing the database rows, with one line per row?

Yes, that sounds like a good idea. CSV might be a good file format for that. You would store one table per CSV file in that case.
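
For illustration, a CSV export of one table might look like the sketch below. The column names and sample row are made up, and nested values such as arrays are simply JSON-encoded into a single field, which is one possible choice rather than anything the project prescribes. Quoting follows the usual CSV convention of doubling embedded double quotes; lines end in `\n` here to keep Git diffs simple.

```javascript
// Quote a field when it contains a comma, a double quote, or a line break.
function csvField(value) {
  if (value === null || value === undefined) return '';
  const text = typeof value === 'string' ? value : JSON.stringify(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Write a header line followed by one line per row, in the given column order.
function toCsv(rows, columns) {
  const header = columns.map(csvField).join(',');
  const body = rows.map((row) => columns.map((c) => csvField(row[c])).join(','));
  return [header, ...body].join('\n') + '\n';
}

// Made-up sample table with a plain field, a field containing a comma,
// and a nested array.
const sampleTable = [{ character: '一', definition: 'one, a', strokes: [1, 2] }];
process.stdout.write(toCsv(sampleTable, ['character', 'definition', 'strokes']));
```

If most columns hold nested data, a JSON-object-per-line file may diff just as well with less escaping.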