spdx / tools

SPDX Tools
Apache License 2.0

spdx-tools repository is 119MiB #193

Closed vlsi closed 5 years ago

vlsi commented 5 years ago

It makes sense to remove jars from the repository so it is simpler to use.

The largest jars are (blob id, size in bytes, path):

0b100e14b830 1203860 lib/xercesImpl-2.7.1.jar
d107c0f3b0cd 1501575 lib/guava-10.0.1.jar
398ef942cbf1 1506473 lib/jena-2.6.3-tests.jar
9ff92b021c77 1516415 lib/poi-3.5-FINAL-20090928.jar
fe04fd66e3da 1680523 lib/arq-2.8.4.jar
a21fe3dd3729 1765707 lib-source/commons-lang3-3.1-javadoc.jar
edc0ee59b8b2 1820323 lib/poi-3.8-20120326.jar
47a67dca6037 1900385 lib/jena-2.6.3.jar
ef40629be7ad 2319126 lib-source/jena-2.6.3-sources.jar
9c985c7d5acd 2388361 lib/antlr-3.4-complete.jar
ccd8163421ba 2666695 lib/xmlbeans-2.3.0.jar
f5e8c167e7f7 3233439 lib/icu4j-3.4.4.jar
86251cc60d49 3471911 Jenna-2.6.3/jena-2.6.3-javadoc.jar
3f951d9c751e 3728517 lib/saxon8.jar
9283af18b3b8 14003759 ApachePOI/ooxml-schemas-1.0.jar
89ee82ed226d 54043450 lib-source/poi-src-3.8-20120326.jar
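The thread does not say how the list above was produced. One common way to get a similar "id, size, path" listing is to pipe `git rev-list --objects --all` into `git cat-file --batch-check`. The following is a minimal sketch under that assumption; the function names `parse_batch_check` and `largest_blobs` are hypothetical helpers, not part of any tool mentioned here.

```python
import subprocess


def parse_batch_check(output):
    """Parse 'type sha size path' lines and return blobs sorted by size.

    Each returned item is a (size_in_bytes, sha, path) tuple.
    Non-blob objects (commits, trees, tags) are skipped.
    """
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            _, sha, size, path = parts
            blobs.append((int(size), sha, path))
    return sorted(blobs)


def largest_blobs(repo_dir, top=16):
    """Return the `top` largest blobs reachable from any ref in repo_dir."""
    # List every object reachable from any ref, as "<sha> <path>" lines.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for the type and size of each object; %(rest) echoes the path.
    checked = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True,
        capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked)[-top:]
```

Calling `largest_blobs(".")` inside a clone returns the biggest blobs last, which matches the ordering of the list above.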

https://rtyley.github.io/bfg-repo-cleaner/ can help with file removal:

java -jar bfg.jar --delete-files '*.jar' results in a 20MiB repository (10x reduction).
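For reference, the BFG documentation describes running the tool against a fresh mirror clone and then expiring the reflog and running `git gc` so the removed blobs are actually dropped. A minimal sketch of that sequence follows; the repository URL, the mirror directory and the location of `bfg.jar` are assumptions supplied by the caller, `bfg_cleanup_commands` and `run_all` are hypothetical helper names, and the final force-push is deliberately left as a manual step.

```python
import subprocess


def bfg_cleanup_commands(repo_url, mirror_dir, bfg_jar="bfg.jar", pattern="*.jar"):
    """Build the command sequence for a BFG file-removal run.

    Nothing is executed here; the commands are returned as argv lists
    so they can be reviewed before running.
    """
    return [
        # Work on a bare mirror so every ref is rewritten.
        ["git", "clone", "--mirror", repo_url, mirror_dir],
        # Remove all files matching the glob from history.
        ["java", "-jar", bfg_jar, "--delete-files", pattern, mirror_dir],
        # Drop the old objects so the size reduction becomes visible.
        ["git", "-C", mirror_dir, "reflog", "expire", "--expire=now", "--all"],
        ["git", "-C", mirror_dir, "gc", "--prune=now", "--aggressive"],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping the command list separate from its execution makes it easy to inspect or log the steps before rewriting anything.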

goneall commented 5 years ago

I just pushed a cleaned up repo. I did get some errors for the pull request related refs:

 ! [remote rejected]   refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)
 ! [remote rejected]   refs/pull/1/merge -> refs/pull/1/merge (deny updating a hidden ref)
...

@vlsi let me know if this comes across clean for you. I do have a backup of the old repo if we need to restore it.

vlsi commented 5 years ago

That was fast. There are also the following folders: Jenna-2.6.3 and commons-lang-2.3. Do you need them?

For instance: java -jar bfg.jar --delete-folders Jenna-2.6.3, java -jar bfg.jar --delete-folders commons-lang-2.3

Note: by default BFG does not keep a reference to the "former commit ids" when deleting files (it assumes the files contain private info, so it hides the previous commits to prevent someone from brute-forcing the data).

You might use my BFG release, https://github.com/vlsi/bfg-repo-cleaner/releases, which has a --no-private flag so that you can keep the "former-commit-id" header (see a sample here: https://github.com/apache/jmeter/commit/7561325be56c0481488da4d0307885611017acb6 )

vlsi commented 5 years ago

PS. Which software do you use to produce LICENSE.spdx / SPDXParser.spdx? It looks like every time you save the file its contents are randomly shuffled, so git thinks you are creating a multi-megabyte file "from scratch".

vlsi commented 5 years ago

I did get some errors for the pull request related refs:

That is expected. GitHub does not allow updating refs/pull/* refs.

goneall commented 5 years ago

Since it looks like my history rewrite tripped up ORT (heremaps/oss-review-toolkit), and the repository is now a reasonable size, I think I'll just leave the 2 unused Jenna and commons lang folders.

@vlsi Thanks for the suggestion and info.

goneall commented 5 years ago

PS. Which software do you use to produce LICENSE.spdx / SPDXParser.spdx? It looks like every time you save the file its contents are randomly shuffled, so git thinks you are creating a multi-megabyte file "from scratch".

This is produced by the SPDX Maven Plugin. Since it is in RDF format and the Jena libraries do not preserve any order, it gets completely regenerated. I'm thinking it should be removed from the source directory entirely and only stored in the release artifacts, including Maven Central, Bintray and the GitHub release artifacts, rather than kept in a directory under source control.

I'll open a new issue for this.

vlsi commented 5 years ago

I'm thinking it should be removed from the source directory entirely and only stored in the release artifacts including Maven Central

Please do that if the file is not required in the source tree (and also remove it from historical commits).

Jena libraries do not preserve any order, it gets completely regenerated

I think it is valid to raise an issue with Jena (or the SPDX Maven Plugin) to add explicit ordering, so the build artifacts could be reproducible.
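To illustrate what "explicit ordering" could mean in practice, here is a minimal sketch that sorts the statements of an RDF document serialized as N-Triples (one statement per line). This is not how Jena or the SPDX Maven Plugin work; `canonical_ntriples` is a hypothetical helper, and it assumes the file has already been exported as N-Triples. Sorting alone also does not stabilize blank-node labels, which may still differ between runs.

```python
def canonical_ntriples(text):
    """Return N-Triples text with statements de-duplicated and sorted.

    An RDF graph is a set of statements, so reordering the lines does
    not change its meaning, but it does make the serialized file stable
    for diffing when the same statements are written in a different order.
    """
    lines = {line.strip() for line in text.splitlines() if line.strip()}
    return "\n".join(sorted(lines)) + "\n"
```

With output ordered like this, regenerating the file from unchanged input would produce an identical file, so git would store no new multi-megabyte blob.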

mnonnenmacher commented 5 years ago

Please do that if that is not required (+remove from historical commits).

Please do not rewrite the history again; it's bad practice for public repositories. @vlsi: You didn't provide a rationale for why the repository size is an issue. If it's about clone performance, e.g. on CI, why not use a shallow clone?

vlsi commented 5 years ago

If it's about clone performance

Clone performance for testing purposes (e.g. running tests). Other operations such as commit are impacted as well, because git needs to run GC from time to time, and GC time depends on the repo size.

There's a side concern as well: disk space for all involved parties. The ones who clone, Travis, GitHub, etc, etc.
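The on-disk size being discussed can be measured locally with `git count-objects -v`, which reports `size-pack` in KiB. A minimal sketch, assuming it is run against an existing clone; `parse_count_objects` and `pack_size_mib` are hypothetical helper names.

```python
import subprocess


def parse_count_objects(output):
    """Parse 'key: value' lines from `git count-objects -v` into a dict of ints."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def pack_size_mib(repo_dir):
    """Return the size of the packed objects in repo_dir, in MiB."""
    out = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    # size-pack is reported in KiB when -H is not given.
    return parse_count_objects(out)["size-pack"] / 1024
```

Comparing this number before and after a cleanup gives a concrete figure for the saving per clone.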

it's a bad practice for public repositories

By the way, you didn't provide a rationale why this specific repo must not be rebased.

mnonnenmacher commented 5 years ago

Clone performance for testing purposes (e.g. running tests).

For this, as suggested, you can use shallow clones, because you usually don't need the commit history to run tests. No need to rewrite the history.
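A shallow clone is requested with git's `--depth` option, which limits how much history is downloaded. A minimal sketch of what a CI job might run; the repository URL and target directory are placeholders supplied by the caller, and `shallow_clone_command` is a hypothetical helper name.

```python
import subprocess


def shallow_clone_command(repo_url, target_dir, depth=1):
    """Build a `git clone` command that fetches only the latest `depth` commits."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ["git", "clone", "--depth", str(depth), repo_url, target_dir]


def shallow_clone(repo_url, target_dir, depth=1):
    """Run the shallow clone and raise if git reports an error."""
    subprocess.run(shallow_clone_command(repo_url, target_dir, depth), check=True)
```

Note that a shallow clone still downloads every file in the checked-out commit, so jars committed at the tip of the branch are fetched either way; only the older history is skipped.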

Other operations such as commit are impacted as well, because git needs to run GC from time to time, and GC time depends on the repo size.

There's a side concern as well: disk space for all involved parties. The ones who clone, Travis, GitHub, etc, etc.

This sounds rather theoretical: performance issues with git commit in a 62M repository? What I was getting at with my question was: do you actually have issues with the size of the repository, or is this premature optimization?

By the way, you didn't provide a rationale why this specific repo must not be rebased.

Nothing special with this repository, just the general issue that now all forks and all local copies of the repository are out of sync, causing extra work for contributors. Balancing gain and cost of rewriting history is of course up to @goneall.

vlsi commented 5 years ago

do you actually have issues with the size of the repository

It was so slow to download that I went ahead and created an issue.

Nothing special with this repository

This repo has 0 PRs, and just 38 forks. So it should not hurt much.