github / licensed

A Ruby gem to cache and verify the licenses of dependencies
MIT License
976 stars 122 forks source link

Licensed::Git.version() is returning a bogus SHA from submodules #80

Closed lildude closed 5 years ago

lildude commented 6 years ago

As part of updating the cached licenses in Linguist in https://github.com/github/linguist/pull/4268, I've just updated the Licensed gem (locally) from ~> 1.1.0 to ~> 1.3.0 which has pulled in Licensed 1.3.3.

Within script/licensed we're calling Licensed::Git.version() as follows:

Dir.glob(@glob).map do |directory|
  Licensed::Dependency.new(directory, {
    "type" => Filesystem.type,
    "name" => File.basename(directory),
    "version" => Licensed::Git.version(directory)
  })
end

... where directory is a git submodule for each grammar.

With Licensed 1.1.0, this determined the correct SHA for each submodule, but on 1.3.3 it returns a completely bogus SHA as seen with the most recent update of the cached licenses I performed after updating the gem:

$ git --no-pager diff vendor/licenses/grammar/ABNF.tmbundle.txt
diff --git a/vendor/licenses/grammar/ABNF.tmbundle.txt b/vendor/licenses/grammar/ABNF.tmbundle.txt
index 9afaeecb..e14f7c48 100644
--- a/vendor/licenses/grammar/ABNF.tmbundle.txt
+++ b/vendor/licenses/grammar/ABNF.tmbundle.txt
@@ -1,10 +1,10 @@
 ---
 type: grammar
 name: ABNF.tmbundle
-version: 0042b3d7ee21b2f800a8f80d42761513f16d9a52
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
 license: apache-2.0
 ---
-Apache License
+                                 Apache License
                            Version 2.0, January 2004
                         http://www.apache.org/licenses/
$
$ cd vendor/grammars/ABNF.tmbundle
$ git rev-parse HEAD
0042b3d7ee21b2f800a8f80d42761513f16d9a52
$
$ git show a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
fatal: bad object a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
$

That same SHA has been applied to multiple license files:

$ git --no-pager diff | grep -B2  a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
 name: ABNF.tmbundle
-version: 0042b3d7ee21b2f800a8f80d42761513f16d9a52
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: AutoHotkey
-version: dc20ea3b615d902fc43045c1073353af5f9e0edf
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: Lean.tmbundle
-version: e56b352bfc2bfc6b28154a8d7d68e469a454386b
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: PHP-Twig.tmbundle
-version: 4fe0a237dc534a2d94a468f0ab1319c09ae1b573
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: atomic-dreams
-version: 234b52d4828230d7741c35af2681619b55e972eb
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: language-jsoniq
-version: 5e400b06f1b417746c4c6aff42a2a938b6ef2f82
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: llvm.tmbundle
-version: 5db1a160f09b041777d1ed8f1a98a4386a363e38
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
--
 name: objective-c.tmbundle
-version: fb271290491fda66730f5dd90abd5c3787fb7c5e
+version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
$

Further digging shows we're getting a lot of duplicate SHAs:

$ git --no-pager diff | grep "+version" | sort | uniq -c | sort -r
  39 +version: 8369d253f87d15eacb51e4c0c8ea71927358d069
  26 +version: f9c00deecfb63efad37108f7b13699fc74bccbfe
  23 +version: 2942bf9f1bf7a4871167c046c1e27d26cb198200
  23 +version: 15e2b74dec453726188e1ae3faafd2e7eb3dc191
  20 +version: e60384b01890863754a9249405f57abd9622248c
  16 +version: ed44c0b86d317b753c607742b137dce9594bed97
  14 +version: b41875df5df876ae8ac8717baff9c5f26632b886
  10 +version: ea3e79a631fb2cd1f1b3c7fa9c6f68892b478892
   8 +version: a4d12cc8e42aae26a9469edb9fedb25a7cdbc9e9
   6 +version: d39a75b68b5870f1275eb27e1c5c41383d5eff4f
   5 +version: c59c88f16e672daeae30db3ba2bc701f5657c402
   4 +version: d2c7d27d13affd542d318a1f51ae77d96cb9c652
   4 +version: 4c66582f877c1846bdab00496543c3a038e867c6
   3 +version: edabdc75a5373e84dccc059a48cfc2d7c38bbde1
   3 +version: e335d48625c33acfe1a38e8f225af31649afa093
   3 +version: cb5bc91fe3550f6332c94e19bc74b0a2489bfb27
   3 +version: a944769d61795e2d7aea17e502996dbf250f261c
   3 +version: 9ca6a5841e93ea3f8a45adb3ac79bcc878b85892
   3 +version: 99dcd501aa7fa8d4efb4459ae79841de48a562a8
   3 +version: 994bc1f13560661f94c880845c967c91116e2f87
   3 +version: 68b553ea55f042a596f83c258e6accf4ba15b004
   3 +version: 2c3069db77b1d0a89330a5fff5fd3349db1e93e6
   2 +version: b802045c5c339ffe1e5a3b36d140e1d74a002dea
   2 +version: aae48b5b5eb358ac3e09c354972de38d647b5ad0
   2 +version: a18ad1d489f740ccf52f77f528c61794818eedbb
   2 +version: 6e40de47da9e774b7de722cf2dacf2d74f40595b
   2 +version: 50c08bf29e6e4de3623d5e7177c5e05ed1200348
   2 +version: 25aa6669bed3456326c8b1a15a9de767d09ff54e
   1 +version: f0c7380132465bd4a378b4e174f6bee896a96f1e
[...]
$

Every single license now has a new version and they're all wrong from what I can see from a quick glimpse - I don't fancy checking all 309 right now 😉

jonabc commented 6 years ago

@lildude ah it's not bogus. This (😞 unanticipated but arguably correct) change in behavior is because licensed no longer moves into the target directory before obtaining a commit SHA. Those SHA's that you're seeing are the linguist commits that updated the submodules, not the SHAs of the individual submodules.

I don't think this is a bug.

lildude commented 6 years ago

This (😞 unanticipated but arguably correct) change in behavior is because licensed no longer moves into the target directory before obtaining a commit SHA. Those SHA's that you're seeing are the linguist commits that updated the submodules, not the SHAs of the individual submodules.

🤯 yup, definitely unanticipated, and sort of makes sense now you mention it.

I don't think this is a bug.

🤔 If this is expected, it means new licenses won't be picked up until submodule updates have been committed:

$ git submodule update --remote
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
From https://github.com/codemirror/CodeMirror
   af3bd431..ab3e78af  master     -> origin/master
Submodule path 'vendor/CodeMirror': checked out 'ab3e78afb0bc7f0bf56a2038539b2853715e38aa'
Submodule path 'vendor/grammars/JSyntax': checked out 'daf9ff2cb011571825338b57c48b87fa8cca065d'
Submodule path 'vendor/grammars/MagicPython': checked out 'ad6bd6211944a1b24c6594fa75ee3b15f60d7559'
Submodule path 'vendor/grammars/NimLime': checked out 'd71fd1ae94a8597cec72445bc87c760497a9763c'
[...]
$
$ script/licensed | grep Caching
Caching licenses for linguist:  <--- I'D EXPECT LICENSE CHANGE DETECTION HERE
$
$ git commit -am 'Updqte grammars' 
[testing be906be4] Updqte grammars
 38 files changed, 38 insertions(+), 38 deletions(-)
$
$ script/licensed | grep Caching
Caching licenses for linguist:
    Caching atom-language-purescript (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching language-apl (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching quake (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching sublime-MuPAD (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching language-csound (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching language-roff (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
    Caching objective-c.tmbundle (be906be46e9fbc94b1e7bcfd9bf53d060a4c14a4)
[...]
$

This seems very counter-intuitive to me and is likely to lead to the assumption by many using a similar approach to us, that there has been no change to the license(s) until much later, possibly too late. Some may also not like the pollution of their history that comes from having to commit the submodule updates and then the license updates separately.

We're effectively in the same boat as the way Go dependencies currently work, which I note you implemented a workaround for as part of the change in https://github.com/github/licensed/pull/78 so I guess I'm going to need to do the same for Linguist and our custom source.

I'm not sure if or how this should be documented as I suspect we're in a unique position at the moment.

jonabc commented 6 years ago

If this is expected, it means new licenses won't be picked up until submodule updates have been committed: This seems very counter-intuitive to me and is likely to lead to the assumption by many using a similar approach to us, that there has been no change to the license(s) until much later, possibly too late.

IMO this is desired and how it should be working, at least when using the Git commit SHA as a version identifier. Having licensed pick up changes before they're committed into a repo sounds like a bug to me, as it doesn't necessarily reflect the state of the project that is distributed to others.

We're effectively in the same boat as the way Go dependencies currently work, which I note you implemented a workaround for as part of the change in #78

To be clear, that was not a workaround for submodules that are committed into the repo. That was a workaround for downloaded external projects that exist outside the repo.

I guess I'm going to need to do the same for Linguist and our custom source.

Sure. To be honest I'm not sure thats the correct thing to be doing as you're opening Linguist up to having mismatches between the committed submodule reference and whats in the license metadata.

lildude commented 6 years ago

To be honest I'm not sure thats the correct thing to be doing as you're opening Linguist up to having mismatches between the committed submodule reference and whats in the license metadata.

I'm not sure I follow how. Can you please elaborate.

From my experimenting this morning, having it this way means the cached license is tied to the version in the submodule SHA that we reference in Linguist. Both are committed to the repo at the same time as part of the slightly modified release process.

jonabc commented 6 years ago

Can you please elaborate.

Sure! If someone were to update a submodule reference, cache license metadata then update a submodule reference again, there would be a mismatch.

I guess as long as the method of obtaining the submodule reference is consistent then licensed would complain about the metadata being out of date when checking metadata status.

🤔

jonabc commented 5 years ago

@lildude theres now a source available for git submodules that will use the SHA of the submodule as the version string. With that source available then any manifest can and should continue to use commit SHAs for the root repo as opposed to the submodule.

👍 to close this issue?

lildude commented 5 years ago

👍 to close this issue?

👍 looks like it might do what I've implemented in Linguist. I'll look to update Linguist when I've got the bandwidth. Thanks for implementing this. 🙇 Closing.