src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

use MD5 hashes when available #58

Closed campoy closed 6 years ago

campoy commented 6 years ago

Fixes #36 Fixes #37

Signed-off-by: Francesc Campoy <campoy@golang.org>

bzz commented 6 years ago

Nice improvement @campoy !

I tried running it with a corrupted (partial) index file, and here is the output:

$ go build -o pga.md5
$ ./pga.md5 list | wc -l

WARN[0000] could not check md5 hashes for latest.csv.gz, comparing timestamps instead: could not fetch hash at latest.csv.gz.md5: 404 Not Found
Error: unexpected EOF
Usage:
  pga list [flags]

Flags:
  -f, --format string      format of the output (url, csv, or json) (default "url")
  -h, --help               help for list
  -l, --lang stringSlice   list of languages that the repositories should have
  -u, --url string         regular expression that repo urls need to match

Global Flags:
  -v, --verbose   log more information

10589

I find it a bit confusing. #57 should help, but maybe something can still be improved here? E.g. catching this error and printing "Most probably you have a corrupt index file in ~/.pga/" instead of printing the generic CLI help together with Error: unexpected EOF.
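
The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the pga implementation: `CorruptIndexError`, `decodeIndex`, `wrapTruncated`, and `gzipBytes` are hypothetical names, and the sketch assumes the index is read through Go's `compress/gzip`, where a truncated stream surfaces as `io.ErrUnexpectedEOF`.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// CorruptIndexError is a hypothetical error type (not part of pga) that
// carries a user-facing hint about a truncated index file.
type CorruptIndexError struct {
	Path string
	Err  error
}

func (e *CorruptIndexError) Error() string {
	return fmt.Sprintf("most probably you have a corrupt index file at %s (%v); delete it and run the command again", e.Path, e.Err)
}

func (e *CorruptIndexError) Unwrap() error { return e.Err }

// decodeIndex gunzips the raw bytes of an index file. A truncated gzip
// stream is reported as io.ErrUnexpectedEOF, which is wrapped with a hint.
func decodeIndex(path string, raw []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, wrapTruncated(path, err)
	}
	defer zr.Close()

	data, err := io.ReadAll(zr)
	if err != nil {
		return nil, wrapTruncated(path, err)
	}
	return data, nil
}

// wrapTruncated adds the hint only for truncation; other errors pass through.
func wrapTruncated(path string, err error) error {
	if errors.Is(err, io.ErrUnexpectedEOF) {
		return &CorruptIndexError{Path: path, Err: err}
	}
	return err
}

// gzipBytes compresses data in memory; it only builds demo input.
// Errors are ignored because writes to a bytes.Buffer do not fail.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	full := gzipBytes([]byte("URL,LANGS\n"))
	partial := full[:len(full)-4] // simulate an interrupted download

	if _, err := decodeIndex("~/.pga/latest.csv.gz", partial); err != nil {
		fmt.Println("Error:", err)
	}
}
```

Running this prints an error that names the file instead of a bare unexpected EOF. If the CLI is built on cobra (an assumption based on the format of the help output), setting `SilenceUsage: true` on the command would also keep the usage text from being printed on runtime errors.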

Here is the file system stat for the index file:

stat ~/.pga/latest.csv.gz
  File: /Users/alex/.pga/latest.csv.gz
  Size: 1573649     Blocks: 3080       IO Block: 4096   regular file
Device: 1000004h/16777220d  Inode: 23927156    Links: 1
Access: (0644/-rw-r--r--)  Uid: (  501/    alex)   Gid: (   20/   staff)
Access: 2018-05-14 11:35:50.000000000 +0200
Modify: 2018-05-14 11:16:57.000000000 +0200
Change: 2018-05-14 11:16:57.000000000 +0200
Birth: 2018-03-27 07:20:58.000000000 +0200

campoy commented 6 years ago

@bzz: regarding the failure when the index is corrupt, there is probably some work to do to improve usability. Could you file an issue specifically for that case?

Thanks for the thorough review!

bzz commented 6 years ago

> regarding the failure when the index is corrupt

I think #57 would avoid that. The suggestion in https://github.com/src-d/datasets/pull/58#discussion_r187908797 was about adding user feedback while the index file is downloading.
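
One way such download feedback might look, as a rough sketch only: `progressWriter` and `copyWithProgress` are hypothetical helpers, not pga code, and a `strings.Reader` stands in for the HTTP response body of the index.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// progressWriter is a hypothetical helper (not part of pga): it counts the
// bytes passing through it and hands the running total to report.
type progressWriter struct {
	total  int64
	report func(total int64)
}

func (p *progressWriter) Write(b []byte) (int, error) {
	p.total += int64(len(b))
	p.report(p.total)
	return len(b), nil
}

// copyWithProgress copies src to dst and calls report after every chunk read.
func copyWithProgress(dst io.Writer, src io.Reader, report func(int64)) (int64, error) {
	pw := &progressWriter{report: report}
	return io.Copy(dst, io.TeeReader(src, pw))
}

func main() {
	// A strings.Reader stands in for the response body of the index download.
	body := strings.NewReader("pretend this is latest.csv.gz")

	n, err := copyWithProgress(io.Discard, body, func(total int64) {
		fmt.Fprintf(os.Stderr, "\rdownloading latest.csv.gz: %d bytes", total)
	})
	fmt.Fprintln(os.Stderr)
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		return
	}
	fmt.Printf("downloaded %d bytes\n", n)
}
```

Writing the progress line to stderr keeps stdout clean, so piping the output of a command such as `pga list` into `wc -l` would not count the feedback.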