src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

unexpected EOF leads to corrupted siva files. #36

Closed. zurk closed this issue 6 years ago

zurk commented 6 years ago

Using

cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/

to download the PGA dataset, I get an unexpected EOF error for just a few files:

➜  sourced cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/
downloading siva files by name from stdin
filter flags will be ignored
 67503 / 257391 [====================>--------------------------------------------------------]  26.23% 40m24s
could not get siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: could not check mod time in http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: Head http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: dial tcp 147.135.10.8:80: i/o timeout
 91710 / 257391 [===========================>-------------------------------------------------]  35.63% 44m47s
could not get siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: could not copy http://pga.sourced.tech//siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva to /media/k/data/PGA/siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: unexpected EOF
 205637 / 257391 [===========================================================>--------------]  79.89% 2h27m11s
could not get siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: could not copy http://pga.sourced.tech//siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva to /media/k/data/PGA/siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: unexpected EOF
 206423 / 257391 [===========================================================>--------------]  80.20% 2h26m33s
could not get siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: could not copy http://pga.sourced.tech//siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva to /media/k/data/PGA/siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: unexpected EOF
 257391 / 257391 [========================================================================================================================================================] 100.00%

This may be due to network problems or something else. But at the end of the run these files were still present on disk. When I downloaded them manually to put them into the corresponding folder, I found that the sizes are really different from those of the files already there.

[Screenshot from 2018-03-16 14-27-59]

So it would be better to delete such files, or to retry downloading them several times.
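A rough sketch of that suggestion, in Python rather than in pga's own code: write each download to a temporary file, compare the number of bytes written with the size the server announced, and either retry or delete the partial file. The names `download`, `http_fetch` and the `.part` suffix are made up for illustration, and this is not how pga is actually implemented.

```python
import http.client
import os
import urllib.request
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical fetch signature: url -> (announced size or None, body chunks).
Fetch = Callable[[str], Tuple[Optional[int], Iterator[bytes]]]


def http_fetch(url: str) -> Tuple[Optional[int], Iterator[bytes]]:
    """Open url and return (Content-Length or None, iterator over body chunks)."""
    resp = urllib.request.urlopen(url, timeout=60)
    length = resp.headers.get("Content-Length")

    def chunks() -> Iterator[bytes]:
        with resp:
            while True:
                block = resp.read(1 << 20)
                if not block:
                    return
                yield block

    return (int(length) if length is not None else None), chunks()


def download(url: str, dest: str, fetch: Fetch = http_fetch, attempts: int = 3) -> bool:
    """Download url to dest, retrying on errors or truncated bodies.

    Data goes to dest + ".part" first, so a truncated file never ends up
    under the final name. Returns True on success, False if every attempt failed.
    """
    part = dest + ".part"
    for _ in range(attempts):
        try:
            expected, chunks = fetch(url)
            written = 0
            with open(part, "wb") as out:
                for block in chunks:
                    out.write(block)
                    written += len(block)
            # Without an announced size we can only trust a clean end of stream.
            if expected is None or written == expected:
                os.replace(part, dest)
                return True
        except (OSError, http.client.HTTPException):
            pass  # network error or unexpected EOF: fall through and retry
        # This attempt failed or was truncated: remove the partial file.
        if os.path.exists(part):
            os.remove(part)
    return False
```

With something like this, a file that hits an unexpected EOF is either fetched again or is absent at the end, so it cannot be mistaken for a complete siva file.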

campoy commented 6 years ago

Hopefully this will be fixed by https://github.com/src-d/datasets/issues/37.

zurk commented 6 years ago

Yes, checking the files against md5 checksums is a really good idea for such a big dataset.
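For illustration, an md5 check over already downloaded files could look roughly like the Python sketch below. How the expected checksums would be published is not specified in this thread, so the sketch simply assumes they are available as a mapping from local path to hex digest; `md5_of` and `corrupted_files` are hypothetical helper names.

```python
import hashlib
from typing import Dict, List


def md5_of(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def corrupted_files(expected: Dict[str, str]) -> List[str]:
    """Return the paths that are missing, unreadable, or do not match.

    expected maps a local file path to its expected hex md5 digest
    (an assumed input format, chosen for this sketch).
    """
    bad = []
    for path, want in expected.items():
        try:
            ok = md5_of(path) == want.lower()
        except OSError:
            ok = False  # missing or unreadable counts as corrupted
        if not ok:
            bad.append(path)
    return bad
```

The paths it returns are the candidates to delete and download again.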