recsyschallenge / 2017

Dataset zip appears to be corrupted #4

Closed danielegrattarola closed 7 years ago

danielegrattarola commented 7 years ago

Hi,

sorry to open an issue about 5 milliseconds after the competition started, but we tried to download the dataset zip on different PCs and OSs and found that the file seems to be corrupted somehow. The problem seems to be related to the interactions.csv file, which is cut off at about the 16 millionth line and which weighs 400MB once extracted (the zip itself is 1.4 GB).

Is this a problem on our end or does anyone else have this problem?

Thanks, Daniele

fabianabel commented 7 years ago

Hi Daniele,

I cannot reproduce the problem. I just downloaded the dataset via 2 different channels (https://recsys.xing.com and via scp from the machines that actually host the data) and unzipped the files on Ubuntu 16.04 (using UnZip 6.00) and Mac OS 10.10.5 (using UnZip 5.52), and both work, e.g. on Mac OS:

$  unzip -v
UnZip 5.52 ....

$ unzip data_2017.zip
Archive:  data_2017.zip
  inflating: interactions.csv
  inflating: items.csv
  inflating: targetItems.csv
  inflating: targetUsers.csv
  inflating: users.csv

$ ls -alh
... 1.4G   Mar  3 22:53   data_2017.zip
... 8.4G   Mar  2 15:50   interactions.csv
... 226M   Mar  1 21:29   items.csv
... 341K   Mar  2 18:01   targetItems.csv
... 550K   Mar  3 16:02   targetUsers.csv
... 82M    Mar  1 21:27   users.csv

$ wc -l interactions.csv
322776003 interactions.csv

$ head interactions.csv
recsyschallenge_v2017_interactions_final_anonym_training_export.user_id recsyschallenge_v2017_interactions_final_anonym_training_export.item_id recsyschallenge_v2017_interactions_final_anonym_training_export.interaction_type    recsyschallenge_v2017_interactions_final_anonym_training_export.created_at
2082156 80  1   1484299172
1934123 140 1   1486388563
1320213 240 1   1479409825
297303  310 1   1484817366
1635596 310 1   1486370081
857319  340 1   1485121421
324595  350 1   1484591946
510320  350 1   1484841341
499620  390 1   1479387826

I'm not sure why it does not work for you. @danielegrattarola if the problem remains (e.g. after you have tried to download the data again) or if anyone else has the same problem, then please let @dkohlsdorf or myself know by commenting in this issue. Thank you!

Once you have downloaded the data and unzipped it (unzip data_2017.zip), you should see the following files:

| File | Size | Number of lines | Description |
| --- | --- | --- | --- |
| interactions.csv | ca. 8.5G | 322776003 | interactions between users and items |
| users.csv | ca. 82M | 1497021 | details about users |
| items.csv | ca. 226M | 1306055 | details about items |
| targetUsers.csv | ca. 550K | 74841 | IDs of users to whom item recommendations can be pushed |
| targetItems.csv | ca. 340K | 46559 | IDs of items for which users (from targetUsers.csv) should be identified that may be interested in the item |
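
For anyone without `wc` at hand (for example on Windows), the line counts listed above can be checked with a short Python script. This is only a sketch: the function names `count_lines` and `check_directory` are made up for illustration, and it assumes the five CSV files were extracted into a single directory. It counts newline bytes, which is what `wc -l` reports.

```python
import os

# Expected line counts, copied from the list of files above.
EXPECTED_LINES = {
    "interactions.csv": 322776003,
    "users.csv": 1497021,
    "items.csv": 1306055,
    "targetUsers.csv": 74841,
    "targetItems.csv": 46559,
}


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in chunks (like `wc -l`)."""
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def check_directory(directory, expected=EXPECTED_LINES):
    """Return {file name: (actual_lines, expected_lines, ok)}.

    A missing file is reported with actual_lines set to None.
    """
    report = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        got = count_lines(path) if os.path.isfile(path) else None
        report[name] = (got, want, got == want)
    return report
```

Calling `check_directory(".")` in the directory holding the extracted files returns one entry per file. A truncated interactions.csv would show up as an entry whose actual count is far below 322776003.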

For more details about the dataset, see: Dataset description. We will also try to publish some stats about the dataset soon.

Cheers, fabian

danielegrattarola commented 7 years ago

Hi Fabian, thanks for the reply, it must be something related to our PCs then. We'll try again and let you know if the problem persists, but I guess we should be able to solve this if it's just on our end.

I'll leave the issue open just in case somebody else has this problem. Thanks again, Daniele

jbochi commented 7 years ago

The data looks okay for me. I got the same number of lines and the same file sizes that @fabianabel posted.

md5 sum is 28cbf5dad71582e9a204c43afdd86cfc
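
To compare a download against this checksum on any OS, something like the following Python sketch could be used. It assumes the MD5 above refers to data_2017.zip itself (the comment does not say which file it was computed on), and the helper names `md5_of_file` and `archive_matches` are hypothetical.

```python
import hashlib

# Checksum quoted in this thread; assumed here to be for data_2017.zip.
EXPECTED_MD5 = "28cbf5dad71582e9a204c43afdd86cfc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_matches("data_2017.zip")` returning False would suggest that the download itself is damaged, rather than the extraction step.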

phsimon commented 7 years ago

Hi,

For me it works on Ubuntu, but I get an error on Windows 7 (the same error as reported above, with interactions.csv; a problem with the CRC?).
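
One way to check whether the CRC problem lies in the archive or in the unzip tool is to let Python's standard `zipfile` module verify every member without extracting anything. This is a sketch, not a diagnosis of the Windows 7 error: `check_archive` is a hypothetical helper, and it only tells you whether the downloaded file passes its own CRC checks.

```python
import zipfile


def check_archive(archive_path):
    """Return (ok, detail) after CRC-checking every member of a zip archive."""
    try:
        with zipfile.ZipFile(archive_path) as archive:
            # testzip() reads every member and returns the first bad name, or None.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as error:
        return False, f"not a readable zip archive: {error}"
    if bad_member is not None:
        return False, f"CRC or header check failed for {bad_member}"
    return True, "all members passed the CRC check"
```

If `check_archive("data_2017.zip")` reports success on the machine where extraction fails, the download is probably intact and the extraction tool is the more likely culprit.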

Best Regards.