Open wlngai opened 7 years ago
I just ran into this issue 5.5 years later :). If the program is interrupted while generating the cache file, it leaves a partial file behind, and the next execution assumes the cache is correct even though the row counts of the two files differ:
$ wc -l /data/gx/graphs/cache/datagen-7_5-fb.e
30759439 /data/gx/graphs/cache/datagen-7_5-fb.e
$ wc -l /data/gx/graphs/datagen-7_5-fb.e
34185747 /data/gx/graphs/datagen-7_5-fb.e
Graphalytics currently assumes that both the input graph and the cached graph are correct. This is mostly fine: if either is corrupt, the corresponding benchmark runs will fail validation. However, it is unclear to users why validation failed, because they also assume the input graph and cached graph are correct. These files can be corrupted accidentally, for example when the caching process is interrupted.
A checksum (e.g. SHA-1) should be computed and verified for these files to provide full validation.
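As a rough sketch of what this could look like (file names here are illustrative, not Graphalytics' actual cache layout): generate the cache into a temp file, record a checksum next to it, and only rename it into place once complete, so an interrupted run never leaves a cache that looks valid.

```shell
set -eu

src=graph.e            # hypothetical input edge list
cache=graph-cache.e    # hypothetical cache target

printf '1 2\n2 3\n' > "$src"   # stand-in for real graph data

# 1. Generate the cache into a temp file, so an interrupted
#    run leaves only the temp file, never a partial cache.
cp "$src" "$cache.tmp"

# 2. Store a checksum alongside the cache for later validation.
sha1sum "$cache.tmp" | awk '{print $1}' > "$cache.sha1"

# 3. Atomic rename: the cache file only appears once it is complete.
mv "$cache.tmp" "$cache"

# 4. On the next run, verify the checksum before trusting the cache.
if [ "$(sha1sum "$cache" | awk '{print $1}')" = "$(cat "$cache.sha1")" ]; then
    echo "cache OK"
else
    echo "cache corrupt, regenerating" >&2
fi
```

Even without the checksum, the temp-file-plus-rename step alone would prevent the partial-cache scenario above; the checksum additionally catches corruption of a fully written file.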