microsoft / ALEX

A library for building an in-memory, Adaptive Learned indEX
MIT License

Could I have the four datasets? #8

Closed kaiwang19 closed 3 years ago

kaiwang19 commented 3 years ago

Hi, I am wondering if I could get the four datasets in the paper?

- longitudes
- longlat
- lognormal
- YCSB

The given sample dataset is only 200M, which is part of the longitudes dataset. I have also tried to extract a dataset from OpenStreetMap myself, but I assume there must be a strategy you used to select the longitudes or latitudes. Could you say a little about this strategy? Thanks.

jialinding commented 3 years ago

You can now find links to the datasets in the README.

The longitude and latitude values should be GPS coordinates from randomly-selected locations in OSM. But I did not generate the longitude and latitude values myself, so I don't know the exact selection procedure.

kaiwang19 commented 3 years ago

Thanks a lot!

kaiwang19 commented 3 years ago

Dear Jialin,

I found that the lognormal dataset and the YCSB dataset cannot be bulk loaded properly. Could you double-check whether these two datasets are the original ones from the paper?

For the lognormal dataset, there are 190M keys.

For the YCSB dataset, there are 200M keys.

I then debugged the code to see what happened. The problem is at line 731 of `alex.h`. The if condition checks whether `num_keys <= derived_params_.max_data_node_slots * data_node_type::kMinDensity`. Here `derived_params_.max_data_node_slots` is 1,048,576 and `data_node_type::kMinDensity` is 0.6, so up to 1,048,576 * 0.6 = 629,145.6 keys are fine for bulk loading, but if there are more keys than that in lognormal or YCSB, ALEX cannot handle them.

The weird thing is that when I test the same number of keys on longitudes and longlat, everything is fine.

I thus doubt whether the lognormal and YCSB datasets are correct. Or should I set some parameters specifically for these two datasets? Thanks.

jialinding commented 3 years ago

I can't reproduce these errors. Can you try running the benchmark executable, as described in the README? For example, to bulk load 700K keys from YCSB, change line 16 of `src/benchmark/main.cpp` to `#define KEY_TYPE uint64_t`, then run this command:

```
./build/benchmark \
--keys_file=[path to location of YCSB dataset, might need to be an absolute path] \
--keys_file_type=binary \
--init_num_keys=700000 \
--total_num_keys=1000000 \
--batch_size=100000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats
```

kaiwang19 commented 3 years ago

Thank you so much, the problem is solved now. I used `int64_t` before, which is why it failed. After changing `int64_t` to `uint64_t`, everything works fine. Thank you.