marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Training Wikidata embedding #55

Closed: dlauc closed this issue 3 years ago

dlauc commented 3 years ago

I'm trying to create embeddings for Wikidata, using this conf file:

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390
...

However, I am getting the error:

ValueError: cannot create std::vector larger than max_size()

Looking for any workaround, thanks

JasonMoho commented 3 years ago

Not sure what's causing this issue; I'll need some more context to help get you up and running.

Could you send the full program output / stack trace?

dlauc commented 3 years ago

Sure, the full stack trace is:

[info] [08/07/21 10:46:37.710] Start preprocessing
Traceback (most recent call last):
  File "/projekti/venv37/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "/projekti/venv37/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
ValueError: cannot create std::vector larger than max_size()

JasonMoho commented 3 years ago

Let's check a few things to isolate the issue:

Try training with another dataset/configuration; this will help us figure out whether it's a build/environment issue. Try the fb15k example in the readme. The steps are as follows:
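
Roughly, the readme steps look like the following (the exact flags and example config path may differ between versions, so double-check against the readme in your checkout and treat these two lines as a sketch rather than exact commands):

marius_preprocess fb15k_out/ --dataset fb15k
marius_train examples/training/configs/fb15k_cpu.ini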

Make sure to run these steps within the same working directory so that the configuration file paths are resolved properly. If everything works fine, you should see training output displayed in the terminal.

If you encounter no issues with these steps, please send me your full Wikidata configuration file to train and the commands you used for preprocessing and training.

dlauc commented 3 years ago

@JasonMoho thanks for your reply. I've already tested all the examples and they work fine (I had some initial problems with Python 3.9 and 3.8, but everything works with 3.7).

The command I've used for preprocessing is:

marius_preprocess ./wiki --files /backup/wikidata/graph.csv --delim "," --dataset_split .001 .001 --num_partitions 16

The input file is parsed from the Wikidata dump, 612283024 lines like:

Q31,P1344,Q1088364
Q31,P1151,Q3247091
Q31,P1546,Q1308013
...

Below is the config I've tried for training. The OS is Ubuntu 18.04.5, and the Marius codebase is the latest master branch.

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390

[storage]
num_paritions=16
edges_backend=FlatFile
embeddings_backend=PartitionBuffer
relations_backend=DeviceMemory
num_partitions=1
buffer_capacity=10
prefetching=true

[training]
synchronous=false

[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin

JasonMoho commented 3 years ago

Thanks for sending those! The issue is likely due to the configuration file.

A couple of changes are needed in the config file:

Hopefully those changes will fix the issue.

I also see that you are using CPU training; this will take quite a bit of time on Wikidata, as CPU computation will be the bottleneck. If you're interested, I can suggest some config options that will speed up training if you share some further info, e.g. your CPU core count and available memory.

dlauc commented 3 years ago

@JasonMoho thank you so much for the detailed instructions. I've corrected the errors in the config file, updated the codebase, and redone the preprocessing, but the error still occurs:

Traceback (most recent call last):
  File "/projekti/venv37/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "/projekti/venv37/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
ValueError: cannot create std::vector larger than max_size()

My use case is entity matching, and my plan is to use Wikidata entity embeddings for similarity estimation. I've tried the FB BigGraph pre-trained embeddings, but they perform worse than my baseline (simple Jaccard similarity on the first-level links). I know training will be slow on CPU (I have 64 cores and 256 GB RAM), but I'll play with the parameters to finish training in a reasonable time.
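
For reference, the Jaccard baseline is just set overlap over each entity's first-level links, along the lines of the minimal Python sketch below (the neighbor sets here are only illustrative, not real data from my pipeline):

def jaccard(a, b):
    """Jaccard similarity between two neighbor sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Map each entity to the set of entities it links to (one hop in the graph).
neighbors = {
    "Q31": {"Q1088364", "Q3247091", "Q1308013"},
    "Q142": {"Q1088364", "Q3247091", "Q90"},
}

print(jaccard(neighbors["Q31"], neighbors["Q142"]))  # 2 shared / 4 total = 0.5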

JasonMoho commented 3 years ago

Today I will update the system to include better debug information. Once that's ready, you can run again in debug mode and that should help us isolate this issue.

JasonMoho commented 3 years ago

A few things for you:

Config:

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390

[model]
embedding_size=128 // Increasing this gives better accuracy but slower training.

[storage]
edges_backend=HostMemory // Size of edges is only ~15GB to store, fits in CPU memory
embeddings_backend=HostMemory // Size of embedding table + optimizer state is only ~94GB, fits in CPU memory
relations_backend=HostMemory

[training]
batch_size=10000 
num_chunks=10
negatives=500 // Increasing the number of negative samples gives better accuracy but slower training.
num_epochs=10 // On large datasets, these models converge close to peak accuracy at about 3-10 epochs in my experience
synchronous=false

// Some pipeline settings to best utilize your 64 cores. 
[training_pipeline]
max_batches_in_flight=64 
num_embedding_loader_threads=8
num_compute_threads=16
num_embedding_update_threads=8

[evaluation]
batch_size=1000 
max_batches_in_flight=64 
num_embedding_loader_threads=8
num_evaluate_threads=32

[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin
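
To train with it, save the config to a file and pass that file to marius_train; the filename below is just a placeholder:

marius_train wikidata_cpu.ini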

dlauc commented 3 years ago

@JasonMoho thank you so much, the training is working now. I'll share the embeddings if they turn out to be any good.

DominikFilipiak commented 2 years ago

@dlauc were the embeddings any good? If so, would you like to share them?

dlauc commented 2 years ago

@DominikFilipiak, unfortunately not; my baseline (Jaccard) was better, so I've not kept the embeddings.

JasonMoho commented 2 years ago

@dlauc Interesting that these embeddings perform worse than the Jaccard baseline. What evaluation scenario and metrics did you use to determine this?

We've recently added some updates to the system which improve the quality of the learned embeddings. Namely, support for GNN models, node features, and sampling schemes which reduce the biases induced by partitioned training.

I'd like to see if I can reproduce your evaluation scenario on wikikg90m and possibly get some embeddings which outperform the Jaccard baseline. If so, I can make the embeddings or the training configuration publicly available depending on the hosting costs.

thodrek commented 2 years ago

@dlauc which model did you use with Marius for learning these embeddings?