marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Training Wikidata embedding #55

Closed: dlauc closed this issue 3 years ago

dlauc commented 3 years ago

I'm trying to create embeddings for Wikidata, using this conf file:

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390
...

However, I am getting the error:

ValueError: cannot create std::vector larger than max_size()

Looking for any workaround, thanks

JasonMoho commented 3 years ago

Not sure what's causing this issue; I'll need some more context to help get you up and running.

Could you send the full program output / stack trace?

dlauc commented 3 years ago

Sure, the full stack trace is:

[info] [08/07/21 10:46:37.710] Start preprocessing
Traceback (most recent call last):
  File "/projekti/venv37/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "/projekti/venv37/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
ValueError: cannot create std::vector larger than max_size()

JasonMoho commented 3 years ago

Let's check a few things to isolate the issue:

Try training with another dataset/configuration; this will help us figure out whether it's a build/environment issue. Try the fb15k example in the readme. The steps are as follows:
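
Roughly, the readme steps look like the following (the exact flags and example config path may differ between versions, so double-check against the readme in your checkout and treat these two lines as a sketch rather than exact commands):

marius_preprocess fb15k_out/ --dataset fb15k
marius_train examples/training/configs/fb15k_cpu.ini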

Make sure to run these steps within the same working directory so that the configuration file paths are resolved properly. If everything works fine, you should see training output displayed in the terminal.

If you encounter no issues with these steps, please send me your full Wikidata configuration file to train and the commands you used for preprocessing and training.

dlauc commented 3 years ago

@JasonMoho thanks for your reply. I've already tested all the examples and they work fine (I had some initial problems with Python 3.9 and 3.8, but everything works with 3.7).

The command I've used for preprocessing is:

marius_preprocess ./wiki --files /backup/wikidata/graph.csv --delim "," --dataset_split .001 .001 --num_partitions 16

The input file is parsed from the Wikidata dump, 612283024 lines like:

Q31,P1344,Q1088364
Q31,P1151,Q3247091
Q31,P1546,Q1308013
...

Below is the config I've tried for training. The OS is Ubuntu 18.04.5, and the Marius codebase is the latest master branch.

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390

[storage]
num_paritions=16
edges_backend=FlatFile
embeddings_backend=PartitionBuffer
relations_backend=DeviceMemory
num_partitions=1
buffer_capacity=10
prefetching=true

[training]
synchronous=false

[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin

JasonMoho commented 3 years ago

Thanks for sending those! The issue is likely due to the configuration file.

A couple of changes are needed in the config file:

Hopefully those changes will fix the issue.

I also see that you are using CPU training; this will take quite a bit of time on Wikidata, as CPU computation will be the bottleneck. If you're interested, I can suggest some config options that will speed up training if you share some further info, e.g. your CPU core count and available memory.

dlauc commented 3 years ago

@JasonMoho thank you so much for the detailed instructions. I've corrected the errors in the config file, updated the codebase, and redone the preprocessing, but the error still occurs:

Traceback (most recent call last):
  File "/projekti/venv37/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "/projekti/venv37/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
ValueError: cannot create std::vector larger than max_size()

My use case is entity matching, and my plan is to use Wikidata entity embeddings for similarity estimation. I've tried the FB BigGraph pre-trained embeddings, but they perform worse than my baseline (simple Jaccard similarity on the first-level links). I know training will be slow on CPU (I have 64 cores and 256 GB RAM), but I'll play with the parameters to finish training in a reasonable time.
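
For reference, the Jaccard baseline is just set overlap over each entity's first-level links, along the lines of the minimal Python sketch below (the neighbor sets here are only illustrative, not real data from my pipeline):

def jaccard(a, b):
    """Jaccard similarity between two neighbor sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Map each entity to the set of entities it links to (one hop in the graph).
neighbors = {
    "Q31": {"Q1088364", "Q3247091", "Q1308013"},
    "Q142": {"Q1088364", "Q3247091", "Q90"},
}

print(jaccard(neighbors["Q31"], neighbors["Q142"]))  # 2 shared / 4 total = 0.5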

JasonMoho commented 3 years ago

Today I will update the system to include better debug information. Once that's ready, you can run again in debug mode and that should help us isolate this issue.

JasonMoho commented 3 years ago

A few things for you:

Config:

[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390

[model]
embedding_size=128 // Increasing this gives better accuracy but slower training.

[storage]
edges_backend=HostMemory // Size of edges is only ~15GB to store, fits in CPU memory
embeddings_backend=HostMemory // Size of embedding table + optimizer state is only ~94GB, fits in CPU memory
relations_backend=HostMemory

[training]
batch_size=10000 
num_chunks=10
negatives=500 // Increasing the number of negative samples gives better accuracy but slower training.
num_epochs=10 // On large datasets, these models converge close to peak accuracy at about 3-10 epochs in my experience
synchronous=false

// Some pipeline settings to best utilize your 64 cores. 
[training_pipeline]
max_batches_in_flight=64 
num_embedding_loader_threads=8
num_compute_threads=16
num_embedding_update_threads=8

[evaluation]
batch_size=1000 
max_batches_in_flight=64 
num_embedding_loader_threads=8
num_evaluate_threads=32

[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin
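
To train with it, save the config to a file and pass that file to marius_train; the filename below is just a placeholder:

marius_train wikidata_cpu.ini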

dlauc commented 3 years ago

@JasonMoho thank you so much, the training is working now. I'll share the embeddings if they turn out to be any good.

DominikFilipiak commented 2 years ago

@dlauc were the embeddings any good? If so, would you like to share them?

dlauc commented 2 years ago

@DominikFilipiak, unfortunately not; my baseline (Jaccard) was better, so I've not kept the embeddings.

JasonMoho commented 2 years ago

@dlauc Interesting that these embeddings perform worse than the Jaccard baseline. What evaluation scenario and metrics did you use to determine this?

We've recently added some updates to the system which improve the quality of the learned embeddings. Namely, support for GNN models, node features, and sampling schemes which reduce the biases induced by partitioned training.

I'd like to see if I can reproduce your evaluation scenario on wikikg90m and possibly get some embeddings which outperform the Jaccard baseline. If so, I can make the embeddings or the training configuration publicly available depending on the hosting costs.

thodrek commented 2 years ago

@dlauc which model did you use with Marius for learning these embeddings?