Closed. dlauc closed this issue 3 years ago.
Not sure what's causing this issue; I'll need some more context to help get you up and running.
Could you send the full program output / stack trace?
Sure, the full stack trace is:
[info] [08/07/21 10:46:37.710] Start preprocessing
Traceback (most recent call last):
File "/projekti/venv37/bin/marius_train", line 8, in
Let's check a few things to isolate the issue:
Try training with another dataset/configuration; this will help us figure out whether it's a build/environment issue. Try the fb15k example in the readme. The steps are as follows:
1. Preprocess the dataset: marius_preprocess output_dir/ --dataset fb15k
2. Create a file config.ini and copy into it the contents of the example https://github.com/marius-team/marius/blob/main/examples/training/configs/fb15k_cpu.ini
3. Train: marius_train config.ini
Make sure to run these steps within the same working directory so that the configuration file paths resolve properly. If everything works, you should see training output displayed in the terminal.
If you encounter no issues with these steps, please send me your full Wikidata configuration file to train and the commands you used for preprocessing and training.
@JasonMoho thanks for your reply. I've already tested all the examples and they work fine (I had some initial problems with Python 3.9 and 3.8, but with 3.7 everything works).
The command I've used for preprocessing is: marius_preprocess ./wiki --files /backup/wikidata/graph.csv --delim "," --dataset_split .001 .001 --num_partitions 16
The input file was parsed from the Wikidata dump: 612,283,024 lines like:
Q31,P1344,Q1088364
Q31,P1151,Q3247091
Q31,P1546,Q1308013
...
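For reference, here is a minimal sketch of reading triples in that CSV form (the sample lines are taken from above; the parsing code is illustrative only and is not Marius's preprocessor):

```python
import csv
from io import StringIO

# Sample Wikidata triples in the CSV form described above:
# head entity, relation (property), tail entity.
raw = "Q31,P1344,Q1088364\nQ31,P1151,Q3247091\nQ31,P1546,Q1308013\n"

triples = [tuple(row) for row in csv.reader(StringIO(raw))]
for head, rel, tail in triples:
    # Wikidata entity IDs start with "Q", property IDs with "P".
    assert head.startswith("Q") and rel.startswith("P") and tail.startswith("Q")

print(triples[0])  # ('Q31', 'P1344', 'Q1088364')
```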
Please find attached the config I've tried for training. The OS is Ubuntu 18.04.5, the Marius codebase is the latest master branch.
[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390

[storage]
num_paritions=16
edges_backend=FlatFile
embeddings_backend=PartitionBuffer
relations_backend=DeviceMemory
num_partitions=1
buffer_capacity=10
prefetching=true

[training]
synchronous=false

[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin
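The counts in the config line up with the preprocessing flags: with 612,283,024 input edges, --dataset_split .001 .001 reserves 0.1% each for validation and test. A quick sketch of the arithmetic (the floor-based rounding is an assumption about the preprocessor, but it matches the numbers above):

```python
total_edges = 612_283_024          # lines in graph.csv
valid_frac = test_frac = 0.001     # from --dataset_split .001 .001

num_valid = int(total_edges * valid_frac)  # floor -> 612,283
num_test = int(total_edges * test_frac)    # floor -> 612,283
num_train = total_edges - num_valid - num_test

print(num_train, num_valid, num_test)  # 611058458 612283 612283
```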
Thanks for sending those! The issue is likely due to the configuration file.
A couple of changes are needed to the config file:
Both num_paritions=16 (note the misspelling) and num_partitions=1 are set in the [storage] section. Remove both and set num_partitions=16.
Since you preprocessed with --num_partitions 16, the file ./wiki/train_edges_partitions.txt should have been created; it keeps track of the number of edges in each edge partition. Set this option in the [path] section: train_edges_partitions=./train_edges_partitions.txt (assuming you are working from the ./wiki directory).
Hopefully those changes will fix the issue.
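Misspelled keys like this are easy to catch by diffing the parsed config against the expected option names. A small sketch with Python's configparser (the expected-key set here is hand-written for illustration, not pulled from Marius):

```python
import configparser
from io import StringIO

# Expected [storage] option names -- an assumption for illustration only.
known_storage_keys = {
    "num_partitions", "edges_backend", "embeddings_backend",
    "relations_backend", "buffer_capacity", "prefetching",
}

config_text = """
[storage]
num_paritions=16
edges_backend=FlatFile
embeddings_backend=PartitionBuffer
relations_backend=DeviceMemory
num_partitions=1
buffer_capacity=10
prefetching=true
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Any key not in the known set was silently ignored by the loader.
unknown = set(parser["storage"]) - known_storage_keys
print(unknown)  # {'num_paritions'}
```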
I also see that you are using CPU training; this will take quite a bit of time on Wikidata, as CPU computation will be the bottleneck. If interested, I can provide some config options that will speed up training if you send some further info: which model you are using (e.g. DistMult) and the embedding size (e.g. d=128).
@JasonMoho thank you so much for the detailed instructions. I've corrected the errors in the config file, updated the codebase, and redone the preprocessing, but the error persists:
Traceback (most recent call last):
File "/projekti/venv37/bin/marius_train", line 8, in
My use case is entity matching; my plan is to use Wikidata entity embeddings for similarity estimation. I've tried the FB BigGraph pre-trained embeddings, but they perform worse than my baseline (simple Jaccard similarity on the first-level links). I know training will be slow on CPU (I have 64 cores and 256GB of RAM), but I will play with the parameters to finish training in a reasonable time.
Today I will update the system to include better debug information. Once that's ready, you can run again in debug mode and that should help us isolate this issue.
A few things for you:
Debug mode is merged. Please install the latest version and try running again. The debugging output will be written to the log file logs/wikidata_debug.log. Or, if you prefer, the debugging information can be printed to the terminal by setting reporting.log_level=debug.
I've created a configuration file attached below which should better utilize your machine. Note that I removed partitioning options from the config, that is because this dataset will fit in the CPU memory of your machine and thus doesn't need to be partitioned. Also, you shouldn't need to preprocess the dataset again for this to run.
I'm still investigating, but there's a good chance the std::vector larger than max_size() issue is coming from the partitioning part of the code path, so the configuration below may provide a workaround.
Config:
[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390
[model]
embedding_size=128 // Increasing this improves accuracy but slows training.
[storage]
edges_backend=HostMemory // Size of edges is only ~15GB to store, fits in CPU memory
embeddings_backend=HostMemory // Size of embedding table + optimizer state is only ~94GB, fits in CPU memory
relations_backend=HostMemory
[training]
batch_size=10000
num_chunks=10
negatives=500 // Increasing the number of negative samples improves accuracy but slows training.
num_epochs=10 // On large datasets, these models converge close to peak accuracy at about 3-10 epochs in my experience
synchronous=false
// Some pipeline settings to best utilize your 64 cores.
[training_pipeline]
max_batches_in_flight=64
num_embedding_loader_threads=8
num_compute_threads=16
num_embedding_update_threads=8
[evaluation]
batch_size=1000
max_batches_in_flight=64
num_embedding_loader_threads=8
num_evaluate_threads=32
[path]
base_directory=data/
train_edges=./train_edges.pt
validation_edges=./valid_edges.pt
test_edges=./test_edges.pt
node_ids=./node_mapping.bin
relations_ids=./rel_mapping.bin
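The size estimates in the comments above check out with quick arithmetic, assuming int64 edge triples, float32 embeddings, and one float of optimizer state per parameter (the exact storage layout is an assumption):

```python
num_train = 611_058_458    # edges, from the config
num_nodes = 91_580_024     # nodes, from the config
embedding_size = 128

edge_bytes = num_train * 3 * 8                 # (head, rel, tail) as int64
emb_bytes = num_nodes * embedding_size * 4     # float32 embedding table
opt_bytes = emb_bytes                          # one float of optimizer state per parameter

print(round(edge_bytes / 1e9, 1), round((emb_bytes + opt_bytes) / 1e9, 1))
# 14.7 93.8  -- matches the "~15GB" and "~94GB" figures in the comments
```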
@JasonMoho thank you so much - the training is working now. I will share the embeddings if they are any good.
@dlauc were the embeddings any good? If so, would you like to share them?
@DominikFilipiak, unfortunately not; my baseline (Jaccard) was better, so I've not kept the embeddings.
@dlauc Interesting that these embeddings perform worse than the Jaccard baseline. What was the evaluation scenario & metrics you used to determine this?
We've recently added some updates to the system which improve the quality of the learned embeddings. Namely, support for GNN models, node features, and sampling schemes which reduce the biases induced by partitioned training.
I'd like to see if I can reproduce your evaluation scenario on wikikg90m and possibly get some embeddings which outperform the Jaccard baseline. If so, I can make the embeddings or the training configuration publicly available depending on the hosting costs.
@dlauc which model did you use over Marius for learning these embeddings?
I'm trying to create embeddings for Wikidata, using this config file:
[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390
...
However, I am getting the error:
ValueError: cannot create std::vector larger than max_size()
Looking for any workaround, thanks