haseebs / OWE

Pytorch code for An Open-World Extension to Knowledge Graph Completion Models (AAAI 2019)
https://aaai.org/ojs/index.php/AAAI/article/view/4162

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #13

Closed ren-1247 closed 2 years ago

ren-1247 commented 3 years ago

Hi haseebs, I downloaded enwiki_20180420_300d.pkl.bz2 from the Wikipedia2Vec downloads. Whether I set PretrainedEmbeddingFile = ./data/enwiki_20180420_300d.pkl.bz2 or = ./data/enwiki_20180420_300d.pkl, I run into the following problem when running the code. How can I solve it?

(venv) gzdx@2080:/home/OLD/rhl/OWE$ python owe/run_open_world.py -t -c ./data/FB15k-237-zeroshot -d /home/OLD/rhl/OWE/ --complex ./embeddings/
19:20:05: INFO: Git Hash: 7735eb5da1d21ea47132b6c503f645d7f5b3c2e0
19:20:05: INFO: Reading config from: /home/OLD/rhl/OWE/config.ini
19:20:05: INFO: Using entity2wikidata.json as wikidata file
data/FB15k-237-zeroshot/train.txt data/FB15k-237-zeroshot/valid_zero.txt data/FB15k-237-zeroshot/train.txt data/FB15k-237-zeroshot/valid_zero.txt
19:20:05: INFO: 12324 distinct entities in train having 235 relations (242489 triples).
19:20:05: INFO: 6038 distinct entities in validation having 220 relations (9424 triples).
19:20:05: INFO: 8897 distinct entities in test having 224 relations (22393 triples).
19:20:05: INFO: Working with: 14405 distinct entities having 235 relations.
19:20:05: INFO: Converting entities...
19:20:06: INFO: Building Vocab...
19:20:06: INFO: Building triples...
19:20:17: INFO: Loading word vectors from: ./data/enwiki_20180420_300d.pkl.bz2...
Traceback (most recent call last):
  File "owe/run_open_world.py", line 164, in <module>
    main()
  File "owe/run_open_world.py", line 99, in main
    word_vectors = data.load_embedding_file(Config.get('PretrainedEmbeddingFile'))
  File "/home/OLD/rhl/OWE/owe/data.py", line 67, in load_embedding_file
    return KeyedVectors.load_word2vec_format(embedding_file, binary=not embedding_file.endswith(".txt"))
  File "/home/OLD/rhl/venv/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1547, in load_word2vec_format
    return _load_word2vec_format(
  File "/home/OLD/rhl/venv/lib/python3.8/site-packages/gensim/models/utils_any2vec.py", line 276, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/home/OLD/rhl/venv/lib/python3.8/site-packages/gensim/utils.py", line 368, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
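A likely explanation for this error, judging from the traceback: owe/data.py calls gensim with binary=not embedding_file.endswith(".txt"), so any file not ending in .txt is parsed as a binary word2vec file. The wiki2vec .pkl downloads are Python pickles rather than word2vec files, and a pickle stream typically begins with the byte 0x80, which is exactly the byte the UTF-8 decoder rejects above. A quick way to check what kind of file you have (a sketch; the path is just an example):

import bz2

# Peek at the first bytes of the download. A pickle stream usually starts with
# b'\x80', while a word2vec text file starts with an ASCII header line giving
# the vocabulary size and vector dimension.
with bz2.open('./data/enwiki_20180420_300d.pkl.bz2', 'rb') as f:
    print(f.read(8))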

haseebs commented 3 years ago

Try converting the pkl to bin using something like this:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('filename.pkl', binary=False)
model.save_word2vec_format('filename.bin', binary=True)
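Note that load_word2vec_format only understands the word2vec text and binary formats, not the wiki2vec pickle, so a conversion like the above applies to the .txt download rather than the .pkl one. A sketch of that route (file names are examples), which also matches what was eventually done further down in this thread:

from gensim.models.keyedvectors import KeyedVectors

# Read the plain-text word2vec file (the .txt download from wiki2vec) and
# re-save it as a binary word2vec file, which loads considerably faster.
model = KeyedVectors.load_word2vec_format('enwiki_20180420_300d.txt', binary=False)
model.save_word2vec_format('enwiki_20180420_300d.bin', binary=True)
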
ren-1247 commented 3 years ago

Try converting the pkl to bin using something like this:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('filename.pkl', binary=False)
model.save_word2vec_format('filename.bin', binary=True)

Thank you very much for your reply, I will keep trying. In addition, I would like to ask the following question: averaging word embeddings works better on short text descriptions than on long ones. Does that mean an LSTM or CNN performs worse on long descriptions than simple averaging? If so, what do you think is the main reason?

haseebs commented 3 years ago

Try converting the pkl to bin using something like this:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('filename.pkl', binary=False)
model.save_word2vec_format('filename.bin', binary=True)

Thank you very much for your reply, I will keep trying. In addition, I would like to ask the following question: averaging word embeddings works better on short text descriptions than on long ones. Does that mean an LSTM or CNN performs worse on long descriptions than simple averaging? If so, what do you think is the main reason?

We did test this, but I am not sure what the reason is. It is likely that the range of architectures/hyperparameters we tried for the LSTMs/CNNs was simply not good, and you might have better luck.

Generally, we've found that bigger and more complicated architectures performed worse than simply averaging and applying an affine mapping. Even when BERT was used, there was an improvement, but it wasn't significant enough to be worth it, since training a full-size BERT takes much longer than training a simple 300x300 matrix.

However, it is not the case that training more parameters always results in negligible or worse gains. For example, the follow-up work showed that training a separate matrix for each relation resulted in a big improvement.
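
To make the "averaging plus affine mapping" idea concrete, here is a minimal PyTorch sketch of such an encoder (an illustration only, not the repository's actual class; the names and shapes are assumptions):

import torch
import torch.nn as nn

class AvgAffineEncoder(nn.Module):
    """Average the word embeddings of an entity's name/description, then map
    the result into the KGC embedding space with one learned affine transform."""
    def __init__(self, emb_dim=300):
        super().__init__()
        self.affine = nn.Linear(emb_dim, emb_dim)  # the "300x300 matrix" plus a bias

    def forward(self, word_embs, mask):
        # word_embs: (batch, seq_len, emb_dim) word vectors of the description tokens
        # mask:      (batch, seq_len) float tensor, 1.0 for real tokens, 0.0 for padding
        summed = (word_embs * mask.unsqueeze(-1)).sum(dim=1)
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        return self.affine(summed / lengths)

encoder = AvgAffineEncoder()
words = torch.randn(2, 7, 300)     # toy batch: 2 entities, 7 tokens each
mask = torch.ones(2, 7)            # all tokens valid in this toy example
entity_emb = encoder(words, mask)  # shape (2, 300)

Swapping the averaging step for an LSTM or CNN over the same token embeddings changes only the encoder part; the learned mapping into the graph embedding space stays the same.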

ren-1247 commented 3 years ago

I got it running after converting the word vector file from txt to bin format, but there are still some problems:

① The results are not as good as those in Table 3. Whether BatchSize is 256 or 128 in config.ini, and whether UseTargetFilteringshi is true or false, the difference from Table 3 is about 10%. What parameter configuration produces the results in Table 3?

② Early stopping kicks in at epoch 101.

③ At epoch 0 the Hits@k values are very small; at epoch 1 they become very large and then barely change. Is this normal?

The results with batch size 128, ComplEx-OWE-300 and FB15k-237-OWE are as follows:

09:47:51: INFO: Reading config from: results/config.ini
09:47:51: INFO: Using entity2wikidata.json as wikidata file
09:47:52: INFO: 12324 distinct entities in train having 235 relations (242489 triples).
09:47:52: INFO: 6038 distinct entities in validation having 220 relations (9424 triples).
09:47:52: INFO: 8897 distinct entities in test having 224 relations (22393 triples).
09:47:52: INFO: Working with: 14405 distinct entities having 235 relations.
09:47:52: INFO: Converting entities...
09:47:52: INFO: Building Vocab...
09:47:52: INFO: Building triples...
09:48:07: INFO: Loading word vectors from: ./enwiki_20180420_300d.bin...
09:48:38: INFO: Building embedding matrix
09:48:38: INFO: Loading word vectors for entities...
09:48:41: INFO: Matched entities with 'ID': 0
09:48:41: INFO: Matched entities with 'ENTITIY/ID' (wiki2vec notation): 0
09:48:41: INFO: Matched entities with augmented phrase as wiki2vec entity (level:count): {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
09:48:41: INFO: Matched entities with augmented phrase (level:count): {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
09:48:41: INFO: Unmatched entities: 0
09:48:41: INFO: Created word embedding with shape: torch.Size([18804, 300])
/home/OLD/rhl/OWE/owe/data.py:298: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  self.vectors[i] = torch.tensor(vectors[t])
09:48:42: INFO: LinkPredictionModelType: ComplEx
09:48:42: INFO: Initialized ComplEx model with 12324 entities, 235 relations embedded with 300 dimensions
09:48:42: INFO: Loading pretrained embeddings from embeddings into ComplEx model
09:48:42: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
09:48:42: INFO: Loaded 12324/12324 rows.
09:48:42: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
09:48:42: INFO: Loaded 12324/12324 rows.
09:48:42: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
09:48:42: INFO: Loaded 235/235 rows.
09:48:42: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
09:48:42: INFO: Loaded 235/235 rows.
09:48:42: WARNING: Config setting: 'cuda' not found. Returning None!
09:48:46: INFO: Using averaging encoder
09:48:46: INFO: Using Affine transformation
09:48:46: INFO: Starting evaluation on validation set.
09:48:46: INFO: Performing evaluation without using the transformation
Evaluation: : 118it [00:29, 4.02it/s]
09:49:16: INFO: [Eval: validation] Epoch: 0
09:49:16: INFO: [Eval: validation] 0.77 Hits@1 (%)
09:49:16: INFO: [Eval: validation] 1.27 Hits@3 (%)
09:49:16: INFO: [Eval: validation] 2.15 Hits@10 (%)
09:49:16: INFO: [Eval: validation] 1.30 MRR (filtered) (%)
09:49:16: INFO: [Eval: validation] 1.27 MRR (raw) (%)
09:49:16: INFO: [Eval: validation] Mean rank: 5622
09:49:16: INFO: [Eval: validation] Mean rank raw: 5635
Train: : 948it [00:39, 24.19it/s]
09:49:55: INFO: At epoch 1. Train Loss: 6.656136347271722
09:49:55: INFO: Starting evaluation on validation set.
Evaluation: : 118it [00:29, 4.01it/s]
09:50:24: INFO: [Eval: validation] Epoch: 1
09:50:24: INFO: [Eval: validation] 26.60 Hits@1 (%)
09:50:24: INFO: [Eval: validation] 36.86 Hits@3 (%)
09:50:24: INFO: [Eval: validation] 47.26 Hits@10 (%)
09:50:24: INFO: [Eval: validation] 33.75 MRR (filtered) (%)
09:50:24: INFO: [Eval: validation] 25.17 MRR (raw) (%)
09:50:24: INFO: [Eval: validation] Mean rank: 668
09:50:24: INFO: [Eval: validation] Mean rank raw: 681
09:50:24: INFO: Saving checkpoint to results/checkpoint.OWE.pth.tar.
09:50:26: INFO: Saved best checkpoint to results/best_checkpoint.OWE.pth.tar.
Train: : 948it [00:38, 24.45it/s]
09:51:05: INFO: At epoch 2. Train Loss: 5.626452792545914
......
11:45:03: INFO: [Eval: validation] Epoch: 101
11:45:03: INFO: [Eval: validation] 23.97 Hits@1 (%)
11:45:03: INFO: [Eval: validation] 34.37 Hits@3 (%)
11:45:03: INFO: [Eval: validation] 44.56 Hits@10 (%)
11:45:03: INFO: [Eval: validation] 31.15 MRR (filtered) (%)
11:45:03: INFO: [Eval: validation] 23.69 MRR (raw) (%)
11:45:03: INFO: [Eval: validation] Mean rank: 728
11:45:03: INFO: [Eval: validation] Mean rank raw: 742
11:45:03: INFO: Saving checkpoint to results/checkpoint.OWE.pth.tar.
11:45:04: INFO: [EarlyStopping] Improvement of MRR over the last 10 epochs less than 0.001. Stopping training!

Results of the test:
12:02:58: INFO: LinkPredictionModelType: ComplEx
12:02:58: INFO: Initialized ComplEx model with 12324 entities, 235 relations embedded with 300 dimensions
12:02:58: INFO: Loading pretrained embeddings from embeddings into ComplEx model
12:02:59: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
12:02:59: INFO: Loaded 12324/12324 rows.
12:02:59: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
12:02:59: INFO: Loaded 12324/12324 rows.
12:02:59: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
12:02:59: INFO: Loaded 235/235 rows.
12:02:59: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
12:02:59: INFO: Loaded 235/235 rows.
12:02:59: WARNING: Config setting: 'cuda' not found. Returning None!
12:03:05: INFO: Using averaging encoder
12:03:05: INFO: Using Affine transformation
12:03:05: INFO: Loading checkpoint results/best_checkpoint.OWE.pth.tar.
12:03:06: INFO: Done loading checkpoint from epoch 3.
12:03:06: INFO: Initialized OWE model, mapper and optimizer from the loaded checkpoint.
12:03:06: INFO: Starting evaluation on test set.
Evaluation: : 280it [01:08, 4.06it/s]
12:04:14: INFO: [Eval: test] Epoch: 3
12:04:14: INFO: [Eval: test] 27.99 Hits@1 (%)
12:04:14: INFO: [Eval: test] 38.82 Hits@3 (%)
12:04:14: INFO: [Eval: test] 49.19 Hits@10 (%)
12:04:14: INFO: [Eval: test] 35.39 MRR (filtered) (%)
12:04:14: INFO: [Eval: test] 26.16 MRR (raw) (%)
12:04:14: INFO: [Eval: test] Mean rank: 550
12:04:14: INFO: [Eval: test] Mean rank raw: 563
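The [EarlyStopping] line above corresponds to a patience-style criterion on the validation MRR, roughly along the lines of the sketch below (inferred from the log message, not taken from the repository's code):

def should_stop(mrr_history, patience=10, min_delta=0.001):
    # Stop once the best validation MRR of the last `patience` epochs improves on
    # the best value seen before that window by less than `min_delta`.
    if len(mrr_history) <= patience:
        return False
    return max(mrr_history[-patience:]) - max(mrr_history[:-patience]) < min_delta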

haseebs commented 3 years ago

You've obtained the results for Table 4 with this config. If you set UseTargetFilteringShi=True, then you should be able to obtain the results from Table 3.

ren-1247 commented 3 years ago

Haseebs, thank you very much for your answer. But when I set UseTargetFilteringshi=True, I still could not reach the results shown in Table 3; the difference is about 5%. Is this normal? Could it be related to the fact that I only use one GPU, or that the embedding file was converted from txt to bin format? The test results are as follows:

13:39:03: INFO: Reading config from: results/config.ini
13:39:03: INFO: Using entity2wikidata.json as wikidata file
13:39:04: INFO: 12324 distinct entities in train having 235 relations (242489 triples).
13:39:04: INFO: 6038 distinct entities in validation having 220 relations (9424 triples).
13:39:04: INFO: 8897 distinct entities in test having 224 relations (22393 triples).
13:39:04: INFO: Working with: 14405 distinct entities having 235 relations.
13:39:04: INFO: Converting entities...
13:39:04: INFO: Building Vocab...
13:39:04: INFO: Building triples...
13:39:18: INFO: Loading word vectors from: ./enwiki_20180420_300d.bin...
13:39:46: INFO: Building embedding matrix
13:39:46: INFO: Loading word vectors for entities...
13:39:48: INFO: Matched entities with 'ID': 0
13:39:48: INFO: Matched entities with 'ENTITIY/ID' (wiki2vec notation): 0
13:39:48: INFO: Matched entities with augmented phrase as wiki2vec entity (level:count): {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
13:39:48: INFO: Matched entities with augmented phrase (level:count): {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
13:39:48: INFO: Unmatched entities: 0
13:39:48: INFO: Created word embedding with shape: torch.Size([18804, 300])
/home/OLD/rhl/OWE/owe/data.py:298: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  self.vectors[i] = torch.tensor(vectors[t])
13:39:50: INFO: LinkPredictionModelType: ComplEx
13:39:50: INFO: Initialized ComplEx model with 12324 entities, 235 relations embedded with 300 dimensions
13:39:50: INFO: Loading pretrained embeddings from embeddings into ComplEx model
13:39:50: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
13:39:50: INFO: Loaded 12324/12324 rows.
13:39:50: INFO: Loading from emb ((12324, 300)) to our emb (torch.Size([12324, 300]))
13:39:50: INFO: Loaded 12324/12324 rows.
13:39:50: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
13:39:50: INFO: Loaded 235/235 rows.
13:39:50: INFO: Loading from emb ((235, 300)) to our emb (torch.Size([235, 300]))
13:39:50: INFO: Loaded 235/235 rows.
13:39:50: WARNING: Config setting: 'cuda' not found. Returning None!
13:39:54: INFO: Using averaging encoder
13:39:54: INFO: Using Affine transformation
13:39:54: INFO: Loading checkpoint results/best_checkpoint.OWE.pth.tar.
13:39:54: INFO: Done loading checkpoint from epoch 3.
13:39:54: INFO: Initialized OWE model, mapper and optimizer from the loaded checkpoint.
13:39:54: INFO: Starting evaluation on test set.
Evaluation: : 280it [01:08, 4.10it/s]
13:41:02: INFO: [Eval: test] Epoch: 3
13:41:02: INFO: [Eval: test] 28.13 Hits@1 (%)
13:41:02: INFO: [Eval: test] 38.75 Hits@3 (%)
13:41:02: INFO: [Eval: test] 49.24 Hits@10 (%)
13:41:02: INFO: [Eval: test] 35.46 MRR (filtered) (%)
13:41:02: INFO: [Eval: test] 26.22 MRR (raw) (%)
13:41:02: INFO: [Eval: test] Mean rank: 558
13:41:02: INFO: [Eval: test] Mean rank raw: 571

haseebs commented 3 years ago

Are you saying that you are getting the same results whether UseTargetFilteringShi is True or False? Make sure that it is spelled correctly in the config (Shi not shi) and is in the proper place. Additionally, you can print its value in the code to make sure that it is actually True. You should be getting around 40% filtered MRR when it is True and around 35% when False.
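For context, "target filtering" here (the Shi setting) means that for a query (head, relation, ?) only entities that appeared as a tail of that relation somewhere in the training triples are ranked as candidates, which is why enabling it raises the filtered MRR. A rough sketch of the idea (illustrative only; the function names are made up and this is not the repository's implementation):

from collections import defaultdict

def build_target_sets(train_triples):
    """Map each relation to the set of entities observed as its tail in training."""
    targets = defaultdict(set)
    for head, rel, tail in train_triples:
        targets[rel].add(tail)
    return targets

def candidate_tails(rel, all_entities, targets, use_target_filtering=True):
    """With target filtering, rank only entities seen with rel; otherwise rank all."""
    if use_target_filtering:
        return [e for e in all_entities if e in targets[rel]]
    return list(all_entities)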

Also, I just noticed your other questions. Yes, it is normal for the model to finish training in fewer than 5 epochs with the given learning rate on the FB15k-237-OWE dataset.