samuelbroscheit / open_knowledge_graph_embeddings

Code to train open knowledge graph embeddings and to create a benchmark for open link prediction.
MIT License

Methods to reduce dataset size / number of entities #6

Open · FauzanFarooqui opened this issue 1 year ago

FauzanFarooqui commented 1 year ago

Note: This isn't an issue with the original code. I would like to request help in understanding the errors I get from reducing the dataset.

The Lookup embedder keeps an embedding row for every entity in memory. Because OLPBench is so large, the full entity set may not fit for many users on standard hardware with around 24 GB of RAM. (The LSTM embedder's memory footprint does not grow with the number of entities, so I have been able to run it.)
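For intuition, here is a back-of-the-envelope estimate of why the lookup table alone can exhaust that budget; the entity count and embedding dimension below are illustrative assumptions, not the exact OLPBench configuration:

```python
# Rough memory estimate for a lookup embedding table
# (illustrative numbers, not the exact OLPBench configuration).
num_entities = 10_000_000   # assumed order of magnitude for OLPBench entity mentions
embed_dim = 512             # assumed embedding dimension
bytes_per_float = 4         # float32

table_gb = num_entities * embed_dim * bytes_per_float / 1024**3
# Adam keeps two extra moment buffers per parameter, roughly tripling the footprint.
train_gb = table_gb * 3

print(f"embedding table: {table_gb:.1f} GB, with Adam states: {train_gb:.1f} GB")
```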

To this end, I have been working on how the OKG dataset could be reduced. Knowledge graphs are trickier to subsample because the valid/test sets depend on the entity mentions available in train. Thus, to reduce the dataset, I build a reduced entity map file and remap the original entity IDs onto it by row-indexing.

This is the piece of code I insert just before passing slot_item to the embedding layer (screenshot in the original issue; the 'e' indicates that I remap the IDs only for entities, not for relations).
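Since the screenshot is not reproduced here, a minimal sketch of the kind of remapping the inserted code appears to perform; `kept_entity_ids`, `old2new`, and `remap_entities` are hypothetical names, and the exact integration point depends on the surrounding code:

```python
import torch

# Hypothetical: the original entity IDs that survive the reduction,
# e.g. the rows kept in the reduced entity map file.
kept_entity_ids = [3, 17, 42, 1001]

# Map original IDs to contiguous IDs in the reduced embedding table.
old2new = {old_id: new_id for new_id, old_id in enumerate(kept_entity_ids)}

def remap_entities(slot_item: torch.Tensor) -> torch.Tensor:
    """Remap original entity IDs to reduced IDs just before the embedding lookup."""
    remapped = [old2new[int(old_id)] for old_id in slot_item.tolist()]
    return torch.tensor(remapped, dtype=torch.long, device=slot_item.device)

# Usage: slot_item = remap_entities(slot_item) before passing it to nn.Embedding.
print(remap_entities(torch.tensor([17, 1001])))  # tensor([1, 3])
```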

However, although the training does start and even runs for a few iterations, it eventually encounters one of two errors:

1) Slot KeyError (screenshot). Looking at the relevant code in dataset.py, line 801, slot should be between 0 and 2, so I need help understanding where such a value for slot comes from. (It is also too large to be an entity ID.) Different iterations produce different slot values, though many are repeated.
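In case it helps localize the problem, a debugging sketch (names are mine, not from the repo): wrapping the slot access with a guard like the one below would print the call stack that produces the out-of-range value.

```python
import traceback

VALID_SLOTS = {0, 1, 2}  # per dataset.py line 801, slot should be in this range

def check_slot(slot):
    # Print the call stack when an out-of-range slot appears,
    # so the producer of the bad value can be identified.
    if slot not in VALID_SLOTS:
        print(f"unexpected slot value: {slot!r}")
        traceback.print_stack()
        raise KeyError(slot)
    return slot
```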

2) slot_item not found in the new map file used for the row-indexing described earlier (screenshot; the last line is my custom try-except message). The underlying error says there is no axis to index into (the index[0]) because the search returned no rows.
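The empty-index failure suggests the searched entity ID was dropped during the reduction. A guarded version of the row search, sketched here with a toy NumPy map (the real map file layout may differ), would at least report the offending ID:

```python
import numpy as np

# Hypothetical reduced map: column 0 holds the kept original entity IDs,
# column 1 the corresponding reduced IDs.
reduced_map = np.array([[3, 0], [17, 1], [42, 2], [1001, 3]])

def lookup_reduced_id(original_id: int) -> int:
    # Guarded row search: fail with the offending ID instead of an
    # empty-index error when the entity was dropped during reduction.
    rows = np.nonzero(reduced_map[:, 0] == original_id)[0]
    if rows.size == 0:
        raise KeyError(f"entity id {original_id} is not in the reduced map")
    return int(reduced_map[rows[0], 1])
```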

On the original dataset, one can manually set the input dimension of the Lookup nn.Embedding and take slot_item modulo that new input dimension before passing it in. Although this is obviously incorrect, none of the errors above appear.
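For completeness, a sketch of that modulo workaround, with an assumed reduced vocabulary size; folding IDs this way makes unrelated entities share embedding rows, so it only verifies that the rest of the pipeline runs:

```python
import torch
import torch.nn as nn

reduced_size = 100_000  # assumed reduced vocabulary size
embedding = nn.Embedding(reduced_size, 512)

def lookup_modulo(slot_item: torch.Tensor) -> torch.Tensor:
    # Fold every ID into the reduced range. This never raises an index error,
    # but unrelated entities now share embedding rows, so the resulting
    # embeddings are semantically meaningless.
    return embedding(slot_item % reduced_size)
```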

A reduced dataset would make this work more accessible to those with less compute, which is important for encouraging further interest in the field. I have been trying to get this to work for a long time, so any help in this regard would be highly appreciated!