Note: This isn't an issue with the original code. I would like to request help in understanding the errors I get from reducing the dataset.
The Lookup embedder needs all entity nodes in memory. Because OLPBench is so large, holding the entire entity node set may not be possible on standard hardware with around 24 GB of RAM. (Since the LSTM embedder's memory use doesn't depend on the number of nodes, I have been able to run it.)
To this end, I have been looking into how the OKG dataset could be reduced. Knowledge graphs are trickier to subsample because the valid/test sets depend on the entity-mentions available in train. Thus, to reduce the dataset:
I took the first 100k lines of train and retained only those validation triples whose subject or object also appears in train, and similarly for test (sketched below).
A new "entity_id_map" (in the picture below, that's loaded into "entity_id_map_DL_100k") has only these appearing entities, which is just 80,605 entity-mentions. Changing entity's nn.Embedding input dimension manually to this number helps to easily fit in memory and could facilitate faster iterating through ideas or experiments.
-- The base code also reads in the count but doesn't do anything with it, so the counts are simply initialized to 1 in this file.
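For concreteness, here is a rough sketch of that reduction. The file names, the tab-separated triple layout, and the use of pandas are assumptions about my local setup, not anything fixed by the original repository:

```python
import pandas as pd

# Rough sketch of the reduction. File names and the tab-separated
# (subject, relation, object, ...) layout are assumptions about my local
# copies of OLPBench, not something taken from the original repo.
N_TRAIN = 100_000

train = pd.read_csv("train_data.txt", sep="\t", header=None, nrows=N_TRAIN)
train_entities = set(train[0]) | set(train[2])   # subject and object columns

def keep_linked(path):
    """Keep only triples whose subject or object also appears in train."""
    df = pd.read_csv(path, sep="\t", header=None)
    return df[df[0].isin(train_entities) | df[2].isin(train_entities)]

valid = keep_linked("validation_data.txt")
test = keep_linked("test_data.txt")

# Reduced entity_id_map: only the entity-mentions that actually appear
# (80,605 in my case), with the unused counts simply set to 1.
appearing = sorted(train_entities
                   | set(valid[0]) | set(valid[2])
                   | set(test[0]) | set(test[2]))
entity_id_map_DL_100k = pd.DataFrame({"entity_id": appearing, "count": 1})
entity_id_map_DL_100k.to_csv("entity_id_map_DL_100k.txt",
                             sep="\t", index=False, header=False)
```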
Instead of giving the Lookup Embedder a raw entity-mention ID, I take the ID's row number in this (new) map file and use that row number for indexing into the embedder. This addresses the main memory concern, namely the very large maximum ID (I see that the code sizes the embedding directly from the max ID in the map), without having to change the IDs in the other files from scratch.
This is the piece of code I insert just before passing slot_item to the embedding layer:
('e' indicates that I am reducing the IDs only for entities, not for relations.)
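Since the snippet itself only appears as a screenshot, here is a minimal sketch of what it does; the names entity_id_map_DL_100k and e come from my local changes (with the DataFrame layout assumed above), not from the original code:

```python
import torch

# 'e' is True only for entity slots, so relation IDs are left untouched;
# entity_id_map_DL_100k is the reduced map loaded as a pandas DataFrame.
if e:
    rows = []
    for ent_id in slot_item.flatten().tolist():
        try:
            # the row position in the reduced map becomes the embedding index
            row = entity_id_map_DL_100k.index[
                entity_id_map_DL_100k["entity_id"] == ent_id][0]
        except IndexError:
            # my custom message behind error (2) below
            raise KeyError(f"entity id {ent_id} not found in reduced map")
        rows.append(row)
    slot_item = torch.as_tensor(rows, dtype=torch.long,
                                device=slot_item.device).view(slot_item.shape)
```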
However, although training does start and even runs for a few iterations, it eventually hits one of two errors:
1) Slot KeyError
I looked at the relevant code in dataset.py (line 801): slot should be between 0 and 2, so I need help understanding where such a value for slot comes from. (It is also too large to be an entity ID.) Different runs produce different slot errors, though many of them repeat.
2) An element of slot_item is not found in the new map file during the row-number lookup described earlier.
(The last line is my custom try-except error message. The underlying error says there is no axis to index into (the index[0]) because the search returned no rows.)
When I search for the rogue slot_item (here 2201600), I find it neither in my reduced dataset (train/valid/test) nor in my new entity_id map. This is quite confusing, because slot_item was meant to come from the train dataset, so how did it get this ID?
-- Moreover, across my many runs almost all of the "missing" IDs are the last element of slot_item. Could this be a coincidence? The one time the missing element wasn't the last one, it was a 0, which again corresponds to no entity ID in the map.
On the original dataset, one can manually set the input dimension of the Lookup nn.Embedding and, when passing slot_item to it, simply take it modulo that new input dimension. Though obviously erroneous, I see none of the errors above with this workaround.
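For reference, this is roughly what that workaround looks like; the vocabulary size and embedding dimension below are placeholders, not values from the original config:

```python
import torch
import torch.nn as nn

# The (knowingly incorrect) workaround on the full dataset: cap the embedding
# table at an arbitrary size and wrap every ID with modulo, so distinct
# entities collide but no lookup ever goes out of range.
REDUCED_VOCAB = 80_605            # placeholder size, chosen manually
entity_embedder = nn.Embedding(num_embeddings=REDUCED_VOCAB, embedding_dim=256)

slot_item = torch.tensor([2201600, 17, 80604])   # example raw entity IDs
wrapped = slot_item % REDUCED_VOCAB              # wraps IDs into [0, REDUCED_VOCAB)
embeddings = entity_embedder(wrapped)            # shape (3, 256), no index error
```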
Having a reduced dataset makes this work more accessible to those with less compute, which is very important for encouraging further interest in the field. I have been doing my best to get this to work for a long time, so any help in this regard would be highly appreciated!