nicholas-leonard / equanimity

Experimental research for distributed conditional computation

Dataset Format for Deep Learning #42

Closed. nicholas-leonard closed this issue 10 years ago

nicholas-leonard commented 10 years ago

How do we store and manipulate such a huge dataset?

Option 1 : document tensor

Store the dataset as a 1-dim torch.IntTensor. We build a batch by indexing n-grams with a series of calls to torch.narrow(). We iterate through each n-gram, replacing the context values that fall before the last sentence-boundary marker with that marker, so contexts do not leak across sentence boundaries. Each expert dataset is represented as a set of indices (the index of the target word). The mapper shuffles these around.
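Roughly, batch construction under this option could look like the sketch below; the corpus values, context size and target indices are toy placeholders, not the real dataset.

```lua
-- Minimal sketch of Option 1 (toy values): one 1-dim corpus tensor,
-- n-gram contexts sliced out with narrow() around each target word.
require 'torch'

local corpus = torch.IntTensor{2, 5, 9, 4, 1, 2, 7, 3, 8, 6}  -- toy word ids
local contextSize = 4                                          -- assumed n-gram size
local targetIndices = {5, 7, 9}                                -- one expert's target-word indices

local batch = torch.IntTensor(#targetIndices, contextSize)
for i, t in ipairs(targetIndices) do
   -- narrow(dim, firstIndex, size): the contextSize values ending at the target
   batch[i]:copy(corpus:narrow(1, t - contextSize + 1, contextSize))
end
-- (masking context values from the previous sentence with the marker is omitted here)
```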

Option 2 : table of sentence tensors

Store the dataset as a table of 1-dim torch.IntTensors, one per sentence. We build a batch by indexing a sentence and then its n-gram. The issue with this solution is that the table may not fit within the 1GB limit of the LuaJIT interpreter. Each expert dataset is represented as a set of pairs (sentence index, target word index). The mapper shuffles these around.
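As a rough sketch (toy values, hypothetical names), batching under this option indexes the sentence first and then narrows out the n-gram:

```lua
-- Minimal sketch of Option 2 (toy values): a Lua table with one 1-dim
-- IntTensor per sentence; an expert's dataset is a list of
-- (sentence index, target-word index) pairs.
require 'torch'

local sentences = {
   torch.IntTensor{4, 9, 2, 7, 3},   -- sentence 1
   torch.IntTensor{5, 5, 8, 2},      -- sentence 2
}
local contextSize = 3                -- assumed n-gram size
local expertPairs = { {1, 4}, {2, 3}, {1, 5} }

local batch = torch.IntTensor(#expertPairs, contextSize)
for i, p in ipairs(expertPairs) do
   local sentence, target = sentences[p[1]], p[2]
   batch[i]:copy(sentence:narrow(1, target - contextSize + 1, contextSize))
end
```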

Option 3 : C table of sentences dataset

We implement Option 2 entirely in C, so the batch is built as an array of pointers to arrays. This solution would bypass the interpreter's memory limits. The only issue is that we would need a way to store it on disk.
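The option as written is a pure C implementation; purely as an illustration of the same "array of pointers to arrays" layout living outside the interpreter heap, here is a LuaJIT FFI sketch that allocates the storage with malloc (names and toy values are assumptions, not the actual design):

```lua
-- Sketch of the 'array of pointers to arrays' layout, allocated with
-- malloc through the LuaJIT FFI so it lives outside the interpreter's
-- managed heap (toy values; the real thing would also need to be freed
-- and serialized to disk).
local ffi = require 'ffi'
ffi.cdef[[
void* malloc(size_t size);
void free(void* ptr);
]]

local toy = { {4, 9, 2}, {5, 5, 8, 2}, {7, 3} }
local nSentences = #toy

local sentences = ffi.cast('int**', ffi.C.malloc(nSentences * ffi.sizeof('int*')))
local lengths   = ffi.cast('int*',  ffi.C.malloc(nSentences * ffi.sizeof('int')))

for s = 1, nSentences do
   local n = #toy[s]
   lengths[s-1] = n
   sentences[s-1] = ffi.cast('int*', ffi.C.malloc(n * ffi.sizeof('int')))
   for w = 1, n do
      sentences[s-1][w-1] = toy[s][w]  -- C arrays are 0-based
   end
end

-- read back the second sentence
for w = 0, lengths[1] - 1 do
   io.write(sentences[1][w], ' ')
end
io.write('\n')
```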

Option 3A : postgreSQL storage

Use PostgreSQL to store the dataset and libpqtypes to connect to it. The database would hold a table where each row is a sentence stored as an array of integers (word ids).
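Purely to make the row layout concrete, the table could be one integer-array column per sentence, as in the sketch below. Reading it back with LuaSQL here is my own assumption for illustration; the proposal above names libpqtypes (a C client library), not LuaSQL.

```lua
-- Assumed schema (one row per sentence, words as an integer array):
--   CREATE TABLE sentences (
--      sentence_id SERIAL PRIMARY KEY,
--      word_ids    INTEGER[]
--   );
-- Reading it back via LuaSQL is an illustration only; connection
-- parameters are placeholders.
local luasql = require 'luasql.postgres'
require 'torch'

local env = luasql.postgres()
local con = env:connect('mydb', 'myuser', 'mypassword', 'localhost')

local cur = con:execute('SELECT word_ids FROM sentences ORDER BY sentence_id')
local sentences = {}
local row = cur:fetch({}, 'a')
while row do
   -- integer arrays arrive as strings like '{4,9,2,7}'; parse into a tensor
   local words = {}
   for id in row.word_ids:gmatch('%d+') do
      table.insert(words, tonumber(id))
   end
   table.insert(sentences, torch.IntTensor(words))
   row = cur:fetch(row, 'a')
end
cur:close(); con:close(); env:close()
```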

Option 3B : file system storage

Parse files of integers, where each line is a sentence, and each sentence is a sequence of bytes.
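Assuming the file is plain text with one sentence per line and whitespace-separated integer word ids (the exact byte encoding above is left open), parsing could look like:

```lua
-- Sketch: parse a plain-text file where each line is a sentence of
-- whitespace-separated integer word ids. 'corpus.txt' and the text
-- encoding are assumptions.
require 'torch'

local sentences = {}
for line in io.lines('corpus.txt') do
   local words = {}
   for id in line:gmatch('%d+') do
      table.insert(words, tonumber(id))
   end
   if #words > 0 then
      table.insert(sentences, torch.IntTensor(words))
   end
end
print(#sentences .. ' sentences loaded')
```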

nicholas-leonard commented 10 years ago

I tried Option 2. It doesn't fit in memory (LuaJIT has a memory limit).

nicholas-leonard commented 10 years ago

Made a script, https://github.com/nicholas-leonard/equanimity/blob/master/nlp/postgres2torch.lua, to convert the SQL table datasets into more manageable serialized torch.Tensors. We have two tensors: one is the 1-dim corpus tensor specified in Option 1, and the other, twice as large, specifies the start and stop (torch.sub) indices of each n-gram. These n-grams may be shorter than the context size, since each sentence in the corpus is only separated by markers.
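To illustrate the two-tensor layout with made-up values (the real tensors come from postgres2torch.lua), the second tensor holds one (start, stop) row per n-gram, and a context is recovered with Tensor:sub on the corpus tensor:

```lua
-- Toy illustration of the two-tensor format (values are made up):
-- a 1-dim corpus tensor plus an (nNgrams x 2) tensor of start/stop
-- indices, sliced back out with Tensor:sub().
require 'torch'

local corpus = torch.IntTensor{1, 4, 9, 2, 1, 5, 5, 8, 2, 1}  -- 1 = sentence marker (assumption)

local ranges = torch.IntTensor{
   {2, 4},   -- context for the target at index 4 (shortened by the sentence start)
   {6, 8},   -- context for the target at index 8
   {6, 9},   -- context for the target at index 9
}

for i = 1, ranges:size(1) do
   local ngram = corpus:sub(ranges[i][1], ranges[i][2])
   print(i, ngram)
end
```

Both tensors can then be written and read back with torch.save and torch.load.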