nicholas-leonard / equanimity

Experimental research for distributed conditional computation

Dataset Format for Deep Learning #42

Closed. nicholas-leonard closed this issue 10 years ago

nicholas-leonard commented 10 years ago

How do we store and manipulate such a huge dataset?

Option 1 : document tensor

Store the dataset as a 1-dim torch.IntTensor. We build a batch by indexing n-grams with a series of calls to torch.narrow(). We iterate through each n-gram, replacing the context values that fall before the last sentence-boundary marker with that marker, so contexts do not leak across sentence boundaries. Each expert dataset is represented as a set of indices (the index of the target word). The mapper shuffles these around.
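Roughly, batch construction under this option could look like the sketch below; the corpus values, context size and target indices are toy placeholders, not the real dataset.

```lua
-- Minimal sketch of Option 1 (toy values): one 1-dim corpus tensor,
-- n-gram contexts sliced out with narrow() around each target word.
require 'torch'

local corpus = torch.IntTensor{2, 5, 9, 4, 1, 2, 7, 3, 8, 6}  -- toy word ids
local contextSize = 4                                          -- assumed n-gram size
local targetIndices = {5, 7, 9}                                -- one expert's target-word indices

local batch = torch.IntTensor(#targetIndices, contextSize)
for i, t in ipairs(targetIndices) do
   -- narrow(dim, firstIndex, size): the contextSize values ending at the target
   batch[i]:copy(corpus:narrow(1, t - contextSize + 1, contextSize))
end
-- (masking context values from the previous sentence with the marker is omitted here)
```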

Option 2 : table of sentence tensors

Store the dataset as a table of 1-dim torch.IntTensors, one per sentence. We build a batch by indexing a sentence and then its n-gram. The issue with this solution is that the table may not fit within the 1GB limit of the LuaJIT interpreter. Each expert dataset is represented as a set of pairs (sentence index, target word index). The mapper shuffles these around.
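As a rough sketch (toy values, hypothetical names), batching under this option indexes the sentence first and then narrows out the n-gram:

```lua
-- Minimal sketch of Option 2 (toy values): a Lua table with one 1-dim
-- IntTensor per sentence; an expert's dataset is a list of
-- (sentence index, target-word index) pairs.
require 'torch'

local sentences = {
   torch.IntTensor{4, 9, 2, 7, 3},   -- sentence 1
   torch.IntTensor{5, 5, 8, 2},      -- sentence 2
}
local contextSize = 3                -- assumed n-gram size
local expertPairs = { {1, 4}, {2, 3}, {1, 5} }

local batch = torch.IntTensor(#expertPairs, contextSize)
for i, p in ipairs(expertPairs) do
   local sentence, target = sentences[p[1]], p[2]
   batch[i]:copy(sentence:narrow(1, target - contextSize + 1, contextSize))
end
```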

Option 3 : C table of sentences dataset

We implement Option 2 entirely in C, so the batch is built as an array of pointers to arrays. This solution would bypass the interpreter's memory limits. The only issue is that we would need a way to store it on disk.
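The option as written is a pure C implementation; purely as an illustration of the same "array of pointers to arrays" layout living outside the interpreter heap, here is a LuaJIT FFI sketch that allocates the storage with malloc (names and toy values are assumptions, not the actual design):

```lua
-- Sketch of the 'array of pointers to arrays' layout, allocated with
-- malloc through the LuaJIT FFI so it lives outside the interpreter's
-- managed heap (toy values; the real thing would also need to be freed
-- and serialized to disk).
local ffi = require 'ffi'
ffi.cdef[[
void* malloc(size_t size);
void free(void* ptr);
]]

local toy = { {4, 9, 2}, {5, 5, 8, 2}, {7, 3} }
local nSentences = #toy

local sentences = ffi.cast('int**', ffi.C.malloc(nSentences * ffi.sizeof('int*')))
local lengths   = ffi.cast('int*',  ffi.C.malloc(nSentences * ffi.sizeof('int')))

for s = 1, nSentences do
   local n = #toy[s]
   lengths[s-1] = n
   sentences[s-1] = ffi.cast('int*', ffi.C.malloc(n * ffi.sizeof('int')))
   for w = 1, n do
      sentences[s-1][w-1] = toy[s][w]  -- C arrays are 0-based
   end
end

-- read back the second sentence
for w = 0, lengths[1] - 1 do
   io.write(sentences[1][w], ' ')
end
io.write('\n')
```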

Option 3A : postgreSQL storage

Use PostgreSQL to store the dataset and libpqtypes to connect to it. The database would hold a table where each row is a sentence stored as an array of integers (word ids).
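Purely to make the row layout concrete, the table could be one integer-array column per sentence, as in the sketch below. Reading it back with LuaSQL here is my own assumption for illustration; the proposal above names libpqtypes (a C client library), not LuaSQL.

```lua
-- Assumed schema (one row per sentence, words as an integer array):
--   CREATE TABLE sentences (
--      sentence_id SERIAL PRIMARY KEY,
--      word_ids    INTEGER[]
--   );
-- Reading it back via LuaSQL is an illustration only; connection
-- parameters are placeholders.
local luasql = require 'luasql.postgres'
require 'torch'

local env = luasql.postgres()
local con = env:connect('mydb', 'myuser', 'mypassword', 'localhost')

local cur = con:execute('SELECT word_ids FROM sentences ORDER BY sentence_id')
local sentences = {}
local row = cur:fetch({}, 'a')
while row do
   -- integer arrays arrive as strings like '{4,9,2,7}'; parse into a tensor
   local words = {}
   for id in row.word_ids:gmatch('%d+') do
      table.insert(words, tonumber(id))
   end
   table.insert(sentences, torch.IntTensor(words))
   row = cur:fetch(row, 'a')
end
cur:close(); con:close(); env:close()
```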

Option 3B : file system storage

Parse files of integers, where each line is a sentence, and each sentence is a sequence of bytes.
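Assuming the file is plain text with one sentence per line and whitespace-separated integer word ids (the exact byte encoding above is left open), parsing could look like:

```lua
-- Sketch: parse a plain-text file where each line is a sentence of
-- whitespace-separated integer word ids. 'corpus.txt' and the text
-- encoding are assumptions.
require 'torch'

local sentences = {}
for line in io.lines('corpus.txt') do
   local words = {}
   for id in line:gmatch('%d+') do
      table.insert(words, tonumber(id))
   end
   if #words > 0 then
      table.insert(sentences, torch.IntTensor(words))
   end
end
print(#sentences .. ' sentences loaded')
```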

nicholas-leonard commented 10 years ago

I tried Option 2. It doesn't fit in memory (LuaJIT has a memory limit).

nicholas-leonard commented 10 years ago

Made a script, https://github.com/nicholas-leonard/equanimity/blob/master/nlp/postgres2torch.lua, to convert the SQL table datasets into more manageable serialized torch.Tensors. We have two tensors: one is the 1-dim corpus tensor specified in Option 1, and the other, twice as large, specifies the start and stop (torch.sub) indices of each n-gram. These n-grams may be shorter than the context size, since each sentence in the corpus is only separated by markers.
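To illustrate the two-tensor layout with made-up values (the real tensors come from postgres2torch.lua), the second tensor holds one (start, stop) row per n-gram, and a context is recovered with Tensor:sub on the corpus tensor:

```lua
-- Toy illustration of the two-tensor format (values are made up):
-- a 1-dim corpus tensor plus an (nNgrams x 2) tensor of start/stop
-- indices, sliced back out with Tensor:sub().
require 'torch'

local corpus = torch.IntTensor{1, 4, 9, 2, 1, 5, 5, 8, 2, 1}  -- 1 = sentence marker (assumption)

local ranges = torch.IntTensor{
   {2, 4},   -- context for the target at index 4 (shortened by the sentence start)
   {6, 8},   -- context for the target at index 8
   {6, 9},   -- context for the target at index 9
}

for i = 1, ranges:size(1) do
   local ngram = corpus:sub(ranges[i][1], ranges[i][2])
   print(i, ngram)
end
```

Both tensors can then be written and read back with torch.save and torch.load.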