I tried Option 2. It doesn't fit in memory (LuaJIT has a limit).
Made a script, https://github.com/nicholas-leonard/equanimity/blob/master/nlp/postgres2torch.lua, to convert the SQL table datasets into more manageable serialized torch.Tensors. We have two tensors: one is the 1-dim corpus Tensor specified in Option 1, and the other, twice as large, specifies the start and stop (torch.sub) indices for each n-gram. These n-grams may be smaller than the context size, since each sentence in the corpus is only separated by markers.
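A minimal sketch of how this pair of tensors could be indexed, assuming the index tensor is laid out as one (start, stop) row per n-gram; the file names and variable names below are illustrative, not taken from postgres2torch.lua:

```lua
require 'torch'

-- Illustrative names (assumption): 'corpus' is the 1-dim word-id tensor,
-- 'ranges' holds one (start, stop) pair per n-gram.
local corpus = torch.load('corpus.t7')   -- 1-dim torch.IntTensor of word ids
local ranges = torch.load('ranges.t7')   -- n x 2 torch.IntTensor of indices

-- Pull out the i-th n-gram as a view into the corpus tensor (no copy).
local function getNGram(i)
   local start, stop = ranges[i][1], ranges[i][2]
   return corpus:sub(start, stop)
end

print(getNGram(1))
```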
How do we store and manipulate such a huge dataset?
Option 1 : document tensor
Store the dataset as a 1-dim torch.IntTensor. We build a batch by indexing n-grams using a series of calls to torch.narrow(). We iterate through each n-gram to replace the values before the last … by …. Each expert dataset is represented as a set of indices (the index of the target word). The mapper shuffles these around.
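A rough sketch of Option 1, assuming a fixed context size and that each example is identified by the index of its target word in the corpus tensor (the toy corpus and names are made up for illustration):

```lua
require 'torch'

local contextSize = 5

-- Copy the contextSize words ending at each target index into the batch.
local function fillBatch(corpus, targetIndices, batch)
   for b = 1, targetIndices:size(1) do
      local targetIdx = targetIndices[b]
      local ngram = corpus:narrow(1, targetIdx - contextSize + 1, contextSize)
      batch[b]:copy(ngram)
   end
   return batch
end

local corpus = torch.IntTensor(1000):random(1, 100)  -- toy corpus of word ids
local targets = torch.LongTensor{10, 57, 230}        -- indices of target words
local batch = torch.IntTensor(targets:size(1), contextSize)
fillBatch(corpus, targets, batch)
```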
Option 2 : table of sentence tensors
Store the dataset as a table of 1-dim torch.IntTensor, one per sentence. We build a batch by indexing a sentence and then its n-gram. The issue with this solution is that the table may not fit within the 1GB limit of the LuaJIT interpreter. Each expert dataset is represented as a set of pairs (sentence index, target word index). The mapper shuffles these around.
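A small sketch of Option 2's lookup, assuming an example is addressed by a (sentence index, target word index) pair; the toy sentences and the helper name are illustrative:

```lua
require 'torch'

-- Toy data: one 1-dim torch.IntTensor per sentence.
local sentences = {
   torch.IntTensor{4, 18, 7, 99, 2},
   torch.IntTensor{12, 5, 63},
}
local contextSize = 3

-- An example is a pair (sentence index, target word index).
local function getExample(sentenceIdx, targetIdx)
   local sentence = sentences[sentenceIdx]
   local start = math.max(1, targetIdx - contextSize + 1)
   return sentence:sub(start, targetIdx)
end

print(getExample(1, 4))  -- words 2..4 of the first sentence
```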
Option 3 : C table of sentences dataset
We implement Option 2 entirely in C, so the batch is built as an array of pointers to arrays. This solution would bypass the limits on interpreter memory. The only issue is that we would need a way to store it on disk.
Option 3A : PostgreSQL storage
Use PostgreSQL to store the dataset and libpqtypes to connect to it. The database would be a table of rows (sentences) of arrays of integers (words).
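As a rough Lua-side sketch of Option 3A: the thread proposes libpqtypes (a C library), so this uses LuaSQL instead as a stand-in, and the sentences(id, words INTEGER[]) schema is an assumption:

```lua
require 'torch'
local luasql = require('luasql.postgres')

-- Assumed schema (not from the thread): sentences(id SERIAL, words INTEGER[])
local env = luasql.postgres()
local con = assert(env:connect('corpus', 'user', 'password', 'localhost', 5432))

-- Fetch one sentence; an INTEGER[] column comes back as a string like '{4,18,7}'.
local function getSentence(id)
   local cur = assert(con:execute(
      string.format('SELECT words FROM sentences WHERE id = %d', id)))
   local row = cur:fetch({}, 'a')
   cur:close()
   if not row then return nil end
   local words = {}
   for w in row.words:gmatch('%d+') do
      table.insert(words, tonumber(w))
   end
   return torch.IntTensor(words)
end
```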
Option 3B : file system storage
Parse files of integers, where each line is a sentence, and each sentence is a sequence of bytes.
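A minimal sketch of Option 3B's parser, assuming a plain-text encoding with space-separated word ids per line (the option's "sequence of bytes" could equally mean a binary encoding, which this does not cover); the file name is illustrative:

```lua
require 'torch'

-- Each line of the file is a sentence of space-separated word ids;
-- build one 1-dim torch.IntTensor per sentence.
local function loadSentences(path)
   local sentences = {}
   for line in io.lines(path) do
      local words = {}
      for w in line:gmatch('%d+') do
         table.insert(words, tonumber(w))
      end
      if #words > 0 then
         table.insert(sentences, torch.IntTensor(words))
      end
   end
   return sentences
end

local sentences = loadSentences('corpus.txt')
print(#sentences)
```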