Open: budhiraja opened 8 years ago
If I have a corpus of documents, each consisting of some number of sentences, should I put all of these sentences into the matrix X for training the encoder?
The instructions in the README indicate that the (i+1)-th entry is the sentence that follows the i-th sentence. But if we use multiple documents, how do we indicate that one set of contiguous sentences belongs to one document and the next set belongs to another?
Or should we train the encoder on one document at a time, with each matrix corresponding to one document?
Yeah, you can. I've written a method that does just that in penseur_utils.py. Check out the code here: https://github.com/danielricks/penseur
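To make the boundary question concrete, here is a minimal sketch in plain Python. It is not the code from penseur_utils.py, and the function names `flatten_corpus` and `within_document_triplets` are hypothetical. It assumes the trainer builds (previous, current, next) triples from adjacent entries of X, as the README's description of X suggests, and shows which triples would cross a document boundary if all documents were simply concatenated.

```python
from typing import List, Tuple


def flatten_corpus(documents: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Concatenate all documents into one sentence list X.

    Also returns doc_ids, where doc_ids[i] is the index of the document
    that X[i] came from, so cross-document neighbours can be detected.
    (Hypothetical helper, not part of any library.)
    """
    X, doc_ids = [], []
    for doc_index, sentences in enumerate(documents):
        for sentence in sentences:
            X.append(sentence)
            doc_ids.append(doc_index)
    return X, doc_ids


def within_document_triplets(
    X: List[str], doc_ids: List[int]
) -> List[Tuple[str, str, str]]:
    """Return (previous, current, next) triples that stay inside one document."""
    triplets = []
    for i in range(1, len(X) - 1):
        # Keep the triple only if all three sentences share a document.
        if doc_ids[i - 1] == doc_ids[i] == doc_ids[i + 1]:
            triplets.append((X[i - 1], X[i], X[i + 1]))
    return triplets


# Toy corpus: two documents with three and four sentences.
documents = [
    ["A1.", "A2.", "A3."],
    ["B1.", "B2.", "B3.", "B4."],
]
X, doc_ids = flatten_corpus(documents)
triplets = within_document_triplets(X, doc_ids)
```

With this toy corpus, naive concatenation would yield five adjacent triples, two of which mix sentences from both documents. The filter keeps the three that stay within a single document. Whether a given training script can consume pre-filtered triples, or needs one X per document instead, depends on that script and is not settled by this sketch.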