Closed (jtoy closed this issue 7 years ago)
It looks like X is explained explicitly in the trainer README, but I cannot figure out what C (the "source sentences") is either. In the trainer README, the author writes:
"Suppose that you have a list of strings available for training, where the contents of the entries are contiguous (so the (i+1)th entry is the sentence that follows the i-th entry). As an example, you can download our BookCorpus dataset, which was used for training the models available on the main page. Let's call this list X. Note that each string should already be tokenized (so that split() will return the desired tokens)."
Any help on C would be appreciated, @ryankiros.
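For what it's worth, the README's description of X can be sketched as follows. This is only an illustration, not code from the repository: the `tokenize` helper and its regex are assumptions standing in for whatever tokenizer you prefer, and the sample sentences are made up. The only requirements taken from the README are that the entries are contiguous and that `split()` returns the desired tokens.

```python
import re

def tokenize(sentence):
    # Hypothetical helper: separate punctuation from words with spaces,
    # so that str.split() later returns each token on its own.
    return " ".join(re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence))

# Sentences kept in reading order, so entry i+1 follows entry i.
raw_sentences = [
    "The ship left at dawn.",
    "Nobody saw it go, not even the captain's dog.",
]

# X is a plain list of pre-tokenized strings.
X = [tokenize(s) for s in raw_sentences]
print(X[0].split())
```

A regex is used here only to keep the sketch free of external dependencies; any tokenizer that produces space-separated tokens should serve the same purpose.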
My thought is this: from the decoding perspective, you need two sentences as inputs, the conditioning sentence and the decoder input sentence, plus another sentence as the target (the decoder output sentence).
For example, in a translation problem (English -> French), X is a French sentence (the target sentence) and C is an English sentence (the conditioning source sentence, which also serves as the decoder input sentence), so C is used twice.
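Under this reading, X and C would be two parallel lists. A toy sketch of the English -> French case is below. The sentences are invented, and the assumption that `C[i]` is the source for `X[i]` is an interpretation of the README's wording, not something confirmed in this thread.

```python
# Toy parallel data for the English -> French example above.
C = ["the cat sleeps .", "i like tea ."]      # source (conditioning) sentences
X = ["le chat dort .", "j' aime le thé ."]    # target sentences

# Assumption: the lists are aligned, so C[i] is the source for X[i].
assert len(C) == len(X)

pairs = list(zip(C, X))
for source, target in pairs:
    print(source.split(), "->", target.split())
```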
Thank you for your response, @junfenglx. In the romance novel scenario that this project refers to, what do you see the target and source being?
Hi @ncoronges @jtoy, I believe that for the neuralstoryteller project, X should be the romance novels, C should be the COCO captions, and the model is skipthoughts, but I am not sure. I am currently training my own decoder with this configuration; if it works, I'll let you know. @jtoy, I guess you should supply one sentence per line for both the X and C data. In my opinion, C is the MS COCO caption data set, so your X data should follow the same format. I used "from nltk.tokenize import sent_tokenize" to split my book collection into sentences, and nothing more.
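If the data is stored one sentence per line as suggested, loading it into a list might look like the sketch below. The `load_sentences` helper and the file name are hypothetical, the sample file is written on the fly so the example runs on its own, and the code is written for Python 3 rather than for the repository's own environment.

```python
import os
import tempfile

def load_sentences(path):
    # Hypothetical helper: read one sentence per line, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Write a tiny stand-in file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "x_sentences.txt")
with open(x_path, "w", encoding="utf-8") as f:
    f.write("she closed the door .\n\nhe did not look back .\n")

X = load_sentences(x_path)
print(len(X), X)
```

The same helper could be pointed at a second file to build C, assuming that file follows the same one-sentence-per-line layout.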
If anyone else is having trouble creating a decoder, I've written simplified code to do that here: https://github.com/danielricks/penseur
I'm trying to train a custom decoder. I am using the BookCorpus data with science fiction novels. According to the docs: https://github.com/ryankiros/skip-thoughts/blob/master/decoding/README.md
"We assume that you have two lists of strings available: X which are the target sentences and C which are the source sentences."
What exactly are X and C with these books? I see the BookCorpus data are just text files. I'm not sure what kind of processing I'm supposed to do here to get them into the right format for training.