ryankiros / skip-thoughts

Sent2Vec encoder and training code from the paper "Skip-Thought Vectors"

building a custom decoder #25

Closed jtoy closed 7 years ago

jtoy commented 8 years ago

I'm trying to train a custom decoder. I am using the BookCorpus data with science-fiction novels. According to the docs: https://github.com/ryankiros/skip-thoughts/blob/master/decoding/README.md

"We assume that you have two lists of strings available: X which are the target sentences and C which are the source sentences. "

What exactly are X and C with these books? I see the BookCorpus data are just text files, and I'm not sure what kind of processing I'm supposed to do here to get them into the right format for training.

ncoronges commented 8 years ago

It looks like X is explained explicitly in the trainer README, but I cannot figure out what C (the "source sentences") is either. In the trainer README the author writes:

Suppose that you have a list of strings available for training, where the contents of the entries are contiguous (so the (i+1)th entry is the sentence that follows the i-th entry). As an example, you can download our BookCorpus dataset, which was used for training the models available on the main page. Let's call this list X. Note that each string should already be tokenized (so that split() will return the desired tokens).

Any help on C appreciated @ryankiros .
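For reference, the trainer README's requirement quoted above (contiguous entries, each pre-tokenized so that `split()` yields tokens) can be sketched roughly as follows. The regex tokenizer here is a self-contained stand-in for a real word tokenizer (e.g. NLTK's `word_tokenize`), not the repo's own preprocessing:

```python
import re

def prepare_sentences(raw_sentences):
    """Pre-tokenize sentences so that str.split() later returns the
    desired tokens, as the trainer README requires. A naive regex
    tokenizer stands in for a real one (e.g. nltk.word_tokenize)."""
    prepared = []
    for sent in raw_sentences:
        # Separate punctuation from words; rejoin with single spaces.
        tokens = re.findall(r"\w+|[^\w\s]", sent)
        prepared.append(" ".join(tokens))
    return prepared

# X must be contiguous: X[i+1] is the sentence that follows X[i].
book = ["The ship drifted.", "Nobody spoke.", "Then the alarm sounded."]
X = prepare_sentences(book)
print(X[0].split())  # -> ['The', 'ship', 'drifted', '.']
```

After this step, `X[i].split()` returns clean tokens, which is all the trainer's vocabulary-building code appears to assume.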

junfenglx commented 8 years ago

My thought is: from the decoding perspective, you have two sentences as inputs, one being the conditioning sentence and the other the decoder input sentence, plus another sentence as the target (the decoder output sentence).

For example, in a translation problem (English -> French): X is the French sentence (the target sentence), and C is the English sentence (the conditioning source sentence, which also serves as the decoder input), so C is used twice.
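To make the analogy concrete, here is a minimal sketch of how the two lists would be aligned; the sentences are invented placeholders, and the only real constraint shown is that C[i] conditions the generation of X[i]:

```python
# Hypothetical aligned lists mirroring the translation analogy:
# C[i] is the source sentence whose encoding conditions the decoder,
# and X[i] is the target sentence the decoder learns to emit.
C = ["the cat sat on the mat .", "it is raining ."]
X = ["le chat est assis sur le tapis .", "il pleut ."]

# The two lists must stay index-aligned, pair by pair.
assert len(C) == len(X)
pairs = list(zip(C, X))
```

Whatever the domain, the decoding code expects X and C to be parallel lists of the same length, one (source, target) pair per index.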

ncoronges commented 8 years ago

Thank you for your response @junfenglx. In the romance-novel scenario this project refers to, what do you see the target and source being?

giddyyupp commented 7 years ago

Hi @ncoronges @jtoy, I believe that for the neural-storyteller project, X should be the romance novels, C should be the COCO captions, and the model is skip-thoughts, but I am not sure. I am currently training my own decoder with this configuration; if it works, I'll let you know. @jtoy, I guess you should supply one sentence per line for both the X and C data. In my opinion C is the MS COCO caption set, so your X data should follow the same format. I used `from nltk.tokenize import sent_tokenize` to split my book collection into sentences, and nothing more.
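To illustrate the one-sentence-per-line format described above: giddyyupp used NLTK's `sent_tokenize` for the splitting step; the naive period-based splitter below is a self-contained stand-in so the sketch runs without NLTK's data files:

```python
import re

def naive_sent_tokenize(text):
    """Stand-in for nltk.tokenize.sent_tokenize: split on
    sentence-final punctuation followed by whitespace. Good enough
    to illustrate the target format, not for production use."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sents if s]

raw_book = "She opened the hatch. Stars everywhere! Was this home?"
# One sentence per entry, matching the one-caption-per-line format of C.
X = naive_sent_tokenize(raw_book)

# Written to disk, this becomes one sentence per line:
print("\n".join(X))
```

The point is only the shape of the data: each entry of X is a single sentence, just as each COCO caption in C occupies a single line.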

danielricks commented 7 years ago

If anyone else is having trouble creating a decoder, I've written simplified code to do that here: https://github.com/danielricks/penseur