tensorflow / models

Models and examples built with TensorFlow

about result of textsum #618

Closed luyahan closed 7 years ago

luyahan commented 7 years ago

I trained the model using the toy data, then ran decoding, and the decoding result is very strange. For example: output= . The latest running_avg_loss is about 0.002. What causes this? Is the dataset (toy data) too small? I'm particularly grateful for your help.

xtr33me commented 7 years ago

@luyahan In the future please look through open and closed tickets or perform a search first before posting a question as it helps keep the same questions from being opened multiple times. That said, refer to this ticket: https://github.com/tensorflow/models/issues/464

You will find a vocab file there that you can point to that will give you some results other than . I have issued a pull request for this file to be included, but the response so far is that the model isn't supposed to work with the toy dataset; the toy data is only meant to illustrate the flow.

Something really important to note here is that because this is an abstractive model (not extractive), you will need a lot of training data to get usable results: the model is genuinely trying to generate a headline from the input, rather than producing a reduced result by deleting words. This means the model requires a lot of "clean" data, so you will have to either procure your own dataset via scraping or pay for something like the Gigaword dataset. I reached out to LDC and they advised that they do provide some cheaper datasets from sources like the NY Times, so should my attempts at scraping my own data not work out, I will be looking into that option.

One final note: should you still be wondering why you are getting , it is because the majority of the words in the toy dataset are not included in the vocab file. All I did was run the toy dataset through a script to count the words and then append those to the existing vocab file.
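The word-counting step described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual script: the file paths are made up, and it assumes whitespace-tokenized plain text, whereas textsum's shipped toy data is actually serialized TF Examples that you would need to decode first.

```python
# Hedged sketch: count word frequencies in a plain-text version of the toy
# dataset and append any words missing from the existing vocab file.
# textsum's vocab file uses one "word count" pair per line.
from collections import Counter

def extend_vocab(text_path, vocab_path):
    counts = Counter()
    with open(text_path) as f:
        for line in f:
            counts.update(line.split())

    # Words already present in the vocab file (first token on each line).
    with open(vocab_path) as f:
        known = {line.split()[0] for line in f if line.strip()}

    # Append only the words the vocab file doesn't already have.
    with open(vocab_path, 'a') as f:
        for word, n in counts.most_common():
            if word not in known:
                f.write('%s %d\n' % (word, n))
```

After running something like this, pointing the `--vocab_path` flag at the extended file should stop the decoder from mapping most toy-data words to the unknown token.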

Hope that helps some. Please close this when you have the chance.

licaoyuan123 commented 7 years ago

@xtr33me I am also doing research on automatic text summarization.

I am wondering whether your scraped data works well.

Is it possible for you to open source your scraped data and code?

The code released with the paper Teaching Machines to Read and Comprehend includes open-source scraping code that can fetch around 200,000 news articles from CNN and the Daily Mail.