@Selva163 this is related to https://github.com/tensorflow/models/issues/370
See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.
@panyx0718 I have some manually curated short descriptions paired with long descriptions, and I want to use them as training data. What should the format of the file be, and how should the data (long description) and label (short description) be arranged? It would be nice if you could show a sample training data file.
@Selva163 panyx0718 has already provided the link with everything you need to know to format your data. Use the link and run the binary_to_text conversion on the toy dataset. Then write a formatter for your own data so that it matches that dataset, run text_to_binary against your data, and that's it.
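In case it helps to see it concretely, here is a minimal sketch of such a formatter, assuming the tab-separated key=value text layout used by the toy dataset (article, abstract and publisher fields, sentences wrapped in <s> ... </s> tags). The names my_pairs.tsv, my_data.txt and to_textsum_line are made up for illustration:

# Sketch: turn "long_description<TAB>short_description" pairs into the
# tab-separated text layout of textsum's toy dataset, so that
# data_convert_example.py can then convert it to the binary format.

def to_textsum_line(article, abstract, publisher='MYSOURCE'):
    wrap = lambda text: '<d> <p> <s> ' + text.strip() + ' </s> </p> </d>'
    return '\t'.join(['article=' + wrap(article),
                      'abstract=' + wrap(abstract),
                      'publisher=' + publisher])

with open('my_pairs.tsv') as src, open('my_data.txt', 'w') as out:
    for line in src:
        long_desc, short_desc = line.rstrip('\n').split('\t')
        out.write(to_textsum_line(long_desc, short_desc) + '\n')

Then run the conversion, e.g.:

$ python data_convert_example.py --command text_to_binary --in_file my_data.txt --out_file my_data.bin

(Flag names as in data_convert_example.py; double-check them against your copy, and compare your my_data.txt against a binary_to_text dump of the toy data to make sure the fields match.)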
Unsure why this hasn't been closed yet, but should a sample source (non-binary) file be needed, please find one attached below. I still recommend following @panyx0718's advice and checking out the referenced data_convert_example.py.
@xtr33me thanks, please let me check this again.
@xtr33me @panyx0718 Given that the training set format is the Annotated English Gigaword one, is any of the following suitable for real-world training?
US Snapshot ID (Linux/Unix): snap-c1d156aa (US West)
Size: 40GB
Source: The Usenet
Created On: November 17, 2010
Last Updated: November 24, 2015
CzEng 1.0 (Czech-English Parallel Corpus, version 1.0), whose English side has:
232,691,149 a-layer
142,897,353 t-layer
This one has been used for the Moses statistical machine translation system.
AWS Public Datasets: finally, several encyclopedic corpora are provided by AWS.
Multi-language Parallel Corpora: this collection of parallel corpora is linked from the Moses project.
Back to the question: which one (if any) can be used as a training set for textsum without changing (too much) the input data format?
Thank you.
@loretoparisi To be honest, any corpus can be used as long as it provides a clean headline and article. For me this has been the toughest part: acquiring clean data. With the Usenet group datasets you usually aren't going to have a title; it will just be an article (at least in what I found).
So simply put, you can use any dataset that will provide "clean" data containing a proper "heading" and an "article". This was one of the reasons I didn't like the Wikipedia corpus: the headings read too much like an encyclopedia, with no creativity to them. Through all this, keep in mind that with any AI it is "garbage in, garbage out": the cleaner you can make your source data, the better your output will be.
That all said, in the end you still need to format the data from your source into the binary files the model reads. So as long as you have a lot of articles, and I do mean a lot, and you are happy with how well each headline represents the first few sentences of the body, you can consider it good data to work with.
Please keep in mind that I have yet to get really great results; my training is still going on and I have only been training against 40k articles so far. I'm also scraping my own data at the moment, since I wasn't happy with many of the corpora I found. I wanted to use this first set of articles to better understand how the model works, but the more I understand, the more I see why more data is so important. This is an "abstractive" model: it tries to truly infer the next word from the surrounding words and from the abundance of training data with similar sentence structure, which is why it needs so much data.
One other important thing to note is that for smaller datasets you may need to train to an average loss lower than 1.0. With my data source being only 40k articles, I have had to go to a lower average loss to see even a somewhat decent result that matches the reference (currently at 0.44), which I believe comes back to what I stated above. I am still training and experimenting until I am able to scrape 1 million+ articles. If you are fortunate enough to have access to the Gigaword dataset, it seems to be the way to go, but I assume you are in the same boat I am and can't drop $6k on a learning project. In that case you may need to find which of the datasets above gets you the best result and roll with it. Whichever way you go, you will need to format the data, so just try to find a source that has a lot of articles.
I know this was a long-winded answer but I hope it helps somewhat. @panyx0718 will have better insight than I should he have anything to add.
I found it pretty easy to modify data.py to read the text input format that I wanted. Easier than converting to the format it accepts by default.
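For anyone going the same route, a rough illustration of what that kind of change could look like: replacing the binary-record loop in data.py's example generator with a plain TSV reader. The one-pair-per-line layout here is hypothetical, and the rest of the pipeline would still need small adjustments to consume raw strings instead of tf.Example protos:

# Hypothetical substitute for textsum's binary ExampleGen: read TSV shards
# with one "article<TAB>abstract" pair per line and yield text pairs.
import glob
import random

def text_example_gen(data_path, num_epochs=None):
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        filelist = glob.glob(data_path)
        assert filelist, 'Empty filelist.'
        random.shuffle(filelist)
        for fname in filelist:
            with open(fname) as f:
                for line in f:
                    parts = line.rstrip('\n').split('\t')
                    if len(parts) != 2:
                        continue  # skip malformed lines
                    yield parts[0], parts[1]  # (article, abstract)
        epoch += 1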
@xtr33me thanks for the clarification and the numbers; that's a good point about cleaning the data before use! @theis188 that's an option. I mean, there seems to be no standard at all apart from TSV formatting of a text file; I see datasets of the same kind with different formats, so they should provide a parser plugin for TensorFlow, Theano, NLTK, or whatever.
@xtr33me thanks for the clarification. I started with the toy data that comes with the project. I started the training last night and it is still going; running_avg_loss has reached 0.000001. The default max_run_steps value is 10M. When will the training finish? Should I run eval mode after the training completes?
Hey @bitbitsbyte ... you are doing the same thing I first did when messing with TensorFlow. You do not want to overfit, which is what you are doing with an avg loss of 0.000001. One important thing to remember any time you train any model is to never overfit. If you do, your model will ONLY work on your training data and won't be able to handle anything else.
That said, you are also unfortunately not going to get any decent results with the test data provided. I found out that it is only there as a way for you to see the flow, not to produce good results. In my experience you are going to want 800k+ articles.
One final note: you do not have to wait for the training to stop automatically. Open up a TensorBoard instance and point it at your log_root directory. That will let you watch the average loss chart and keep an eye out for when the model starts to overfit.
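For reference, that is simply (assuming TensorBoard is installed and log_root is whatever directory you passed to the trainer):

$ tensorboard --logdir=textsum/log_root

Then open http://localhost:6006 in a browser and watch the running_avg_loss curve.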
This image gives you an idea of what you are looking for when the model is starting to overfit. It is at that point that you want to just stop the training manually. Hope that helps. Good luck!
This is a very old post, but I can say this is one of the most interesting tasks in TensorFlow! So I'm updating it here with my findings about some real-world examples. Hopefully this will help the TF authors add something similar.
This tutorial shows how to train a text summarization model with the Gigaword dataset and OpenNMT in C++: Text Summarization on Gigaword and ROUGE Scoring.
There is an OpenNMT Python tutorial here as well.
HarvardNLP has published its text summarization dataset (both training and evaluation) in the sent-summary repository. The compressed dataset is 277MB:
├── [ 238] DUC2003
│ ├── [125K] input.txt
│ ├── [ 42K] task1_ref0.txt
│ ├── [ 44K] task1_ref1.txt
│ ├── [ 45K] task1_ref2.txt
│ └── [ 48K] task1_ref3.txt
├── [ 238] DUC2004
│ ├── [102K] input.txt
│ ├── [ 34K] task1_ref0.txt
│ ├── [ 35K] task1_ref1.txt
│ ├── [ 35K] task1_ref2.txt
│ └── [ 35K] task1_ref3.txt
├── [ 136] Giga
│ ├── [327K] input.txt
│ └── [101K] task1_ref0.txt
└── [ 204] train
├── [202M] train.article.txt.gz
├── [ 61M] train.title.txt.gz
├── [ 33M] valid.article.filter.txt
└── [9.6M] valid.title.filter.txt
Regarding the format:
$ head -n2 train/train.article.txt
australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .
at least two people were killed in a suspected bomb attack on a passenger bus in the strife-torn southern philippines on monday , the military said .
$ head -n2 train/train.title.txt
australian current account deficit narrows sharply
at least two dead in southern philippines blast
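If it is useful, here is a small sketch of reading those parallel files into (article, abstract) pairs, assuming, as the two head outputs above suggest, that train.article.txt.gz and train.title.txt.gz are line-aligned. From there you can feed the pairs into whatever formatter you use for textsum:

# Read the line-aligned Gigaword-style article/title files and print a few pairs.
# Note: the text is already lowercased, with digits masked as '#' and
# parentheses written as -lrb- / -rrb-.
import gzip
import itertools

with gzip.open('train/train.article.txt.gz', 'rt') as articles, \
     gzip.open('train/train.title.txt.gz', 'rt') as titles:
    for article, title in itertools.islice(zip(articles, titles), 2):
        print('article :', article.strip())
        print('abstract:', title.strip())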
The DUC 2003 and DUC 2004 datasets are included as well.
Now the interesting fact is that OpenNMT, which provided C++/Torch and Python implementations, now provides a TensorFlow implementation here, so it should be possible to train the text summarization model on these datasets with TensorFlow by following that tutorial. There are no examples for that in TF, and it would be nice to have one here.
Also this new framework can help to validate the text summarization task results: Multi-language evaluation framework for text summarization.
Hope this helps!
This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!
I've trained the model using the command provided, but I don't see any 'train' folder in the 'textsum/log_root/' directory. Since training was done on a sample file, will the model be able to work on real test data? If not, how can I make training data and train the model? And most importantly, how can I test / use the model to see the resulting summarization?
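Not a full answer, but for the last questions: the train/decode entry points are the ones from the textsum README. A hedged example of the usual invocations follows; the paths are placeholders, and the flag list should be double-checked against the README version you built from:

$ bazel-bin/textsum/seq2seq_attention --mode=train \
    --article_key=article --abstract_key=abstract \
    --data_path=data/training-* --vocab_path=data/vocab \
    --log_root=textsum/log_root --train_dir=textsum/log_root/train

$ bazel-bin/textsum/seq2seq_attention --mode=decode \
    --article_key=article --abstract_key=abstract \
    --data_path=data/test-* --vocab_path=data/vocab \
    --log_root=textsum/log_root --decode_dir=textsum/log_root/decode \
    --beam_size=8

The --train_dir flag is what creates the textsum/log_root/train folder, and --mode=decode writes generated summaries under --decode_dir. As noted earlier in the thread, training on the sample file alone will not generalize to real test data, so you would still need to build a much larger training set.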