udibr / headlines

Automatically generate headlines for short articles
MIT License
526 stars 150 forks

how to get the training data? #1

Open rsarxiv opened 8 years ago

rsarxiv commented 8 years ago

How can I get the training data?

udibr commented 8 years ago

Start from buzzfeed.com/archive/ and huffingtonpost.com/archive/2005-5-1. For Gawker, Gizmodo, and Deadspin, use a project on GitHub called kinja; for each of these sites, start from /sitemap.xml.
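For the /sitemap.xml route, here is a minimal sketch of pulling the `<loc>` URLs out of a sitemap, assuming the requests library is available; the gizmodo.com URL is only an illustration and this repo does not ship a scraper:

```python
import requests
import xml.etree.ElementTree as ET

# Standard sitemap namespace; tags in sitemap.xml are qualified with it.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Return all <loc> entries (article or sub-sitemap URLs) from a sitemap."""
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

if __name__ == "__main__":
    # /sitemap.xml is often an index pointing at per-month sitemaps,
    # so you may need one more level of recursion to reach article URLs.
    for url in sitemap_urls("https://gizmodo.com/sitemap.xml")[:10]:
        print(url)
```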

rsarxiv commented 8 years ago

thanks a lot!

hipoglucido commented 8 years ago

Hello :) I have some questions about generating datasets to train this model:

  1. Correct me if I am wrong, but as far as I have read, the state of the art in abstractive summarization has not reached the point of summarizing "long" (~3000 character) articles successfully. I think the idea behind your 684K (article_first_paragraph, article_headline) dataset could be understood as training on (paragraph, paragraph_idea) instances. Am I right?
  2. My aim is to abstract the concepts behind a "long" article. Therefore, I think I could try to get a dataset of (paragraph, paragraph_idea) instances, train the model, divide my target articles into paragraphs/chunks (probably using some extractive summarization technique), and run predictions on every chunk. How does that sound to you?
  3. Do you think it is worth making the effort to get more than 1M instances, or will there be no further improvement after some hundreds of thousands?
  4. My intuition tells me that keywords are useful features. Why don't you use them? Does your code support the option of using them?
  5. How good or bad an idea do you think it is to mix Spanish instances with English instances in the same training set?

Thanks a lot for your work

udibr commented 8 years ago

  1. Yes, it is a different problem from document summarization. Usually the first paragraph of a web news article contains the info you need for a headline, compared to summarizing a novel in which the interesting part happens after 500 pages...
  2. I would just try to concatenate all the small chunks (let's say each one is a sentence) as if they were one paragraph (see the sketch below).
  3. More data is always better; the only problem is that different sources have different styles of writing.
  4. Only a few sources have keywords, and they appear to be inconsistent, so I never tried using them.
  5. It accidentally did work for me, before I noticed that I had a few Spanish examples in my training set... the RNN generated Spanish when needed. So it all depends on what usage you are planning; maybe mixing languages would make for a more interesting result.
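A rough sketch of point 2, assuming NLTK's punkt sentence tokenizer is installed (nltk.download('punkt')); the function name and the first-N selection are placeholders for whatever extractive step you use, not part of this repo:

```python
from nltk.tokenize import sent_tokenize

def pseudo_paragraph(article_text, n_sentences=5):
    """Split a long article into sentences and join a selected subset
    back into one paragraph-like string for the headline model."""
    sentences = sent_tokenize(article_text)
    # Stand-in for an extractive selection step: keep the first few sentences.
    selected = sentences[:n_sentences]
    return " ".join(selected)
```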

hipoglucido commented 8 years ago

Following up on @rsarxiv's question answered by @udibr, I would like to suggest http://gdeltproject.org/ as a news source.

xtr33me commented 7 years ago

I am new to NLG and currently trying to learn via a small project of interest, and yours piqued my interest the most. I was trying to create a dataset based on what I believed was expected. I wrote a scraper that takes the headline and description and runs them through nltk's tokenizer, which I then return as separate lists. I then perform a cPickle.dump, passing titles, descriptions, and keywords (which right now is empty).

When I run this through vocabulary-embedding I get an error on read-in when set is called, I guess because I am passing in a list (I'm new to Python as well). So I then called pickle.dump and created the dataset by passing the titles and descriptions to tuple, but that has created some issues of its own. I was just curious what the tokens.pkl file is expected to look like. If you can shed any light on this, it would help greatly. Thanks.

-- Just to add a bit more to this: currently in my data tokenizer I am tokenizing each title and appending it to my list. When I go about dumping this via pickle I am calling it like so, where titles and descriptions are lists: pickle.dump(tuple((titles, descriptions, keywords)), filename)

If I perform a print before the dump on titles[0], I am getting: [u'new', u'information', u'found']

Is this correct going in? I question it because, looking at the notebook, when you index heads[i] you retrieve 'Remainders : Super wi-fi edition'

However, I thought that tokenizing involves separating not only each sentence but also each individual word. Perhaps I just need to modify my tokenizer file, but I wanted to first check whether my assumption about where I am going wrong is correct.

-------- Edit: I am posting this just in case anyone as noobish as me has the same issue. I was able to align this more closely with the notebook by performing a join like so: titles.append(" ".join(tokenize(article[0])))

The other stupid thing I was doing was passing in the three lists (titles, descriptions, and keywords) like this: pickle.dump(titles, descriptions, keywords, f, -1)

Simply adding parentheses around the three, as Ehud had stated in the readme, passed them in as one parameter and fixed the problem.
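Putting both fixes together, a minimal sketch, assuming NLTK's word_tokenize as the tokenizer and tokens.pkl as the output file name (both taken from this thread, not from the repo):

```python
import pickle
from nltk.tokenize import word_tokenize as tokenize  # any word tokenizer works

# Placeholder input; in practice this comes from the scraper.
articles = [("New information found", "Description text for the article.")]

titles, descriptions, keywords = [], [], []
for article in articles:
    titles.append(" ".join(tokenize(article[0])))   # one plain string per title
    descriptions.append(" ".join(tokenize(article[1])))
    keywords.append("")                             # no keywords scraped yet

with open("tokens.pkl", "wb") as f:
    # one tuple argument, not three positional lists
    pickle.dump((titles, descriptions, keywords), f, -1)
```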

udibr commented 7 years ago

NLG should be NLP. In https://github.com/udibr/headlines/blob/master/vocabulary-embedding.ipynb, cell #7 shows how the information is organized in the pickle: it's just a tuple of 3 lists (heads, desc, keywords). In cells #11, 12 and 13 you see a single item from each of these lists; basically each item is a string. This is very simple, so just work a little bit more on your Python. I think you are already there.
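In other words, each item is just a string. A minimal sketch of loading and inspecting the pickle, mirroring cells #7 and #11-13 of the notebook (the tokens.pkl file name is an assumption from this thread):

```python
import pickle

with open("tokens.pkl", "rb") as f:
    heads, desc, keywords = pickle.load(f)  # a tuple of three parallel lists

print(len(heads), len(desc), len(keywords))
print(heads[0])  # a single headline string, e.g. 'Remainders : Super wi-fi edition'
print(desc[0])   # the matching first-paragraph string
```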

xtr33me commented 7 years ago

Thanks for the reply and the share of the project!

czygithub commented 7 years ago

How can I get the training data from buzzfeed.com/archive/ or huffingtonpost.com/archive/2005-5-1?

imranshaikmuma commented 7 years ago

@rsarxiv can you help me get the data (training and test)? Please.

ibarrien commented 6 years ago

@czygithub: try: http://www.huffingtonpost.com/archive/ e.g. http://www.huffingtonpost.com/archive/2017-3-2

ibarrien commented 6 years ago

@udibr could you share how to access your specific training and test data so that we may attempt to reproduce your results? This would be very helpful (and scientific!)

Note that much of the buzzfeed archive consists of headlines + videos/image captions, instead of headlines + text "desc"

shahdivyam commented 6 years ago

@udibr It would be great if you could share the training data you used to train the model. Thanks

Zierzzz commented 6 years ago

@udibr I am fuzzy about how to get the training data set, so it would be great if you could share the training data you used to train the model. Thanks a lot!

nauman-akram commented 6 years ago

@hipoglucido did you work on the idea you mentioned of chunking the data (maybe through an extractive approach) and then predicting on the new values? And if you have fully working code for that, or this summarization code in Python 3.x, kindly share it.

hipoglucido commented 6 years ago

I tried it, the results were interesting but not good enough. Sorry, I can't share it since I don't have access to the code any more.
