microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

How can I use CNTK for news summarization? #2148

Closed JafferWilson closed 7 years ago

JafferWilson commented 7 years ago

I am looking for a newspaper summarization tool based on sequence-to-sequence models. I want to know how I can apply CNTK to text/news summarization, and which datasets are needed to train the model. If there are already prepared datasets or a pretrained model, please let me know.

cha-zhang commented 7 years ago

We have a tutorial on sequence to sequence: https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_204_Sequence_To_Sequence.ipynb Unfortunately, we don't have an example for newspaper summarization.

JafferWilson commented 7 years ago

@cha-zhang Can you help me modify the code to make summarization possible? It's a bit urgent, if you can help.

cha-zhang commented 7 years ago

I'm sorry, but without knowing what kind of data you have, it's difficult for us to help with this. We don't have a dataset that aligns well with what you need.

JafferWilson commented 7 years ago

@cha-zhang Please take any dataset you like and just show me a demonstration of how CNTK works for text summarization. If it is still unclear what I mean, you can try this dataset: https://drive.google.com/file/d/0B6N7tANPyVeBNmlSX19Ld2xDU1E/view

skynode commented 7 years ago

Isn't this supposed to be on Stack Overflow? Or Kaggle? @cha-zhang?

cha-zhang commented 7 years ago

Yes, Stack Overflow is a much better place for this kind of question. GitHub issues are for reporting actual issues with CNTK.

DHOFM commented 7 years ago

@JafferWilson IMHO it is all covered in the link cha-zhang provided. I did not check your test data, but you will need a large amount of input text and output sequences to train your model. This dataset is often mentioned in this context: https://catalog.ldc.upenn.edu/LDC2003T05 but you will have to pay 3,000 USD for it. For a deeper dive into this topic you can check https://arxiv.org/abs/1509.00685 and https://arxiv.org/abs/1602.06023 Please keep us informed about your experiences...

JafferWilson commented 7 years ago

@DHOFM Even if I get the LDC datasets, I do not know how to write the program. Can you just show me how to do it?

DHOFM commented 7 years ago

@JafferWilson The code from cha-zhang could be used. Once you have the data, you need to create a dataset of paired input and output sequences from it. The basic idea and algorithm are the same as in the tutorial: there the output is a translation, in your case it is a summary. But to handle datasets as large as Gigaword, your machine setup will need to be large enough, too. What GPU(s) do you use for training, and what is the goal?
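As a rough illustration of the "pairs of input and output sequences" step, here is a minimal sketch in plain Python. It assumes whitespace tokenization, and it assumes the stream names `S0` (article) and `S1` (summary) and a sparse `index:1` line layout modeled on the CNTK Text Format file used by the sequence-to-sequence tutorial. The helper names `build_vocab`, `encode`, and `to_ctf_lines` are hypothetical, and the exact layout should be checked against the CNTK reader documentation before training on it.

```python
from collections import Counter
from itertools import zip_longest

SPECIALS = ("<unk>", "<s>", "</s>")  # assumed special tokens, not mandated by CNTK


def build_vocab(texts):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, _ in counts.most_common():
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a text into a list of ids, wrapped in start/end markers."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def to_ctf_lines(pairs, src_vocab, tgt_vocab):
    """Emit one text line per time step; both streams share a sequence id."""
    lines = []
    for seq_id, (article, summary) in enumerate(pairs):
        src = encode(article, src_vocab)
        tgt = encode(summary, tgt_vocab)
        # The article is usually longer than the summary, so later rows
        # carry only the S0 stream.
        for s, t in zip_longest(src, tgt):
            parts = [str(seq_id)]
            if s is not None:
                parts.append(f"|S0 {s}:1")
            if t is not None:
                parts.append(f"|S1 {t}:1")
            lines.append("\t".join(parts))
    return lines


if __name__ == "__main__":
    pairs = [("the cat sat on the mat", "cat sat")]  # toy (article, summary) pair
    src_vocab = build_vocab(article for article, _ in pairs)
    tgt_vocab = build_vocab(summary for _, summary in pairs)
    print("\n".join(to_ctf_lines(pairs, src_vocab, tgt_vocab)))
```

Separate vocabularies for articles and summaries are used here only to keep the sketch simple; a shared vocabulary is also a reasonable choice for summarization, since both sides are in the same language.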

JafferWilson commented 7 years ago

I mainly use 16 Tesla K80 GPUs for training. I am looking for a summarization solution that is abstractive rather than extractive. I have tried TensorFlow and OpenNMT but didn't get any satisfying results. These libraries are not what I am looking for, so I thought I would give CNTK a try to see if it works.

DHOFM commented 7 years ago

I only tested TensorFlow some time ago and then came to CNTK, which was perfect (or near to it) for my needs, but in your case I am surprised that TF is not what you need. They have a pretrained model at https://github.com/tensorflow/models/tree/master/textsum and they used the Gigaword dataset for it. I can only guess here, but I would expect it to be easier to use than implementing your own in CNTK...

JafferWilson commented 7 years ago

Yes, I tried TextSum, but it gave me the worst summarization results. I do not know the reason yet, but if it is workable as you say, then I need to recheck and rethink how I am using it.

sayanpa commented 7 years ago

Please reopen this issue after you have had a chance to check and rethink. Given that it has been a week, closing it for now.