How to use other data? How to create vocab.txt file?

lancopku / DPGAN

Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text (EMNLP2018)


How to use other data? How to create vocab.txt file? #2

Open wrapperband opened 6 years ago

wrapperband commented 6 years ago

How to use other data? How to create vocab.txt file?

The program crashed / stalled my PC after about 8 hours of training. However, it was using the CPU, so I tried to create a smaller data set.

I assumed that https://github.com/lancopku/DPGAN/blob/master/review_generation_dataset/generate_review.py is what formats the data.

I've been trying to read this program. I was (and am) hoping it formats the data in some way, but there aren't any comments for a "non-coder" to follow. I assumed I had to change the path? I'm on Linux.

generate_review.py L52 : file_path = "F:\dataset\yelp_dataset\sorted_data"

johndpope commented 6 years ago

Pretty sure this is just a word2vec model. See here for training:

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/

https://github.com/dav/word2vec

akhileshkumargangwar commented 6 years ago

Hi, I am still unable to understand how vocab.txt is created and why many words are assigned the same integer value.

jklj077 commented 6 years ago

@wrapperband

Yes, changing the path should work. The path should point to a directory that contains all the review files, which should be JSON files.
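As a rough illustration of the path change on Linux: a Windows drive path such as the one on line 52 of generate_review.py will not resolve there, so `file_path` needs to point at a local directory instead. The directory below is a hypothetical location, and `list_review_files` is an illustrative helper (not part of the DPGAN repository) that only checks the directory exists and lists the files in it.

```python
from pathlib import Path


def list_review_files(directory):
    """Return the regular files directly inside `directory`, sorted by name.

    Illustrative helper only; it does not parse or validate the JSON content.
    """
    root = Path(directory).expanduser()
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # Skip subdirectories; keep only regular files.
    return sorted(p for p in root.iterdir() if p.is_file())


# Hypothetical Linux location; replace it with wherever the review files live.
file_path = "~/dataset/yelp_dataset/sorted_data"
```

Calling `list_review_files(file_path)` before running the script is a quick way to confirm that the directory is found and is not empty.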

The script for generating vocab.txt is not released, but the format is quite simple. vocab.txt contains the word list used for indexing. It is not an embedding file. Each line of vocab.txt contains (1) the lowercased word and (2) its frequency in the training text, i.e., how many times it appears in the training text. The words are ranked by frequency, so the common words come first and the rare words come last.
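Since the original script is not released, the following is only a minimal sketch of how a file in the described format could be produced. It assumes whitespace tokenization and a single space between the word and its frequency; neither detail is confirmed in this thread, and any special tokens the model may expect are not covered. `build_vocab` and `write_vocab` are hypothetical names, and extracting the review text from the JSON files is left out because the field names are not given here.

```python
from collections import Counter


def build_vocab(texts):
    """Count lowercased tokens and return (word, count) pairs, most frequent first.

    Assumption: tokens are separated by whitespace.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Descending frequency; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def write_vocab(pairs, path):
    """Write one "word count" pair per line (the space separator is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs:
            f.write(f"{word} {count}\n")
```

For example, `write_vocab(build_vocab(review_texts), "vocab.txt")` would write the file, where `review_texts` is any iterable of review strings taken from the training data.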

Best regards

akhileshkumargangwar commented 6 years ago

Thank You.

johndpope commented 6 years ago

This unrelated code may be worth cherry-picking from; see the Python code at https://github.com/johndpope/vocab-mashup. It's pretty impressive at smashing text together, and it can help augment training sets.

akhileshkumargangwar commented 6 years ago

Thanks, I will check.

akhileshkumargangwar commented 6 years ago

Hi, this DP-GAN code is showing lots of errors. No reviews are being generated in discriminator_test/negative/*.txt; the reviews come out empty. I want to learn the flow of a GAN by debugging, but it is taking a lot of time to fix the errors. Is there any other updated code? I also tried SeqGAN, but it uses synthetic data. So please help me; there are also some errors I am unable to fix. Thanks.

jingjingxupku commented 6 years ago

I cannot reproduce your problem on my local datasets. I guess it is mainly caused by the small amount of training data. I only released a small subset of the dataset to illustrate the data format used by the current code. Since the default number of epochs for training the generator is set to 1, the generator learns nothing on this small dataset. After I increased the number of training epochs, the problem went away. I have updated the repository with my latest code, so please download it again. I have also released the whole dataset on Google Drive; you can download it via readme.md.

akhileshkumargangwar commented 6 years ago

Thanks a lot.