Open wrapperband opened 6 years ago
pretty sure this is just a word2vec model - see here for training
The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.
More information about the scripts is provided at https://code.google.com/p/word2vec/
Hi, I am still unable to understand how vocab.txt is created and why many words are assigned the same integer value.
@wrapperband
Yes, changing the path should work. The path should point to a directory that contains all the review files, which should be JSON files.
The script for generating vocab.txt is not released, but the format is quite simple. vocab.txt contains the word list used for indexing. It is not an embedding file. Each line of vocab.txt contains (1) the lowercased word and (2) its frequency, i.e., how many times it appears in the training text. The words are sorted by frequency, so common words come first and rare words come last.
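Since the original script is not released, the following is only a minimal sketch of how a file in the format described above could be produced. The function name `build_vocab`, whitespace tokenization, and the single-space separator between word and count are assumptions, not the authors' method; check the separator against the vocab.txt shipped with the repository. Because the second column is a count rather than an ID, many words that occur equally often will share the same number.

```python
from collections import Counter


def build_vocab(texts, out_path, sep=" "):
    """Write one 'word<sep>frequency' line per word, most frequent first.

    Hypothetical helper: tokenization (whitespace split) and the separator
    are assumptions and may differ from the released vocab.txt.
    """
    counts = Counter()
    for text in texts:
        # Lowercase before counting so "The" and "the" share one entry.
        counts.update(text.lower().split())

    with open(out_path, "w", encoding="utf-8") as f:
        # most_common() sorts by descending frequency;
        # ties keep the order in which words were first seen.
        for word, freq in counts.most_common():
            f.write(f"{word}{sep}{freq}\n")
    return counts
```

To use other data, pass an iterable of review strings, for example `build_vocab(reviews, "vocab.txt")`, where `reviews` is whatever list of texts you have loaded from your own files.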
Best regards
Thank You.
This unrelated code may be worth cherry-picking from: see the Python code at https://github.com/johndpope/vocab-mashup. The way it mashes text together is pretty impressive, and it can help augment training sets.
Thanks, I will check.
Hi, this DP-GAN code is showing lots of errors. No reviews are generated in discriminator_test/negative/*.txt; the files contain empty reviews. I want to learn the flow of a GAN by debugging, but it is taking a lot of time to fix the errors. Is there any updated code? I also tried SeqGAN, but it uses synthetic data. Please help; there are some errors I am unable to fix. Thanks
I cannot reproduce your problem on my local datasets. I think it is mainly caused by the small training data: the subset I released was only meant to illustrate the data format expected by the current code. Since the default number of generator training epochs is set to 1, the generator learns nothing on this small dataset. I have therefore increased the training epochs, which fixed the problem. I have updated the code, so please download it again. I have also released the whole dataset on Google Drive; you can download it from the link in readme.md.
Thanks a lot.
How do I use other data? How do I create the vocab.txt file?
The program crashed / stalled my PC after about 8 hours creating the training. However, it was using the CPU, so I tried to create a smaller data set.
I assumed that https://github.com/lancopku/DPGAN/blob/master/review_generation_dataset/generate_review.py is what formats the data.
I've been trying to read this program. I was / am hoping it formats the data in some way, but there aren't any comments for a "non coder" to follow. I assumed I had to change the path? I'm on Linux.
generate_review.py L52 : file_path = "F:\dataset\yelp_dataset\sorted_data"
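The hard-coded Windows path above will not exist on Linux. As a rough sketch only, one way to check a replacement directory before running the script is shown below. The helper `list_review_files` and the example path are hypothetical and are not part of generate_review.py; the only thing taken from the thread is that the directory should contain the review files as JSON files.

```python
from pathlib import Path


def list_review_files(review_dir):
    """Return the sorted .json files directly inside review_dir.

    Hypothetical helper for sanity-checking the directory that
    file_path in generate_review.py is changed to point at.
    """
    root = Path(review_dir).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    return sorted(p for p in root.iterdir() if p.suffix == ".json")


# Example usage (the path is an assumption; use your own location):
# files = list_review_files("~/dataset/yelp_dataset/sorted_data")
# print(len(files), "review files found")
```

If the list comes back empty, the script would have nothing to process, so it is worth checking this before starting a long run.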