raphael-sch / SumVAE

Vocab.txt file #1

gaurav2699 commented 2 years ago

The vocab.txt file is not provided in the repo, so I created one on my own. On the first try I put all dictionary words in the file, but that slowed down the computation significantly and crashed my laptop because there were too many words. On the second try I made a vocab file containing the top 10,000 English words, but then many words in the training dataset were not recognised because they are not in the vocab file, which gives suboptimal results. Could you please let me know how to create the vocab.txt file, or share the vocab.txt file you used? Thanks a lot!
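
For later readers: a quick way to quantify this coverage problem is to measure the out-of-vocabulary rate of a candidate vocab.txt against the training data. The sketch below is not part of the repo; the file names and the simple lowercase/whitespace tokenization are assumptions and may not match the project's preprocessing.

```python
def oov_rate(corpus_path, vocab_path):
    """Fraction of training tokens that are not covered by the vocab file."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {line.strip() for line in f if line.strip()}
    total = missing = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for token in line.strip().lower().split():  # assumed tokenization
                total += 1
                if token not in vocab:
                    missing += 1
    return missing / max(total, 1)

# Hypothetical file names; substitute your actual training and vocab files.
print(f"OOV rate: {oov_rate('train.txt', 'vocab.txt'):.1%}")
```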

raphael-sch commented 2 years ago

Hello Gaurav, as the config file indicates, the top 7,000 words are used. Rare words are replaced by unk tokens, so it is expected that the summaries also contain unk tokens. Note that the project is 4 years old; subword tokenization was not widely adopted back then, and the procedure above was more common. If you want to generate unsupervised sentence summaries, I highly recommend the following project, which is much more robust and produces better results (but also includes unk tokens): https://github.com/raphael-sch/HC_Sentence_Summarization
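
For anyone else looking for the file: a minimal sketch of building such a vocab.txt from token frequencies in the training data could look like the following. The special-token names, file paths, and lowercase/whitespace tokenization are assumptions here and may differ from what the repo's data loader actually expects.

```python
from collections import Counter

# Hypothetical special tokens; the exact names and ordering the model expects may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]

def build_vocab(corpus_path, vocab_path, size=7000):
    """Write the `size` most frequent training tokens to vocab_path, one token per line."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip().lower().split())  # assumed tokenization
    with open(vocab_path, "w", encoding="utf-8") as f:
        for token in SPECIAL_TOKENS:
            f.write(token + "\n")
        for token, _ in counts.most_common(size):
            f.write(token + "\n")

build_vocab("train.txt", "vocab.txt", size=7000)
```

Any token that does not make it into this list would then be mapped to the unk id during training and decoding, which is why the generated summaries can contain unk tokens.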

gaurav2699 commented 2 years ago

Hey Raphael, thank you so much for replying. Even after putting the top 7,000 words in the vocab.txt file, the results are still suboptimal and it doesn't seem like the model is learning to summarise the sentences properly. For instance, these are the results:

[Screenshot from 2022-03-08 22-58-13]

What could be the reason for such outputs? I feel something is wrong with the autoencoder model. Thanks!

mentaltraffic commented 11 months ago

Hello, I am trying to replicate the results of this paper. Could you share the vocab.txt file you used, @raphael-sch @gaurav2699? Thank you!