Closed tomhosking closed 1 year ago
Hi Tom,
Sorry for the late response.
You can find the format for the VAE training data here: https://github.com/megagonlabs/coop/blob/main/scripts/preprocess.py#L31-L34 and refer to the following script to check the format for the dev/test sets: https://github.com/megagonlabs/coop/blob/main/scripts/get_summ.py#L40-L45
Thanks!
Thanks for this - I'll try to submit a PR with the preprocessing and configs needed to train on SPACE.
I did notice that Yelp and Amazon contain only 8 input reviews for evaluation, whereas SPACE can contain hundreds - this leads to an OOM problem when calculating all possible combinations of inputs at inference time. Is there a preferred way to "prune" the inputs, or is it OK to just limit the input to a random selection of 8 reviews?
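To be concrete, the pruning I have in mind is just something like this (a sketch only - `prune_reviews` is a name I've made up, not a function in the repo):

```python
import random

def prune_reviews(reviews, k=8, seed=0):
    """Limit inference inputs to a random sample of k reviews.

    Sketch of the simple option: mimic the Yelp/Amazon setting by
    sampling 8 reviews before running combination search.
    """
    if len(reviews) <= k:
        return list(reviews)
    rng = random.Random(seed)  # fixed seed for reproducible evaluation
    return rng.sample(list(reviews), k)
```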
Awesome! Thanks a lot!
OOM problem
Yes, 2^100 combinations would be far too many to explore exhaustively, so I have two suggestions for applying COOP to a larger number of input reviews:
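For context on the scale: Coop scores every non-empty combination of the input reviews, which grows as 2^N - 1. A quick illustration (this enumeration is illustrative only, not the actual search code from the repo):

```python
from itertools import chain, combinations

def all_input_subsets(reviews):
    """Enumerate every non-empty subset of the input reviews,
    i.e. the candidate space Coop searches over."""
    return list(chain.from_iterable(
        combinations(reviews, k) for k in range(1, len(reviews) + 1)
    ))

# With 8 reviews (the Yelp/Amazon setting) the space is tiny:
print(len(all_input_subsets(range(8))))  # 2^8 - 1 = 255
# With 100 reviews (SPACE), 2^100 - 1 subsets rules out exhaustive search.
print(2 ** 100 - 1)
```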
Ideally, approach 1 would be better based on the findings in the paper, but implementation-wise, approach 2 would be much easier, I think. (Sorry, I don't think I can share the beam search code used in the paper, and it might not be optimized for this situation anyway.)
Although I can't provide exact numbers or checkpoints right now, the BiMeanVAE model without Coop (i.e., SimpleAvg) showed good summarization performance in terms of ROUGE scores on the SPACE corpus when I tested it internally. So you may not need to apply Coop on SPACE to get a reasonable opinion summarizer, though I haven't tested the performance with Coop in this setting.
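For reference, SimpleAvg just averages the reviews' latent codes and decodes the mean, so it sidesteps combination search entirely. A minimal sketch (the encoder/decoder API is omitted and the function name is mine, not the repo's):

```python
import numpy as np

def simple_avg(latents):
    """SimpleAvg baseline: average the per-review latent vectors.

    `latents` is a list of encoder outputs, one per input review;
    the returned mean vector would be fed to the decoder to
    generate the summary (decoding not shown here).
    """
    z = np.stack(latents)   # shape: (num_reviews, latent_dim)
    return z.mean(axis=0)   # shape: (latent_dim,)
```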
Hi,
I'd like to train COOP on other datasets, e.g. SPACE - what format does the dataset need to be in?
Thanks!