tqzhong / CG4MCTG

Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation @ ACL'2024
https://aclanthology.org/2024.acl-long.351.pdf
MIT License

sampled combinations in Yelp Hold-Out #3

Closed HectorAuvinen closed 1 month ago

HectorAuvinen commented 2 months ago

Hello,

I'm trying to reproduce the paper results but don't quite understand how sampling new combinations works.

I'm using the Yelp dataset in the Hold-Out mode with batch size 8 (the batch size used in the paper). However, in this setting, since the number of seen combinations (7) is smaller than the batch size, the dcg training script tries to sample new combinations. As I understand it, I would only need 1 more combination to train, but this does not work. I also tried other values for num_sample_combs, but none of them seem to work. Could you explain how you did the sampling in the original experiments? Thank you!

Kind regards, Hector

tqzhong commented 2 months ago

Hi Hector,

In the Hold-Out protocol of the Yelp dataset, we always set num_sample_combs to 2, but it is important to adjust the value of the hyperparameter lambda_s. In our experiments, we found that this value significantly impacts the results and usually requires a grid search. Furthermore, the results presented in the paper are based on a complete traversal of the split conditions under each protocol (with the generated results aggregated before testing). Therefore, in this setup, it is necessary to run through all 8 cases, with idx ranging from 0 to 7.
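For concreteness, a minimal sketch of this traversal (the flag spellings below are assumptions for illustration; check the repository's training scripts, e.g. dcg_meta.py, for the actual interface):

```python
import subprocess

# Train/generate on every Hold-Out split (idx 0 to 7), then pool the generations
# from all runs before evaluation.
for idx in range(8):
    subprocess.run(
        [
            "python", "dcg_meta.py",
            "--dataset", "yelp",          # assumed flag names, for illustration only
            "--protocol", "holdout",
            "--idx", str(idx),
            "--num_sample_combs", "2",
            "--lambda_s", "0.01",         # a value inside the grid-searched range discussed below
        ],
        check=True,
    )

# The generations from all 8 runs are then concatenated into a "seen" pool and an
# "unseen" pool, and each pool is evaluated once.
```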

HectorAuvinen commented 2 months ago

Hello,

Thank you for the active support in this repository!

Does the same hold for other datasets in the Hold-Out protocol (regarding num_sample_combs and the lambda_s grid search)? And how does this differ in the other protocols where we have unseen combinations (Few-shot, ACD)?

tqzhong commented 2 months ago

Hi Hector, thanks for your attention to our work! Generally, the parameter num_sample_combs is used in the ACD and Hold-Out protocols of the YELP dataset and is typically set to 2. For the parameter lambda_s, we always perform a grid search to get the best result for each scenario (each combination of dataset and protocol), and we found that the best results are usually achieved when lambda_s is within the range [0.001, 0.05].

However, for the Few-Shot protocol, the Meta-MCTG framework is not applicable. Details can be found in the Limitations section of our paper.

As we usually set the batch_size to 8, we only use num_sample_combs for the YELP and Mixture datasets, where the total number of attribute combinations is 8. For Fyelp and Amazon, we don't need to use it.

HectorAuvinen commented 2 months ago

Hello,

Thanks for clarifying all of this. And for evaluating each protocol, are all of the idx-split generations at the last epoch just concatenated and then evaluated separately for seen and unseen combinations?

tqzhong commented 2 months ago

Yes, the generations from all idx splits are grouped into seen and unseen, and then tested separately.

HectorAuvinen commented 2 months ago

Would you be able to further explain/motivate the use of the loss parameter alpha_s (or point me to the relevant section in the paper)? I don't see a reference to this in the paper (only to the disentanglement loss weight $\alpha$). Thank you.

Zhaoyi-Li21 commented 2 months ago

> Would you be able to further explain/motivate the use of the loss parameter alpha_s (or point me to the relevant section in the paper)? I don't see a reference to this in the paper (only to the disentanglement loss weight $\alpha$). Thank you.

Hey Hector, thanks for your attention. Do you refer to the hyper-parameter $\lambda_s$?

If so, $\lambda_s$ is a weight parameter (it trades off the original language modeling training loss $L_{train}$ against the meta-learning loss $L_{pseudo}$). Please refer to Equation (8) in Section 4.1 for the detailed explanation and motivation.
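For concreteness, one natural reading of this trade-off (a hedged sketch only; the authoritative formulation is Equation (8) in the paper) is

$$L_{total} = L_{train} + \lambda_s \, L_{pseudo},$$

so a larger $\lambda_s$ puts more weight on the pseudo-compositional (meta-learning) loss.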

HectorAuvinen commented 2 months ago

> Hey Hector, thanks for your attention. Do you refer to the hyper-parameter $\lambda_s$?
>
> If so, $\lambda_s$ is a weight parameter (it trades off the original language modeling training loss $L_{train}$ against the meta-learning loss $L_{pseudo}$). Please refer to Equation (8) in Section 4.1 for the detailed explanation and motivation.

Hi!

Sorry, I meant to write $\lambda_s$.

Could you clarify why this loss is applied every time we need to do sampling training (i.e. when the number of seen combinations is smaller than the mini-batch size)? When running DCG with one of the protocols on Yelp or Mixture, shouldn't the loss just be the one defined in the DCG paper:

$\alpha$ loss_support_dis + (1 - $\alpha$) loss_support_lm

Currently, this loss is scaled by $\lambda_s$ every time we need to sample combinations (in the support data phase). If the meta-learning scheme has to be applied every time we sample new combinations, then I am not sure how the Hold-Out/ACD/Few-Shot protocol results were produced (I am currently trying to reproduce these for Yelp and Mixture, where we do not have enough combinations out of the box for batch size 8).

tqzhong commented 2 months ago

Hi Hector,

Overall, the loss function of Meta-MCTG is split into two parts. The first part is the baseline model's own loss, and the second part is the pseudo-compositional loss, which is obtained through compositional sampling (and is also calculated with the baseline's loss formula). During training, compositional sampling works in two main scenarios.

The first scenario is when there are enough attribute combinations in the training set (more than the mini-batch size). In this case, when sampling the pseudo-batch, you can simply sample directly from the training set based on the constraints (step 3 in Algorithm 2). This is because there’s no risk of a single training batch using up all the attribute combinations in the training set.

The second scenario is when there aren't enough attribute combinations in the training set (equal to or fewer than the mini-batch size). In this case, you would use the --sample_train and --num_sample_combs flags. This way, the total number of attribute combinations sampled for a training batch is fixed. For example, under the YELP Hold-Out setting, the number of seen attribute combinations is 7. If you set num_sample_combs=2, each sampled training batch will include at most 2 different attribute combinations. This ensures smooth sampling for the pseudo-batch.
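As a rough illustration of this second scenario, here is a minimal sketch of the idea behind --sample_train / --num_sample_combs (illustrative only; everything beyond the flag names mentioned above is an assumption, not the repository's exact implementation):

```python
import random

def sample_batch_combs(seen_combs, batch_size=8, num_sample_combs=2):
    # Restrict each training batch to at most `num_sample_combs` distinct attribute
    # combinations, so a single batch cannot use up all 7 seen combinations and there
    # is always something left over for building the pseudo-compositional batch.
    allowed = random.sample(seen_combs, num_sample_combs)
    return [random.choice(allowed) for _ in range(batch_size)]

seen_combs = [f"comb_{i}" for i in range(7)]   # YELP Hold-Out: 7 seen combinations
print(sample_batch_combs(seen_combs))          # e.g. ['comb_4', 'comb_1', 'comb_4', ...]
```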

Zhaoyi-Li21 commented 2 months ago

Hi! Please feel free to re-open the issue if any questions remain unaddressed :)

HectorAuvinen commented 2 months ago

Hi!

I still had some questions about reproducing the results. I consistently get worse results when running dcg_meta.py. I narrowed the experiments down to just the "Original" experiments with the two smaller datasets, but with Yelp I get roughly 5% lower average attribute accuracy and 8-10 points higher perplexity. With Mixture the problem is worse and the variance of the results is high. Over multiple runs with different seeds, the perplexity is always over 100 (and sometimes very bad, e.g. 400) compared to the reported 68.44, and the average attribute accuracy is 2-10% lower.

Thank you!

tqzhong commented 2 months ago

Hi!

HectorAuvinen commented 2 months ago

Hello,

I see, thank you!

Would you be able to estimate when you will be publishing the code for the other baselines (specifically the joint-training based methods)?

I also wanted to ask about the configuration for contrastive prefix-tuning. Prefix length is reported in the paper as 10 but there are no references to prefix_mid_size. What value did you use here? And am I correct in assuming that

tqzhong commented 2 months ago

Hi,

The code for other baselines is expected to be uploaded in about 2 weeks. Thank you for your patience. We set prefix_mid_size to 512 for all the experiments. And your understanding of these two values is correct.

HectorAuvinen commented 1 month ago

Hello,

Thanks!

I noticed something about the Few-Shot protocol. I ran this protocol with DCG using 2 sample combinations and saw that all the support losses (sloss, lm_sloss and dis_sloss) are always 0.0. This seems to happen because support_combs is built from the current batch's combinations, which will always contain only the seen combinations, and thus the support data training phase is never initiated. Why aren't the support combinations created from all available combinations? Currently, in the Few-Shot scenario (I'm not sure how this works in ACD and Hold-Out), we only train on the seen data and the pseudo combinations.

tqzhong commented 1 month ago

Hi,

The Meta-MCTG training method samples the pseudo-compositional batch from new attribute combinations that can be reassembled from the attribute combinations within the current training batch. If all attribute combinations were used for reassembly, the pseudo-compositional batch would lose its significance, since its purpose is to simulate generalization data for the current training batch. In the Few-Shot scenario, by design it is impossible to sample a corresponding pseudo-compositional batch for any given training batch, which results in the support_loss being zero. Therefore, we have specified in the Limitations that the Meta-MCTG algorithm cannot be applied to the Few-Shot scenario.
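To make the sampling criterion concrete, here is a small sketch (illustrative only, with made-up two-aspect combinations; not the repository's exact code) of why the candidate set is empty in the Few-Shot scenario:

```python
from itertools import product

def pseudo_comb_candidates(batch_combs, seen_combs):
    # Recombine the attribute values that appear in the current training batch,
    # drop the combinations already in the batch, and keep only those that are
    # still part of the training (seen) set.
    values_per_aspect = [set(values) for values in zip(*batch_combs)]
    recombined = set(product(*values_per_aspect))
    return (recombined - set(batch_combs)) & set(seen_combs)

batch = [("a1", "b1"), ("a2", "b2")]

# Hold-Out-like case: a recombination of the batch's attributes is still a seen combination.
seen = {("a1", "b1"), ("a2", "b2"), ("a1", "b2")}
print(pseudo_comb_candidates(batch, seen))           # {('a1', 'b2')}

# Few-Shot-like case: by construction, recombinations of seen combinations are unseen,
# so the candidate set is empty and the support loss stays at zero.
seen_few_shot = {("a1", "b1"), ("a2", "b2")}
print(pseudo_comb_candidates(batch, seen_few_shot))  # set()
```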

HectorAuvinen commented 1 month ago

Hi,

Thanks for the quick response! I overlooked this and realize now that this was specifically discussed in the paper.

Just to mention: I noticed two things in the dcg_meta.py training script. 1) The collate function padding_fuse_fn pads the attention mask with the EOS token (50256) instead of using 0s. The attention mask should only contain 1s for the relevant part of the input and 0s otherwise. 2) The ignore_index used in the CE loss class is set to 0. Shouldn't this be set to the EOS token that is also used as the padding token (token 50256)?
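For reference, here is a minimal sketch of a collate/loss setup consistent with those two fixes (illustrative only; this is not the repository's actual padding_fuse_fn):

```python
import torch

EOS_ID = 50256  # GPT-2's EOS token, also used as the padding token here

def padding_collate(token_id_lists):
    # Pad input_ids with the EOS token, but pad the attention mask with 0s so that
    # padded positions are masked out rather than attended to.
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        pad_len = max_len - len(ids)
        input_ids.append(ids + [EOS_ID] * pad_len)
        attention_mask.append([1] * len(ids) + [0] * pad_len)
    return torch.tensor(input_ids), torch.tensor(attention_mask)

# With EOS doubling as the pad token, the cross-entropy loss should ignore that id
# (rather than id 0, which is a real vocabulary token).
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=EOS_ID)
```

Note that ignoring the EOS id also ignores genuine end-of-sequence targets; the common alternative is to keep a separate label tensor with -100 at padded positions (PyTorch's default ignore_index).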

tqzhong commented 1 month ago

Thanks for pointing out these two mistakes! We have already corrected them.