Closed: asahi417 closed this issue 3 years ago
Thanks for your interest in our paper! The main reason we excluded proper nouns was that when we included them, the trigger search would sometimes "overfit" by selecting the gold object or tokens very similar to the answer, especially for T-REx relations that were heavily skewed towards one gold object. For example, relation P30 (is located in) consisted mostly of triples where "Antarctica" was the object, and when we generated prompts for that relation, "Antarctica" or "Antarctic" would pop up frequently. We wanted to evaluate whether LMs could correctly fill in the blank without using prompts that could potentially give the answer away to the model.

So to answer your question, whether to exclude proper nouns and gold objects depends on your goal. If you are working off of LAMA, you'd probably also want prompts that are more "generalized". In terms of whether excluding proper nouns affected the performance metrics of the prompts, our findings were inconclusive: some relation-specific prompts improved in test P@1 when proper nouns were filtered out, others decreased, and some precision scores stayed the same.
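For reference, this kind of filter could look roughly like the sketch below. This is not our actual implementation: it assumes NLTK (with its POS-tagger data downloaded) and tags word-level tokens one at a time, which is crude but illustrates the idea of dropping proper nouns and gold-object tokens from the candidate vocabulary.

```python
import nltk  # assumes nltk and the 'averaged_perceptron_tagger' data are installed

def filter_candidates(candidate_tokens, gold_objects):
    """Drop proper nouns (NNP/NNPS) and any token appearing in a gold object,
    so the searched prompt cannot simply leak the answer (e.g. "Antarctica")."""
    gold_vocab = {w.lower() for obj in gold_objects for w in obj.split()}
    kept = []
    for tok in candidate_tokens:
        _, tag = nltk.pos_tag([tok])[0]      # tagging single tokens is crude; sketch only
        if tag in ("NNP", "NNPS"):
            continue                         # proper noun
        if tok.lower() in gold_vocab:
            continue                         # token identical to a gold object
        kept.append(tok)
    return kept
```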
Thanks for sharing the result, and that's very interesting indeed. So it might be safest to evaluate on a validation set whether we should use the filter.
Do you also have any advice on choosing the number of iterations? Actually, I'm curious how the value used in each experiment (the --iters argument) was decided. Is the behavior stable after a certain number of steps, or can it overfit and lose accuracy with longer runs? I could fall back on a brute-force approach that saves the result every iteration and picks the one with the best validation accuracy at the end (roughly as in the sketch below), but if the accuracy improves approximately monotonically with each step, I can just take the last result after a fairly long run.
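Something like this rough sketch, where search_step and evaluate_dev are placeholders I'd supply myself, not functions from this repository:

```python
def search_with_checkpoints(trigger, search_step, evaluate_dev, num_iters):
    """Run the trigger search for num_iters steps, score the trigger on the
    validation set after every step, and return the best-scoring one.

    search_step(trigger) -> trigger  : one search update (placeholder)
    evaluate_dev(trigger) -> float   : validation P@1 (placeholder)
    """
    best_acc, best_trigger = float("-inf"), trigger
    for _ in range(num_iters):
        trigger = search_step(trigger)
        acc = evaluate_dev(trigger)
        if acc > best_acc:
            best_acc, best_trigger = acc, trigger
    return best_trigger, best_acc
```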
We determined the number of iterations via trial and error. When I experimented with large values (e.g. 1000+) on the fact retrieval and relation extraction tasks, the trigger search would sometimes settle on the best prompt early and fail to find a better one for the rest of the search. Since the gradient-based prompt search's objective is to maximize the label likelihood, though, performance shouldn't decrease as the number of iterations increases.
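For context, the candidate-selection step in that search scores replacement tokens with a first-order approximation of the change in label log-likelihood. A minimal sketch of that scoring (not the exact code in this repository) might look like:

```python
import torch

def top_candidates(grad, embedding_matrix, k=10):
    """HotFlip-style first-order scoring: rank every vocabulary token by the
    dot product of its embedding with the gradient of log p(y | prompt) at a
    trigger position, and return the k highest-scoring token ids.

    grad:             [embed_dim] gradient at one trigger token's embedding
    embedding_matrix: [vocab_size, embed_dim] input embeddings of the LM
    """
    scores = embedding_matrix @ grad          # [vocab_size]
    return torch.topk(scores, k).indices      # candidate replacement tokens
```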
Hi, I really enjoyed reading the paper and appreciate that you made the code publicly available! I have a quick question about the filtering step where you exclude proper nouns and label tokens from the candidate vocabulary.
Would prompt quality degrade without that filtering, say on LAMA for instance? I'm thinking of adapting AutoPrompt to our setting, where the loss function is a bit different from what is shown in the paper, and I'm wondering whether I need this filtering or not.