haitian-sun / GraftNet

BSD 2-Clause "Simplified" License

PullNet subgraph creation/questions #19

Closed SamsonYuBaiJian closed 3 years ago

SamsonYuBaiJian commented 3 years ago

Hi, thanks for your interest. The implementation of PullNet uses some Google internal tools, so it is a bit hard to open source. Please let me know if you have any questions. I'm happy to help.

On Sep 8, 2020, at 5:42 AM, Wangyinquan notifications@github.com wrote:

 Hello Dr. Sun, thanks for open-sourcing this code. I've read your other wonderful paper, "PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text". Will its code be open-sourced as well?


Originally posted by @OceanskySun in https://github.com/OceanskySun/GraftNet/issues/14#issuecomment-689216248

SamsonYuBaiJian commented 3 years ago

Hi there @OceanskySun, I've read your recent paper PullNet, and I'm very curious about how training is done for subgraphs where the answer entities are <T away from the question entities along the shortest path(s) (T being the number of iterations of subgraph expansion).

For some QA subgraphs, the maximum shortest-path distance between the question and answer entities might be <T. If so, is training terminated early?

And what about inference? Does it always run for T iterations before terminating?

Thank you!

SamsonYuBaiJian commented 3 years ago

Also, for the similarity defined as the dot product of the last-state LSTM representation for the query with the embedding for the relation, it is mentioned that the relation embeddings are looked up from an embedding table.

Are these embeddings fixed and randomly initialised, then perhaps saved? If not, how are they initialised? Thank you!

haitian-sun commented 3 years ago

Hi Samson,

PullNet will run retrieval for T steps. GRAFT-Net will then be executed on the retrieved graph. GRAFT-Net always runs T steps of convolution; it should figure out by itself which path it wants to take.

Relation embeddings are randomly initialized and trained with the model.
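To make the control flow concrete, here is a rough, hypothetical sketch of what that looks like end to end. `pull_step` and `graftnet` are placeholder callables standing in for the paper's components, not anything in this repo:

```python
def pullnet_answer(question, question_entities, T, pull_step, graftnet):
    """Hypothetical outline: `pull_step` and `graftnet` stand in for the
    retrieval and GNN components described in the paper."""
    graph = set(question_entities)              # start from entities linked in the question
    for t in range(T):                          # retrieval always runs for T steps
        graph |= pull_step(question, graph, t)  # pull new facts/documents/entities
    # GRAFT-Net then runs T steps of convolution over the final retrieved graph
    # and scores every entity node as a candidate answer; the GNN itself decides
    # which reasoning path to follow, even if the true answer is closer than T hops.
    return graftnet(question, graph)
```

Note that the retrieval loop length is fixed at T regardless of how far the answer actually is; no early termination is needed.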

Thanks, Haitian


SamsonYuBaiJian commented 3 years ago

Thank you for getting back to me!

"PullNet will run retrieval for T steps." What happens when the maximum distance between the question and answer entities in an ideal question subgraph is <T during training? How is the _classifypullnodes classifier trained for iterations beyond the maximum distance in the subgraph, where there are no positive examples?

"Relation embeddings is randomly initialized and trained with the model." Are these relation embeddings the same as those used for GRAFT-Net during the classify functions, or are they separate?

Thanks, Samson

SamsonYuBaiJian commented 3 years ago

Hi @OceanskySun, to follow up on my previous message, I would like to ask:

1) "PullNet will run retrieval for T steps." What happens when the maximum distance between the question and answer entities in the ground truth question subgraph is <T during training? How is the classify_pullnodes classifier trained for iterations beyond the maximum distance in the subgraph, where there are no positive examples? Will the ground truth labels just be all 0?

2) "Relation embeddings is randomly initialized and trained with the model." Are the relation embeddings used for the LSTM the same as those used for GRAFT-Net during the classify functions, or are they separate?

3) How is the LSTM trained and what loss function is used? Do I first constrain the relations to relevant facts, then feed these relations one by one into the dot product and compute a BCE loss for each? Or do I feed in all relations, then compute the BCE loss over all of them?

4) Finally, when I get the relevant relations/facts to rank, let's say a certain relation is ranked top and there are 10 occurrences of that relation, but my limit is N_f=5, meaning I only retrieve 5 facts. How do I choose among the 10? Is it random?

Thank you for your response!!

haitian-sun commented 3 years ago

Hi there,

Thanks for your question.

  1. We will always run retrieval for T steps. It won't hurt if the oracle reasoning step is less than T; the GNN module will handle that. Distant supervision is not provided for steps beyond the oracle reasoning step, and the loss for those steps is masked out (see the sketch after this list).
  2. They are the same.
  3. Classifying over all relations always helps. You should compute the dot product for all relations and then filter for those that are connected to the node you would like to expand.
  4. Random at test time, but you may add the ground-truth facts manually to the graph at training time.
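As a reference for anyone reimplementing this, below is a hedged sketch of how points 1, 3 and 4 might look in code. Every name and signature is illustrative (this repo contains no PullNet code), and facts are assumed to be hashable tuples:

```python
import random
import torch
import torch.nn.functional as F

def expand_node(question_repr, relation_embeddings, connected_relations,
                facts_by_relation, n_f, ground_truth_facts=None):
    """Illustrative fact-pulling step; every argument is a hypothetical stand-in.

    question_repr:       (dim,) last-state LSTM representation of the question
    relation_embeddings: (num_relations, dim) trainable embedding table
    connected_relations: ids of relations attached to the node being expanded
    facts_by_relation:   dict mapping relation id -> list of candidate facts
    """
    # (3) Score ALL relations with the dot-product classifier, then restrict
    # attention to the relations actually connected to this node.
    all_scores = torch.sigmoid(relation_embeddings @ question_repr)  # (num_relations,)
    best_rel = max(connected_relations, key=lambda r: all_scores[r].item())

    # (4) Cap the number of retrieved facts at n_f: a random choice at test
    # time, with the ground-truth facts added manually during training.
    facts = facts_by_relation[best_rel]
    selected = random.sample(facts, min(n_f, len(facts)))
    if ground_truth_facts is not None:          # training time
        selected = list(set(selected) | set(ground_truth_facts))
    return selected

def masked_bce_loss(scores, labels, supervised_mask):
    # (1) Steps beyond the oracle reasoning depth have no positive labels;
    # their terms are simply masked out of the loss.
    return F.binary_cross_entropy(scores[supervised_mask], labels[supervised_mask])
```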

Please let me know if you have more questions.

Thanks, Haitian


SamsonYuBaiJian commented 3 years ago

Thank you for your excellent work and help!

SamsonYuBaiJian commented 3 years ago

Hi @OceanskySun, I have a few final questions for you, and would appreciate it if you could help with these:

1. Are the word embedding layers also the same for GRAFT-Net and the LSTM for the similarity function?
2. In this case, since the max number of words/texts/entities to retrieve is not fixed by the size of the training dataset, how do you set them? Is it set by the user?
3. Do you prioritise the ground-truth entities you retrieve for teacher forcing during training, e.g. add them first to the list of retrieved entities, or is it randomised?
4. How often do you train/backpropagate the loss for the two models? Is it for every t in T?
5. For answer selection, it seems like only one answer entity node (the node with the greatest probability) is retrieved, but what about the cases in the dataset where there are multiple answers?

Thank you!

haitian-sun commented 3 years ago

Hi there. Thanks for your questions. Please see my answers below.

  1. Are the word embedding layers also the same for GRAFT-Net and the LSTM for the similarity function? Yes. You can also train different word embeddings separately. It shouldn’t matter too much.

  2. In this case, since the max number of words/texts/entities to retrieve is not fixed by the size of the training dataset, how do you set them? Is it set by the user? You can treat them as hyper-parameters and tune them on the dev set.

  3. Do you prioritise the ground-truth entities you retrieve for teacher forcing during training, e.g. add them first to the list of retrieved entities, or is it randomised? Yes, we do always add the ground-truth entities to the graph at training time.

  4. How often do you train/backpropagate the loss for the two models? Is it for every t in T? Yes, for each iteration.

  5. For answer selection, it seems like only one answer entity node (node with greatest probability) is retrieved, but what about the cases in the dataset where there are multiple answers? We simply measure the Hits@1 of the dataset, so we only take the most confident entity as the answer. You may think of taking top-k instead. Another possible solution is to consider each candidate separately: for example, you can run a sigmoid on the logits of the candidates and then compute the binary cross-entropy loss. As a side note, from our experiments we find softmax is usually easier to optimize.
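For illustration, a minimal sketch (not from the original code) of the two training options mentioned in point 5, assuming a vector of logits over candidate entity nodes:

```python
import torch
import torch.nn.functional as F

num_candidates = 8
logits = torch.randn(num_candidates)      # one score per candidate entity node
gold = torch.zeros(num_candidates)
gold[2] = gold[5] = 1.0                   # example with multiple correct answers

# Softmax over candidates: cross-entropy against the normalized answer set
# (one common way to handle multiple gold answers with a softmax).
loss_softmax = -(gold / gold.sum() * F.log_softmax(logits, dim=0)).sum()

# Per-candidate alternative: sigmoid on each logit + binary cross-entropy.
loss_bce = F.binary_cross_entropy_with_logits(logits, gold)

# At evaluation, only the most confident entity is taken as the answer (Hits@1).
prediction = logits.argmax().item()
```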

Thanks, Haitian


SamsonYuBaiJian commented 3 years ago

Hi @OceanskySun , may I check how the softmax classification is done in your case, when you take the top-1 hits/precision, since there may be multiple answers? Thank you.

haitian-sun commented 3 years ago

Hits@1 counts a prediction as correct if any of the correct answers is predicted.
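In other words, with several gold answers the single top-ranked prediction counts as a hit if it matches any of them; a trivial sketch of this metric:

```python
def hits_at_1(predicted_entity, gold_answers):
    # One prediction per question: a hit if it is any of the correct answers.
    return 1.0 if predicted_entity in set(gold_answers) else 0.0
```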


SamsonYuBaiJian commented 3 years ago

Hi @OceanskySun, thank you for all your help so far. I realise that there are quite a few important decisions that you have made for training:

1. What is graph recall? Is it answer entities, answer + intermediate entities, or answer + intermediate entities + facts in the ideal subgraph?
2. How did you improve graph recall for testing, especially for documents-only runs? Which hyperparameters had the most influence?
3. Did you include the titles of documents for PyLucene when calculating document similarity, and how did you do so (e.g. a different field)?
4. What are the most impactful hyperparameters in your opinion?
5. Is there a classification/recall trade-off? For example, if your max local entities is set too high, will the classification performance start decreasing?

Thanks @OceanskySun for all the help so far, you have answered so many of my questions...

haitian-sun commented 3 years ago

Hi,

Thanks for your question.

  1. The graph recall is computed as the number of correct answers retrieved in the subgraph versus all correct answers (a sketch follows below).
  2. Retrieving in the documents-only setting is challenging. We used Lucene for the retrieval. We require that the retrieved documents contain the entity at the node. The retrieved paragraphs are sorted by their Lucene score. You can try some better similarity functions, e.g. a neural-based one.
  3. Yes, we append the title to the beginning of the document.
  4. Think about how many new nodes you would like to retrieve at each step. I am not a fan of hyper-parameter tuning, to be honest...
  5. I can't remember that exactly. The number of nodes to grow is usually set somewhere between 1 and 10, to maintain a high-recall graph.
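A minimal sketch of the recall notion in point 1, as described above (the names here are illustrative, not from the original code):

```python
def graph_recall(retrieved_entities, gold_answers):
    # Fraction of correct answers that made it into the retrieved subgraph.
    gold = set(gold_answers)
    return len(gold & set(retrieved_entities)) / max(len(gold), 1)
```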

Thanks, Haitian


haitian-sun commented 3 years ago

The relation embeddings for r and the LSTM states are trained.

Just a note here that the LSTM states and relation embeddings are also used in other places and will be trained there as well.

On May 6, 2021, at 9:06 AM, dinani65 @.***> wrote:

I am missing some information about the similarity function as a classifier in the step of building subgraphs. "Similarity is defined as the dot-product of the last-state LSTM representation for q with the embedding for r. This dot-product is then passed through a sigmoid function to bring it into a range of [0,1]: as we explain below, we will train this similarity function as a classifier which predicts which retrieved facts are relevant to the question q." Based on the above part of the paper, the similarity score is calculated as the dot product of two tensors followed by a sigmoid function. I cannot understand which part needs to be trained here; it seems to be a purely mathematical operation used for classification. Which part of the function needs to be trained?

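To make the trainable pieces concrete, here is a minimal sketch (not the original implementation) of the similarity function as described in the paper: the word embedding table, the LSTM weights, and the relation embedding table are all learned parameters, while the dot product and sigmoid themselves contain nothing to train.

```python
import torch
import torch.nn as nn

class FactScorer(nn.Module):
    """Sketch of sigmoid( last_LSTM_state(q) . relation_embedding(r) )."""

    def __init__(self, num_words, num_relations, dim):
        super().__init__()
        # Trainable parts: word embeddings, LSTM weights, relation embeddings.
        self.word_emb = nn.Embedding(num_words, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.rel_emb = nn.Embedding(num_relations, dim)

    def forward(self, question_word_ids, relation_ids):
        _, (h_n, _) = self.lstm(self.word_emb(question_word_ids))  # h_n: (1, batch, dim)
        q = h_n.squeeze(0)                       # last-state representation of the question
        r = self.rel_emb(relation_ids)           # relation embedding looked up from a table
        return torch.sigmoid((q * r).sum(-1))    # dot product squashed into [0, 1]
```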

vardaan123 commented 2 years ago

Hi Haitian Sun,

Thanks for the great work! I have some questions regarding the subgraph construction in your work PullNet (also the topic of this issue)

  1. In Sec 3.3, the candidate intermediate entities are defined as the entities which occur on the shortest paths from the question nodes to the answer nodes. It is mentioned that "When we train the classify_pullnodes classifier in iteration t, we treat as positive examples only those entities e' that are connected to a candidate intermediate entity e with distance e_t = t+1". Does it mean distance t+1 from any entity occurring on the gold path? What is the intuition behind this?
  2. In the next sentence, it is mentioned that "This encourages the retrieval to focus on nodes that lie on shortest paths to an answer". This sounds more intuitive to me. If I understand correctly, it means only the nodes that occur on a shortest path from question to answer nodes are considered positive in the whole KG. But this conflicts with my understanding of point 1.
  3. How is the retrieval done at inference time? Is it the set of all nodes with predicted score > epsilon, or the top-k nodes?

Thanks for your patience! I eagerly look forward to hearing from you.