graph4ai / graph4nlp

Graph4NLP is a library for the easy use of Graph Neural Networks for NLP. Visit our DLG4NLP website (https://dlg4nlp.github.io/index.html) for various learning resources!
Apache License 2.0

multi-token EmbeddingConstruction #520

Closed nashid closed 2 years ago

nashid commented 2 years ago

📚 Documentation

I am confused with the EmbeddingConstruction documentation.

    embedding_style:
      single_token_item: false
      emb_strategy: "w2v_bilstm"

If I understand correctly, when a node attribute contains multiple tokens, e.g., graph_data.node_attributes[index]["token"] = "I am multiple tokens", then I should set single_token_item to false. Is this understanding correct?

Secondly, why can't I set seq_info_encode_strategy to true for multi-token items?
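To make the question concrete, here is a small dependency-free sketch (hypothetical helper, not graph4nlp's API) showing the kind of node attributes that call for single_token_item: false:

```python
# Hypothetical sketch (not graph4nlp's actual API): deciding the
# single_token_item setting from the node attributes themselves.
node_attributes = {
    0: {"token": "I am multiple tokens"},  # multi-token node
    1: {"token": "hello"},                 # single-token node
}

def needs_multi_token_config(attrs):
    """Return True if any node's "token" attribute holds more than one token."""
    return any(len(a["token"].split()) > 1 for a in attrs.values())

single_token_item = not needs_multi_token_config(node_attributes)
print(single_token_item)  # False: at least one node has multiple tokens
```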

nashid commented 2 years ago

@AlanSwift @hugochan can anyone please help me with this query?

hugochan commented 2 years ago


@nashid Thank you for your attention to the library! 1) Yes, you are correct about the first statement. 2) seq_info_encode_strategy specifies strategies for encoding sequential information in raw text (i.e., each token in the raw text is a graph node). If a graph node contains multiple tokens, we think it generally doesn't make sense to encode the raw sequential text to help initialize node embeddings. Can you give concrete cases where you think it makes sense to turn on seq_info_encode_strategy for multi-token nodes?
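For intuition, a per-node sequence-encoding strategy pools each node's own token vectors into one node embedding. The sketch below is plain Python with toy 3-dimensional vectors and mean pooling standing in for a BiLSTM; it is conceptual only, not graph4nlp code:

```python
# Conceptual sketch (plain Python, not graph4nlp code): initialize a
# multi-token node's embedding by pooling its token embeddings, which is
# what a w2v_bilstm-style strategy does per node (with a BiLSTM, not a mean).
word_vecs = {  # toy "pretrained" vectors, 3-dim for illustration
    "print": [1.0, 0.0, 0.0],
    "hello": [0.0, 1.0, 0.0],
    "world": [0.0, 0.0, 1.0],
}

def node_embedding(tokens):
    """Mean-pool token vectors; a stand-in for a per-node sequence encoder.
    Out-of-vocabulary tokens fall back to a zero vector."""
    vecs = [word_vecs.get(t, [0.0] * 3) for t in tokens]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

emb = node_embedding("print hello world".split())
```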

nashid commented 2 years ago

@hugochan I am applying the Graph2Seq model to the source code (JAVA).

  1. A single line of source code in a single node.
  2. Different lines are then connected with next-line edge.

I am wondering: why not leave this to the user of the library? Users could set seq_info_encode_strategy according to their use case.
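The setup described above can be sketched as follows (illustrative names only, not graph4nlp's graph-construction API):

```python
# Hypothetical sketch of the described setup: one node per source line,
# with consecutive lines connected by a "next-line" edge.
java_snippet = [
    "int x = 0;",
    "x += 1;",
    "System.out.println(x);",
]

# Each node attribute holds a whole line, i.e., a multi-token item.
node_attributes = {i: {"token": line} for i, line in enumerate(java_snippet)}
edges = [(i, i + 1, "next-line") for i in range(len(java_snippet) - 1)]
```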

hugochan commented 2 years ago


@nashid That's a good question! We want to keep a good balance between giving users enough flexibility and providing off-the-shelf solutions that are proven effective in the existing literature. For multi-token nodes, we think it is common practice to run a sequence encoder on each node to initialize the node embeddings in many scenarios (e.g., IE graphs, knowledge graphs). Users are encouraged to build their own customized embedding initialization strategy if the built-in options don't suit their needs.

nashid commented 2 years ago

@hugochan if I use pre-trained embeddings (let's say word2vec or GloVe), I presume I can just feed the pre-trained embeddings.

For multi-token node, would that work? Can you please point me to an example, if you have one?

hugochan commented 2 years ago


@nashid Yes, you can refer to this text summarization example, which constructs an IE graph (containing multi-token nodes).
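The usual pattern for feeding pre-trained embeddings is to build an embedding matrix indexed by the vocabulary, copying each known word's vector and randomly initializing out-of-vocabulary entries. A minimal sketch with assumed toy data (not the example's actual code):

```python
# Sketch (assumed names and toy 2-dim vectors, not the example's actual code):
# build an embedding matrix so each token id maps to its word2vec/GloVe
# vector, with small random init for out-of-vocabulary tokens.
import random

pretrained = {"graph": [0.1, 0.2], "neural": [0.3, 0.4]}
vocab = ["<unk>", "graph", "neural", "network"]

random.seed(0)  # make the random OOV init reproducible
emb_matrix = [
    pretrained.get(w, [random.uniform(-0.1, 0.1) for _ in range(2)])
    for w in vocab
]
```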