**Closed** — samehkamaleldin closed this issue 8 years ago.
All of my questions have been answered after reading Section 5.1 of your EMNLP 2015 paper.
Just to be sure your questions get answered correctly, and in case anyone else stumbles upon this issue, the answers to your questions are these:

`generalization` is NELL's encoding of type relationships (i.e., encoding that Barack Obama is a person would use the edge type `generalization`).

`-#-` is the feature separator.

`ANYREL:` is a prefix denoting the type of a particular feature (other prefixes are `SOURCE:`, `TARGET:`, and `VECSIM:`).

And, in general, you can also just look at the code that generates the feature matrix (the link goes to `outputFeatureMatrix`, in case the line number changes in the future).
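To make the prefix scheme concrete, here is a minimal Python sketch (not part of the SFE codebase; the helper name is made up) that splits a `weights.tsv` feature string into its type prefix and its path:

```python
# Hypothetical helper, not from the SFE codebase: split a feature
# string into (prefix, path). Plain PRA path features carry no prefix.
KNOWN_PREFIXES = ("SOURCE:", "TARGET:", "ANYREL:", "VECSIM:")

def split_feature(feature):
    for prefix in KNOWN_PREFIXES:
        if feature.startswith(prefix):
            return prefix, feature[len(prefix):]
    return "", feature

print(split_feature("ANYREL:-_hypernym-@ANY_REL@-_hyponym-"))
# ('ANYREL:', '-_hypernym-@ANY_REL@-_hyponym-')
print(split_feature("-__part_of-"))
# ('', '-__part_of-')
```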
This is part of the `weights.tsv` file generated by training a `LogisticRegressionModel` on features from the WN18 dataset, for the relation `_has_part`:
```
-__part_of- 5.084440955701115
-__hyponym-_has_part-__hypernym- 1.7542840209412862
ANYREL:-_@ANY_REL@- 1.3409053287843857
-_hypernym-_hyponym-_hyponym- 1.0788205509200834
-__part_of-_part_of-__part_of- 1.045527586769945
ANYREL:-_hypernym-_@ANY_REL@-_hyponym- 1.0418866564174691
-_member_of_domain_topic-__part_of-__hypernym- 1.0258116342225538
-_member_of_domain_topic-_has_part-__hypernym- 1.0258116342225538
ANYREL:-_has_part-_hypernym-@ANY_REL@-__hypernym- 1.0040495743253814
ANYREL:-__hyponym-_hypernym-@ANY_REL@-_hyponym- 0.944886623292448
```
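For reference, a listing like the one above can be read back into (feature, weight) pairs with a few lines of Python. This is an illustrative sketch, assuming one feature and one weight per line separated by whitespace (the real file is tab-separated):

```python
# Sketch: parse a weights.tsv-style listing into a feature -> weight map.
lines = """\
-__part_of- 5.084440955701115
-__hyponym-_has_part-__hypernym- 1.7542840209412862
ANYREL:-_@ANY_REL@- 1.3409053287843857
""".splitlines()

weights = {}
for line in lines:
    feature, weight = line.rsplit(None, 1)  # split on the last whitespace
    weights[feature] = float(weight)

# The strongest feature for _has_part is the inverse-looking path:
print(max(weights, key=weights.get))  # -> -__part_of-
```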
I find it strange that the relation `_has_part` itself appears in the feature set, which I didn't expect. As I understand it, the feature extraction shouldn't take the target relation into account, but here it does.
And here is how I generate the data and do the feature extraction:
```scala
val SFE_SPECS =
  """
  {
    "type": "subgraphs",
    "path finder": {
      "type": "BfsPathFinder",
      "number of steps": 2
    },
    "feature extractors": [
      "PraFeatureExtractor",
      "AnyRelFeatureExtractor"
    ],
    "feature size": -1
  }
  """
val relationName = "_has_part"
val negativeExampleSelector = new PprNegativeExampleSelector(params \ "negative instances", graph, outputter)
val data_with_negatives = negativeExampleSelector.selectNegativeExamples(data, possibleSources, possibleTargets)
val sfeGenerator = new NodePairSubgraphFeatureGenerator(
  parse(SFE_SPECS),
  relationName,
  RelationMetadata.empty,
  outputter
)
val trainingMatrix = sfeGenerator.createTrainingMatrix(data_with_negatives)
```
What is `ANYREL:`? What is the type of these features and what do they represent?

Consider the relation WriterWroteBook. Say you have two facts, (JK Rowling, wrote, Harry Potter 1) and (JK Rowling, wrote, Harry Potter 2), and that you know (Harry Potter 1, sequel, Harry Potter 2). The path (JKR --wrote--> Harry Potter 1 --sequel--> Harry Potter 2) is incredibly informative for predicting that JKR wrote HP2. Similarly, the path [wrote, _sequel] is informative for predicting that JKR wrote HP1. The trouble is that if both of these are training examples, and you remove both of them when extracting features, you can't use either feature. So what I do is some fancy footwork in the code so that an edge is only excluded if it's the current training (or testing) edge; the (JKR, HP1) example can use the (JKR, HP2) edge, and vice versa. So, yes, you will see features for `_has_part` that contain `_has_part` in a longer path, but this is not cheating, because they use other known instances of `_has_part`. If the model were actually using the training or testing edge, you would see basically perfect accuracy from the classifier.
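The exclusion rule described above can be sketched in a few lines. This is an illustrative toy with made-up edges and a made-up function name, not the actual SFE code:

```python
# Sketch (not the actual SFE code) of the edge-exclusion rule: when
# extracting features for a training instance, only that instance's own
# edge is hidden, not every edge with the same relation.
known_edges = {
    ("JKR", "wrote", "HP1"),
    ("JKR", "wrote", "HP2"),
    ("HP1", "sequel", "HP2"),
}

def visible_edges(current_instance):
    """All edges usable for feature extraction on current_instance."""
    return known_edges - {current_instance}

# Extracting features for (JKR, wrote, HP1) may still walk the
# (JKR, wrote, HP2) edge, and vice versa:
assert ("JKR", "wrote", "HP2") in visible_edges(("JKR", "wrote", "HP1"))
assert ("JKR", "wrote", "HP1") not in visible_edges(("JKR", "wrote", "HP1"))
```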
Why are some features not prefixed with `ANYREL:`, even though I'm using the `AnyRelFeatureExtractor`? What is the type of the non-prefixed features? I expect that all features are paths starting from the source node but not necessarily ending at the target node, is that right?
You are using both the `AnyRelFeatureExtractor` and the `PraFeatureExtractor`. Features that have no prefix come from the `PraFeatureExtractor`, and are paths connecting the source node to the target node.
Also, it appears that `_part_of` is the inverse of `_has_part`. Is this true? If it is, you need to specify that, or your experiment will not be correct. The model will learn that when the inverse is present, it should predict an edge, and when the inverse isn't present, it won't predict an edge. Then, at test time, you'll either be cheating by using the inverse, or you'll predict nothing.

To fix this, either remove the `_part_of` relation entirely (if it really just duplicates the `_has_part` relation, there is no point in having it in the graph; it will just slow down the code and make learning harder), or specify the inverse relationship in the relation metadata. The first option is definitely preferred.
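A quick way to check whether `_part_of` merely mirrors `_has_part` is to compare the reversed edge sets. This is a toy sketch with a made-up edge list, not code from the repository:

```python
# Sketch: detect whether _part_of just mirrors _has_part in a toy edge
# list. If every _has_part edge has a reversed _part_of twin (and vice
# versa), keeping both relations adds no information.
edges = [
    ("car", "_has_part", "wheel"),
    ("wheel", "_part_of", "car"),
]

has_part = {(s, t) for s, r, t in edges if r == "_has_part"}
part_of = {(t, s) for s, r, t in edges if r == "_part_of"}

print(has_part == part_of)  # True -> _part_of duplicates _has_part
```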
Your comments answer the questions, so I think this should be closed now.
Hi @matt-gardner, I read the explanation above and it was clear, but in my weights.tsv there appears to be this `-@ANY_REL@-` symbol, either inside or at the end of a path feature. Can you explain what this symbol indicates? Here is part of the weights.tsv:

```
ANYREL:-phone_of_u1-phone_of_u1-_cert_of_u2-@ANY_REL@- 1.3428828694307187
ANYREL:-_u1_payto_u2-_u1_payto_u2-@ANY_REL@-cert_of_u1- 1.0217711921277646
ANYREL:-card_of_u1-_card_of_u1-cert_of_u2-@ANY_REL@- 0.8942525084728918
ANYREL:-_cert_of_u1-@ANY_REL@-card_of_u1-_u1_payto_u2- 0.7607560172498669
-_cert_of_u1-_cert_of_card-card_of_u1-_u1_payto_u2- 0.7607560172498669
ANYREL:-_cert_of_u1-_cert_of_card-card_of_u1-@ANY_REL@- 0.7607560172498669
ANYREL:-_cert_of_u1-_cert_of_card-@ANY_REL@-_u1_payto_u2- 0.7607560172498669
ANYREL:-@ANY_REL@-phone_of_u1-_dev_of_u1-dev_of_u1- 0.684886880419521
ANYREL:-@ANY_REL@-phone_of_u2-_cert_of_u2-cert_of_u1- 0.6784595039383237
ANYREL:-_cert_of_u2-@ANY_REL@-u1_payto_u2- 0.675731779371918
-geo_of_u1-geo_of_u1- 0.6612471349346711
ANYREL:-@ANY_REL@-geo_of_u1- 0.6612471349346711
ANYREL:-geo_of_u1-@ANY_REL@- 0.6612471349346711
ANYREL:-phone_of_u2-@ANY_REL@-card_of_u1- 0.6527712367035824
-phone_of_u2-phone_of_card-card_of_u1- 0.6527712367035824
ANYREL:-phone_of_u2-phone_of_card-@ANY_REL@- 0.6527712367035824
-_phone_of_u2-_phone_of_card-cert_of_card-_cert_of_u1- 0.64549966517749
ANYREL:-_phone_of_u2-@ANY_REL@-cert_of_card-_cert_of_u1- 0.64549966517749
```
See section 5.1 here: http://rtw.ml.cmu.edu/emnlp2015_sfe/paper.pdf.
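As a rough illustration of what Section 5.1 describes: an ANYREL feature appears to be a normal path feature with one relation replaced by the `@ANY_REL@` placeholder, so each concrete path yields one abstracted variant per position. A hedged Python sketch (the helper name is made up, and this is not the actual generation code):

```python
# Sketch: abstract a concrete PRA path by substituting @ANY_REL@ for
# each relation in turn, producing one ANYREL: feature per position.
def anyrel_variants(path):
    """path is a list of relation names from a PRA path feature."""
    variants = []
    for i in range(len(path)):
        abstracted = path[:i] + ["@ANY_REL@"] + path[i + 1:]
        variants.append("ANYREL:-" + "-".join(abstracted) + "-")
    return variants

print(anyrel_variants(["phone_of_u2", "phone_of_card"]))
# ['ANYREL:-@ANY_REL@-phone_of_card-', 'ANYREL:-phone_of_u2-@ANY_REL@-']
```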
I did manage to extract features using the configuration file approach, and they are stored in two files, `training_matrix.tsv` and `test_matrix.tsv`.

I've extracted a random observation from the `training_matrix.tsv` file for different relations, but I couldn't understand what the feature representation means. This is an example from the relation `concept:actorstarredinmovie`; the following is a single observation line that I copied into a new file and tried to decompose into a set of features to understand what they look like, but I couldn't. My questions:

- What is the difference between `generalization` and `_generalization`?
- Is the `1.0` redundant in the training observation?
- What does `-#-` mean?
- Is `ANYREL:` something that defines a new feature?
- If the following is a full valid feature (as I assume), what does it represent?