starling-lab / BoostSRL

BoostSRL: "Boosting for Statistical Relational Learning." A gradient-boosting based approach for learning different types of SRL models.
https://starling.utdallas.edu
GNU General Public License v3.0

test file structure #1

Closed achalshah20 closed 6 years ago

achalshah20 commented 7 years ago

The documentation says the file structure should be as follows:

background.txt : Modes
train/ folder :
    train_bk.txt : Pointer to the background file.
    train_facts.txt : Facts
    train_pos.txt : Positive examples
    train_neg.txt : Negative examples
test/ folder :
    test_bk.txt : Pointer to the background file.
    test_facts.txt : Facts
    test_pos.txt : Positive examples
    test_neg.txt : Negative examples
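As a quick sanity check on the layout listed above, a small script can report which of the expected files are missing. This is only a sketch: the file names come from the listing, while the function name `missing_files` and the dataset root path are hypothetical and not part of BoostSRL.

```python
from pathlib import Path

# Relative paths taken from the layout quoted from the documentation.
EXPECTED = [
    "background.txt",
    "train/train_bk.txt",
    "train/train_facts.txt",
    "train/train_pos.txt",
    "train/train_neg.txt",
    "test/test_bk.txt",
    "test/test_facts.txt",
    "test/test_pos.txt",
    "test/test_neg.txt",
]


def missing_files(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

Calling `missing_files("my_dataset")` (a hypothetical directory) returns an empty list when every file in the layout is present.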

But why do you expect pos and neg examples in the test directory? We don't know whether a test sample will be positive or negative!

hayesall commented 7 years ago

Thanks for the question, @achalshah20. We'll try to clarify this in the documentation.

The pos/neg split makes sense during training: we want to fit a decision tree that explains the positive examples while avoiding the negative examples.

During testing, the pos/neg labels are hidden from the model and only used afterwards, when TPR/FPR are calculated for ROC curves. Keeping the examples in separate files states explicitly which ones are positive and which are negative, so those metrics can be calculated reliably.
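To make the role of the two test files concrete, here is a minimal sketch of how TPR and FPR can be computed once each test example has an output score. It is not BoostSRL's own evaluation code. The function names and the 0.5 default threshold are assumptions for illustration, and the scores are assumed to be already collected into two plain Python lists, one per label.

```python
def tpr_fpr(pos_scores, neg_scores, threshold=0.5):
    """Return (TPR, FPR) when scores >= threshold are predicted positive.

    pos_scores: model outputs for examples labeled positive.
    neg_scores: model outputs for examples labeled negative.
    A rate is NaN when its label group is empty.
    """
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    tpr = tp / len(pos_scores) if pos_scores else float("nan")
    fpr = fp / len(neg_scores) if neg_scores else float("nan")
    return tpr, fpr


def roc_points(pos_scores, neg_scores):
    """Return (TPR, FPR) pairs, one per distinct score used as a threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    return [tpr_fpr(pos_scores, neg_scores, t) for t in thresholds]
```

Both rates need examples from their own label group: without known negatives the FPR is undefined, and without known positives the TPR is undefined.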

achalshah20 commented 7 years ago

Thanks. So, if I don't know any label, I will still create 2 files pos and neg and dump all my test cases into one of them, right? I know I can't rely on FPR/TPR in this case.

hayesall commented 7 years ago

> So, if I don't know any label, I will still create 2 files pos and neg and dump all of my test cases into one of them, right?

Exactly. Usually I add them to test_pos.txt, since the output value can roughly be interpreted as "what is the probability of this example being true?"
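A small sketch of that workaround: every unlabeled example goes into test_pos.txt, and test_neg.txt is created empty. The helper name `write_unlabeled_test_set`, the one-example-per-line format, and the sample literal `cancer(alice).` are assumptions for illustration, not taken from the BoostSRL documentation.

```python
from pathlib import Path


def write_unlabeled_test_set(test_dir, examples):
    """Write all examples to test_pos.txt and leave test_neg.txt empty.

    examples: iterable of strings, one example per entry
    (for instance the hypothetical literal "cancer(alice).").
    """
    test_dir = Path(test_dir)
    test_dir.mkdir(parents=True, exist_ok=True)
    # One example per line in the "positive" file.
    (test_dir / "test_pos.txt").write_text("".join(e + "\n" for e in examples))
    # The "negative" file exists but holds no examples.
    (test_dir / "test_neg.txt").write_text("")
```

With this layout the reported TPR/FPR mean nothing, as noted earlier in the thread. Only the per-example output values are of interest.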

Would it be possible to hold out part of your labeled training data first? It is usually worth getting an idea of how the output values are distributed over positive and negative examples. Once you have determined this, you can use the trained model on new data, or train a new model on the entire training set.
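One way to do such a split is sketched below. It assumes the positive and negative examples are available as Python lists. The function name `holdout_split`, the 20% default, and the fixed seed are illustrative choices. The sketch only divides the example lists. How the corresponding facts should be divided depends on the dataset and is not covered here.

```python
import random


def holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle examples and return (train_part, holdout_part).

    A fixed seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Applying it separately to the positive and the negative list, for example `train_pos, test_pos = holdout_split(pos_examples)` and `train_neg, test_neg = holdout_split(neg_examples)`, keeps both labels represented on each side of the split as long as each list is large enough.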

hayesall commented 6 years ago

@achalshah20 Is this answer to your satisfaction? I would like to close this issue if it is resolved.

hayesall commented 6 years ago

Setting this issue as resolved.