megagonlabs / ditto

Code for the paper "Deep Entity Matching with Pre-trained Language Models"
Apache License 2.0

Inconsistent metrics #5

Closed CristhianBoujon closed 4 years ago

CristhianBoujon commented 4 years ago

I'm testing ditto to match my own dataset, so I'm running the following:

CUDA_VISIBLE_DEVICES=0 python train_ditto.py --task catalogo --batch_size 64 --max_len 64 --lr 3e-5 --n_epochs 10 --finetuning --lm distilbert --fp16

The metrics reported during training are really good, around accuracy=0.990.

As a sanity check and to do error analysis, I ran predictions over the test set (in JSON Lines format):

CUDA_VISIBLE_DEVICES=0 python matcher.py --task catalogo --input_path input/to_be_evaluate.jsonl --output_path output/output_catalogo.jsonl --lm distilbert --use_gpu --fp16 --checkpoint_path checkpoints/

My dataset is balanced (around 50% positives and 50% negatives), so based on 0.99 accuracy I expected almost the same number of positives and negatives in the output file when running the following commands:

$ ditto# cat output/output_catalogo.jsonl | grep '"match": "1"' | wc -l 
139
$ ditto# cat output/output_catalogo.jsonl | grep '"match": "0"' | wc -l 
5149
$ ditto# cat data/cboujon/test.txt | grep -P "\t0" | wc -l 
2675
$ ditto# cat data/cboujon/test.txt | grep -P "\t1" | wc -l 
2613
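The grep counts above can also be reproduced in Python, which avoids pattern-matching pitfalls. A minimal sketch, assuming each output line is a JSON object with a "match" field (as in the grep patterns above) and each dataset line is tab-separated with the label in the last field:

```python
import json
from collections import Counter

def count_matches(jsonl_path):
    """Count predicted labels in a matcher output file (one JSON object per line)."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            counts[str(record["match"])] += 1
    return counts

def count_gold_labels(txt_path):
    """Count gold labels in a dataset file (tab-separated, label in the last field)."""
    counts = Counter()
    with open(txt_path) as f:
        for line in f:
            label = line.rstrip("\n").split("\t")[-1]
            counts[label] += 1
    return counts
```

Comparing count_matches("output/output_catalogo.jsonl") against count_gold_labels("data/cboujon/test.txt") makes a skew like 139 vs. 2613 immediately visible.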

These numbers show that either the accuracy is not 0.990, or I can't see where my error is. Here are the datasets and outputs.

The task config is defined as:

{
  "name": "catalogo",
  "task_type": "classification",
  "vocab": ["0", "1"],
  "trainset": "data/cboujon/train.txt",
  "validset": "data/cboujon/valid.txt",
  "testset": "data/cboujon/test.txt"
}
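One way to rule out path problems is to validate the task entry programmatically before training. A minimal sketch, assuming the config file is a JSON list of task dicts like the one above (the file name configs.json and the helper itself are assumptions, not part of ditto's API):

```python
import json
import os

def check_task_config(config_path, task_name):
    """Load one task entry from a JSON list of task dicts and verify
    that its train/valid/test dataset files actually exist on disk."""
    with open(config_path) as f:
        configs = json.load(f)
    task = next((c for c in configs if c["name"] == task_name), None)
    if task is None:
        raise KeyError(f"no task named {task_name!r} in {config_path}")
    for key in ("trainset", "validset", "testset"):
        if not os.path.isfile(task[key]):
            raise FileNotFoundError(f"{key} missing: {task[key]}")
    return task
```

Calling check_task_config("configs.json", "catalogo") fails loudly if any of the three dataset files is missing or misnamed.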
oi02lyl commented 4 years ago

I think the commands are okay, except that you need to add the --save_model flag during training. With that flag on, training saves a file named "catalogo_lm=distilbert_da=None_dk=None_su=False_size=None_id=0_dev.pt" containing the checkpoint with the highest validation F1.

I ran

mv catalogo_lm=distilbert_da=None_dk=None_su=False_size=None_id=0_dev.pt checkpoints/catalogo.pt

then your command of running the matcher. I got

cat output/output_catalogo.jsonl | grep '"match": "1"' | wc -l 
2573
cat output/output_catalogo.jsonl | grep '"match": "0"' | wc -l 
2715

I also eyeballed some of the prediction results, and they seemed to be correct.

CristhianBoujon commented 4 years ago

I have re-run it and the metrics seem to make sense now. But if I didn't save the model previously, what model did the matcher.py process load?

oi02lyl commented 4 years ago

I think it either initializes the model randomly or loads the pre-trained language model without fine-tuning. We should make it throw an exception instead. Thanks for pointing this out!
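The suggested guard could look as follows. This is a hypothetical sketch, not ditto's actual loading code, assuming checkpoints are stored as checkpoints/<task>.pt as in the rename step earlier in this thread:

```python
import os

def resolve_checkpoint_or_fail(checkpoint_dir, task_name):
    """Fail loudly when the expected checkpoint file is missing, instead of
    silently falling back to randomly initialized or non-fine-tuned weights.

    Hypothetical helper following the checkpoints/<task>.pt convention."""
    ckpt_file = os.path.join(checkpoint_dir, f"{task_name}.pt")
    if not os.path.isfile(ckpt_file):
        raise FileNotFoundError(
            f"No checkpoint found at {ckpt_file}; "
            "re-run training with --save_model first."
        )
    return ckpt_file
```

With a check like this, running matcher.py without a saved model raises immediately rather than producing predictions from untrained weights.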

CristhianBoujon commented 4 years ago

Great! You can check #6.