larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Document test file format #242

Open Ramblurr opened 7 years ago

Ramblurr commented 7 years ago

Using the genetic algorithm to create configurations is very helpful, but it would be nice if the test file format were documented.

Looking at this example: https://github.com/larsga/Duke/blob/dda63901a144a624ed7d68e3a17cc9403d77a70e/doc/example-data/countries-test.txt

It seems the format is

What is the last value?

Ramblurr commented 7 years ago

I did some more diving into the documentation and found the format specified in the Tuning Guide.

However, when I look at the source, I see that the testfile is deprecated in favor of a LinkDatabase, which does not seem to be documented. Perhaps this is related to #51?