scify / JedAIToolkit

An open source, high scalability toolkit in Java for Entity Resolution.
http://jedai.scify.org
Apache License 2.0

Null pointer when trying to load data using latest release #19

Open yeikel opened 5 years ago

yeikel commented 5 years ago

I am using the following release

I am trying jedaiDesktopApp-1.1.jar with the following datasets (from the samples):

abtBuyIdDuplicates (for D1), abtBuyProfiles (for the ground-truth file)

image

But I get the following error:

image

I also tried with CSV files and got the same error.

gpapadis commented 5 years ago

Hi! The serialized datasets are incompatible with older versions, due to a change in some Java classes. Try using the latest version and let me know if the problem is fixed.

yeikel commented 5 years ago

Hi! I am using the latest release.

I also tried CSV files and received the same errors.

leots commented 5 years ago

Hello yeikel,

Can you please tell me exactly which CSV files you tried, so I can test them?

In any case, the latest release we have on GitHub right now (the one you linked above) is not the latest version of the code, which is why @gpapadis refers to it as an older version.

The current version of the code is still a work in progress, which is why we haven't put up a full "release" on GitHub yet. However, you can find a build of it here: https://drive.google.com/open?id=1W-ffcQZWnw0MIWluaBzyApsa7nqq5wWB, or build it yourself from the repositories.

This version should read the serialized files directly, and it also allows you to configure some options for how to read the CSV, such as the delimiter, which could fix the problem you are having with the CSV files too.
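To illustrate why the delimiter setting matters, here is a minimal, self-contained sketch. It is not JedAI code: the class `DelimiterSketch`, the method `splitLine`, and the sample row are made up for illustration, and the split is deliberately naive (no handling of quoted fields). It only shows that a row parsed with the wrong delimiter collapses into a single column, which a reader expecting several attributes may not handle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not part of the JedAI API.
final class DelimiterSketch {

    // Splits one CSV line on a literal delimiter, keeping trailing empty fields.
    // Naive on purpose: quoted fields containing the delimiter are not handled.
    static List<String> splitLine(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Made-up sample row that uses ';' as its separator.
        String row = "1;Sony Bravia;499.00";

        // Wrong delimiter: the whole row comes back as one field.
        System.out.println(splitLine(row, ",").size()); // prints 1

        // Matching delimiter: the row splits into its three fields.
        System.out.println(splitLine(row, ";").size()); // prints 3
    }
}
```

If a CSV file fails to load, comparing the column count you get against the number of attributes you expect is a quick way to spot a delimiter mismatch.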

yeikel commented 5 years ago

@leots Where can I find sample CSV files? And their format?

Unless I am using the wrong files or configuration, the serialized samples included in the documentation fail:

image image

gpapadis commented 5 years ago

The problem with the serialized datasets is that you are using the ground-truth file (abtBuyIdDuplicates) in the place of "Entity Profiles D1" and the profiles file (abtBuyProfiles) as the "Ground-truth file". It should be the other way round. CSV datasets are available here: https://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution

leots commented 5 years ago

One more thing I can see is that you have selected clean-clean entity resolution but have not selected a second entity profiles dataset. Make sure you either select dirty ER or add a second dataset.
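The checks discussed in this thread can be summarised in a small pre-flight sketch. This is not JedAI code: `ErSetupCheck`, `ErMode`, and `validate` are hypothetical names, and the file-name test is only a heuristic that assumes the naming of the sample files mentioned above (`...Profiles` for entity profiles, `...Duplicates` for the ground truth).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JedAI API. It mirrors the advice in this thread.
final class ErSetupCheck {

    enum ErMode { DIRTY, CLEAN_CLEAN }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Returns human-readable problems; an empty list means nothing suspicious was found.
    static List<String> validate(ErMode mode, String profilesD1, String profilesD2, String groundTruth) {
        List<String> problems = new ArrayList<>();

        if (isBlank(profilesD1)) {
            problems.add("Entity Profiles D1 is not set");
        } else if (profilesD1.contains("Duplicates")) {
            // Assumption: sample ground-truth files carry "Duplicates" in their name.
            problems.add("D1 looks like a ground-truth file: " + profilesD1);
        }

        if (isBlank(groundTruth)) {
            problems.add("Ground-truth file is not set");
        } else if (groundTruth.contains("Profiles")) {
            // Assumption: sample entity profile files carry "Profiles" in their name.
            problems.add("Ground-truth file looks like an entity profiles file: " + groundTruth);
        }

        // Clean-clean ER compares two datasets, so a second profiles dataset is required.
        if (mode == ErMode.CLEAN_CLEAN && isBlank(profilesD2)) {
            problems.add("Clean-clean ER selected but no second entity profiles dataset given");
        }
        return problems;
    }

    public static void main(String[] args) {
        // The setup reported in this issue: files swapped, clean-clean ER, no D2.
        validate(ErMode.CLEAN_CLEAN, "abtBuyIdDuplicates", null, "abtBuyProfiles")
                .forEach(System.out::println);
    }
}
```

With the files swapped back and dirty ER selected, `validate(ErMode.DIRTY, "abtBuyProfiles", null, "abtBuyIdDuplicates")` returns an empty list.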