halestudio / hale

(Spatial) data harmonisation with hale»studio (formerly HUMBOLDT Alignment Editor)

Bad performance for Merge with auto-detect enabled #626

Open stempler opened 6 years ago

stempler commented 6 years ago

The auto-detect feature in the Merge configuration seems to cause very poor performance during the transformation. The problem appears to have been introduced in version 3.2.0 with this PR: https://github.com/halestudio/hale/pull/285.

The problem occurs when a significant number of instances are merged together within a single Merge, because the comparisons done for auto-detect are O(n²) in the number of merged instances. The number of attributes that are compared is an additional factor.
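To give a feeling for the quadratic growth, here is a rough back-of-the-envelope sketch in Python. It is not hale's actual code: the function name `pairwise_comparisons` is hypothetical, and it assumes that every pair of instances in a merge group is compared on every attribute, which may overstate what the implementation really does.

```python
def pairwise_comparisons(n_instances: int, n_attributes: int) -> int:
    """Estimate attribute comparisons for one merge group.

    Assumption (not taken from the hale source): each unordered pair of
    instances in the group is compared once on every attribute.
    """
    pairs = n_instances * (n_instances - 1) // 2  # n choose 2
    return pairs * n_attributes


if __name__ == "__main__":
    # Group sizes are illustrative; 10 attributes is an arbitrary choice.
    for n in (10, 100, 1000):
        print(n, pairwise_comparisons(n, n_attributes=10))
```

Under these assumptions, a single group of 1000 merged instances with 10 attributes already needs about 5 million comparisons, while a group of 100 needs about 50 thousand.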

Example from a medium-sized data set (~100k instances, up to ~1000 instances merged together):

Right now the workaround is to explicitly configure properties in the Merge configuration and leave the auto-detect feature turned off.
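One plausible reason why explicitly configured properties avoid the problem is that instances can then be grouped by a key in a single pass instead of being compared pairwise. The sketch below only illustrates that idea in Python; `group_by_properties` and the sample records are hypothetical and do not reflect hale's implementation or configuration format.

```python
from collections import defaultdict


def group_by_properties(instances, key_properties):
    """Group instances by the values of explicitly named properties.

    Each instance is visited once and placed into a bucket by its key,
    so the cost grows linearly with the number of instances.
    """
    groups = defaultdict(list)
    for instance in instances:
        # Build a hashable key from the configured properties only.
        key = tuple(instance.get(name) for name in key_properties)
        groups[key].append(instance)
    return dict(groups)


# Hypothetical sample data: two records share id 1 and would be merged.
instances = [
    {"id": 1, "name": "A", "value": 10},
    {"id": 1, "name": "A", "value": 20},
    {"id": 2, "name": "B", "value": 30},
]
groups = group_by_properties(instances, ["id"])
```

Here `groups` maps each key tuple to the list of instances that would be merged together, without any instance-to-instance comparison.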

thorsten-reitz commented 6 years ago

Add a comment to make clear that auto-merge affects performance negatively.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.