org-SCAN / website

Project website

[NEW FEATURE]Choose which identity matching algorithm has to be run #479

Open lduf opened 4 months ago

lduf commented 4 months ago

Description of the desired solution

Our first algorithm used the Levenshtein distance between different fields. It had the advantage of comparing all fields, not just the best descriptive value (BDV), and of applying a weighted comparison. We then worked on a solution that uses phonetics to compare items, but it only compares the BDV. For now, we choose in the code which algorithm to run, and once it is chosen, that is the only one that runs. It would be nice to let the user choose the algorithm.
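To make the first approach concrete, here is a minimal sketch of a weighted Levenshtein comparison across several fields. It is not the project's actual code: the field names, the weights, and the normalisation by the longer string are all assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical fields and weights, not taken from the project.
FIELD_WEIGHTS = {"name": 0.6, "birth_date": 0.3, "city": 0.1}


def weighted_similarity(item_a: dict, item_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    score = sum(weight * field_similarity(item_a.get(field, ""), item_b.get(field, ""))
                for field, weight in weights.items())
    return score / total
```

A phonetic variant would, under the same assumptions, take the same two items but look only at the BDV field.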

In duplicate, we could imagine a dropdown listing the algorithms, so that users can choose which one to run on their team's data.

Resolution path

Off the top of my head:

This is just a resolution idea that might need to be discussed. The duplicate algorithms should all have the same structure (same signature, same return type, ...), so that calling one or another is not a problem.
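One possible way to enforce a shared structure is a small registry keyed by algorithm name, where every entry has the same signature and return type. This is only a sketch: the names `ALGORITHMS`, `register` and `run_matching`, the `bdv` key, and the placeholder bodies (which use `difflib.SequenceMatcher` instead of the real Levenshtein and phonetic code) are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Mapping

Item = Mapping[str, str]
# Shared contract: two items in, one similarity score in [0, 1] out.
SimilarityAlgorithm = Callable[[Item, Item], float]

ALGORITHMS: Dict[str, SimilarityAlgorithm] = {}


def register(name: str) -> Callable[[SimilarityAlgorithm], SimilarityAlgorithm]:
    """Decorator that makes an algorithm selectable by name."""
    def decorator(func: SimilarityAlgorithm) -> SimilarityAlgorithm:
        ALGORITHMS[name] = func
        return func
    return decorator


@register("levenshtein")
def all_fields_placeholder(a: Item, b: Item) -> float:
    # Placeholder for the all-fields comparison (unweighted here).
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    return sum(SequenceMatcher(None, a.get(k, ""), b.get(k, "")).ratio()
               for k in keys) / len(keys)


@register("phonetic")
def bdv_placeholder(a: Item, b: Item) -> float:
    # Placeholder for the BDV-only comparison; "bdv" is a hypothetical key.
    return SequenceMatcher(None, a.get("bdv", "").lower(),
                           b.get("bdv", "").lower()).ratio()


def run_matching(algorithm_name: str, a: Item, b: Item) -> float:
    """Run the algorithm chosen by the user, e.g. from a dropdown value."""
    try:
        algorithm = ALGORITHMS[algorithm_name]
    except KeyError:
        raise ValueError(f"Unknown algorithm {algorithm_name!r}; "
                         f"choose one of {sorted(ALGORITHMS)}") from None
    return algorithm(a, b)
```

With this shape, the dropdown options could simply be `sorted(ALGORITHMS)`, and the calling code would not need to know which algorithm it is running.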

If the request is associated with a problem, please specify it.

Additional information

github-actions[bot] commented 4 months ago

AI-Assisted Response:

Summary

The issue describes an existing system that compares fields using one of two algorithms: one based on Levenshtein distance, which compares all fields with a weighted comparison, and another based on phonetics, which only compares the Best Descriptive Value (BDV). Currently the algorithm must be selected manually in the code; the desired solution is to let users choose the algorithm dynamically through a user interface.

Desired Solution

The user wants to enhance the system by allowing end-users to select the comparison algorithm of their choice using a dropdown menu in the interface. This feature would be part of a team's settings and could affect how duplicates are identified within the team's data.

Proposed Steps for Implementation

Considerations

Algorithms to be used in this feature should conform to the same structure in terms of method signatures and return types for interchangeability.

Issue Category

This is a feature request.

@lduf's Task Clarification Questions

  1. How should the system handle scenarios where the chosen algorithm might not be suitable for certain datasets or fields?
  2. What mechanisms do you envision to ensure that new algorithms conform to the required structure?
  3. Could you detail the scenarios or use cases where you see this dynamic algorithm selection being most beneficial?
  4. Do you have any preference or specification for how the new column in the 'team' table should be structured, particularly regarding data integrity and indexing?
  5. Are there any performance considerations or limitations that should be taken into account when allowing users to switch algorithms on the fly?
  6. Have you considered any fallback mechanisms or user guidance in case an algorithm results in poor duplication detection for their data set?
  7. Would you like the feature to suggest the most suitable algorithm to the user based on certain data characteristics, or should it be entirely a user-driven choice?

create-issue-branch[bot] commented 3 months ago

Branch feature/issue-479-_NEW_FEATURE_Choose_which_identity_matching_algorithm_has_to_be_run created!

vqlion commented 3 months ago

🌍 #479 should implement new tests to check:

* same items have high similarity

* two very different items have low similarity
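A rough sketch of what these two tests could look like. The `similarity` function is a stand-in built on `difflib.SequenceMatcher`, not one of the project's algorithms, and the thresholds and sample values are assumptions to be tuned per algorithm.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer returning a value in [0, 1]; swap in the real algorithm."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical thresholds, to be agreed on for each algorithm.
HIGH_THRESHOLD = 0.9
LOW_THRESHOLD = 0.3


def test_same_items_have_high_similarity():
    assert similarity("Jean Dupont", "Jean Dupont") >= HIGH_THRESHOLD


def test_very_different_items_have_low_similarity():
    assert similarity("Jean Dupont", "xyz 12345") <= LOW_THRESHOLD
```

If every algorithm shares the same signature, the same two tests could be run against each of them.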