[NEW FEATURE] Choose which identity matching algorithm has to be run

lduf commented 4 months ago

Description of the desired solution

We had a first algo that used the Levenshtein distance between different fields. This method had the advantage of comparing all fields, not just the best descriptive value (BDV), and of applying a weighted comparison. Then we worked on a solution that uses phonetics to compare the different items, but this method only compares the BDV. For now, it's up to us to choose (in the code) which algo to run, and once we've chosen an algo, that's the one that's run, period. It would be nice to let the user choose the algo.
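To make the first approach concrete, here is a minimal sketch of a weighted, field-by-field Levenshtein comparison. It is written in Python purely for illustration; the function names, the dictionary representation of an item, and the weights are assumptions and not the project's actual code.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Normalise the distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty values count as identical.
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def weighted_similarity(item_a: Dict[str, str], item_b: Dict[str, str],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-field similarities over the weighted fields."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    score = sum(
        weight * field_similarity(item_a.get(field, ""), item_b.get(field, ""))
        for field, weight in weights.items()
    )
    return score / total_weight
```

Calling `weighted_similarity(a, b, {"name": 2.0, "city": 1.0})` would let a name mismatch count twice as much as a city mismatch, which is the kind of weighting the BDV-only phonetic approach cannot express.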

In duplicate, we could imagine a dropdown listing the algos, where the user chooses which algo to run on their team's data.

Resolution path

Off the top of my head:

Create a list of algorithms (extend the ListControl model)
Associate the class of each algorithm with its name in the list
Update the DB table that stores the duplicate results so it also stores the algorithm used (store the uuid, not the name 😉)
Create the front end with the forms.form dropdown component
When the user selects the algo to run, update the chosen duplicate algorithm in the DB (it might be stored in the team table by adding a new column: selected duplicate algorithm)
Update the duplicate launcher to check which algo has to be run
Create feature tests 😉

This is just a resolution idea that might need to be discussed. Duplicate algorithms should all have the same structure (same signature, same return type, ...) so that calling one or another is not a problem.
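As a rough illustration of the resolution path above, the sketch below keeps a registry of algorithms keyed by UUID and lets a launcher pick the one selected for the team. Everything here is hypothetical: the two placeholder algorithms, the fixed UUIDs, the fallback to a default, and the `launch_duplicate_check` name are assumptions, and Python is used only to show the shape of the idea.

```python
import uuid
from typing import Callable, Dict, NamedTuple, Optional

Item = Dict[str, str]
# Shared signature: two items in, similarity score in [0, 1] out.
MatchFn = Callable[[Item, Item], float]


class Algorithm(NamedTuple):
    name: str
    run: MatchFn


def exact_match(a: Item, b: Item) -> float:
    """Placeholder algorithm: 1.0 if all fields are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def name_only_match(a: Item, b: Item) -> float:
    """Placeholder algorithm: compares a single hypothetical 'name' field."""
    same = a.get("name", "").lower() == b.get("name", "").lower()
    return 1.0 if same else 0.0


# Hypothetical fixed UUIDs; renaming an algorithm never changes its key.
EXACT_ID = uuid.UUID("00000000-0000-0000-0000-000000000001")
NAME_ONLY_ID = uuid.UUID("00000000-0000-0000-0000-000000000002")

REGISTRY: Dict[uuid.UUID, Algorithm] = {
    EXACT_ID: Algorithm("Exact match", exact_match),
    NAME_ONLY_ID: Algorithm("Name only", name_only_match),
}
DEFAULT_ID = EXACT_ID


def launch_duplicate_check(a: Item, b: Item,
                           selected_id: Optional[uuid.UUID]) -> dict:
    """Run the team's selected algorithm, or the default if unset or unknown."""
    algorithm_id = selected_id if selected_id in REGISTRY else DEFAULT_ID
    algorithm = REGISTRY[algorithm_id]
    # The UUID travels with the result so it can be stored alongside it.
    return {"algorithm_id": str(algorithm_id), "similarity": algorithm.run(a, b)}
```

Because every entry shares the same signature, the launcher never needs to know which algorithm it is calling, and the dropdown could be populated directly from the registry's names.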

github-actions[bot] commented 4 months ago

AI-Assisted Response:

Summary

The issue describes a system that detects duplicates with one of two algorithms: one based on Levenshtein distance, which compares all fields using a weighted comparison, and one based on phonetic comparison, which only compares the Best Descriptive Value (BDV). The algorithm currently has to be selected in the code; the desired solution is to let users choose it through the user interface.

Desired Solution

The user wants to enhance the system by allowing end-users to select the comparison algorithm of their choice using a dropdown menu in the interface. This feature would be part of a team's settings and could affect how duplicates are identified within the team's data.

Proposed Steps for Implementation

Implement a list of algorithms extending the ListControl model.
Map algorithm classes to their names in the list.
Modify the database table that stores the duplicate results so that it also stores the UUID of the algorithm used.
Add a dropdown component to the front-end using the forms.form dropdown element.
Persist the chosen algorithm when a user makes a selection, possibly in the 'team' table by adding a new column for the chosen algorithm.
Modify the duplicate detection launcher to determine which algorithm should be executed based on user selection.
Develop feature tests to verify the new functionality.
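The storage-related steps above could look roughly like the following sketch, which uses Python's built-in `sqlite3` module only as a stand-in for the real database. The table names, column names, and helper functions are assumptions made for illustration; the thread only specifies that the team stores a selected algorithm and that results store the algorithm's UUID.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical tables for algorithms, teams and duplicate results."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection
    conn.executescript("""
        CREATE TABLE algorithms (
            id   TEXT PRIMARY KEY,   -- UUID stored as text
            name TEXT NOT NULL       -- display name, free to change later
        );
        CREATE TABLE teams (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            selected_duplicate_algorithm TEXT REFERENCES algorithms(id)
        );
        CREATE TABLE duplicates (
            id           INTEGER PRIMARY KEY,
            team_id      INTEGER NOT NULL REFERENCES teams(id),
            similarity   REAL NOT NULL,
            algorithm_id TEXT NOT NULL REFERENCES algorithms(id)
        );
    """)


def select_algorithm(conn: sqlite3.Connection, team_id: int,
                     algorithm_id: str) -> None:
    """Persist the algorithm chosen in the dropdown for a team."""
    conn.execute(
        "UPDATE teams SET selected_duplicate_algorithm = ? WHERE id = ?",
        (algorithm_id, team_id),
    )


def record_result(conn: sqlite3.Connection, team_id: int,
                  similarity: float) -> str:
    """Store a duplicate result together with the UUID of the algorithm used."""
    row = conn.execute(
        "SELECT selected_duplicate_algorithm FROM teams WHERE id = ?",
        (team_id,),
    ).fetchone()
    if row is None or row[0] is None:
        raise ValueError("team has no duplicate algorithm selected")
    conn.execute(
        "INSERT INTO duplicates (team_id, similarity, algorithm_id) "
        "VALUES (?, ?, ?)",
        (team_id, similarity, row[0]),
    )
    return row[0]
```

Storing the UUID instead of the name means an algorithm can be renamed in the list without invalidating older duplicate results, and the foreign key prevents a team from pointing at an algorithm that does not exist.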

Considerations

Algorithms to be used in this feature should conform to the same structure in terms of method signatures and return types for interchangeability.
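One possible way to enforce this shared structure is an abstract base class that every algorithm must subclass. The sketch below is an assumption about how that could look, shown in Python for illustration; the class names, the `compare` method, and the score range check are hypothetical and not taken from the project.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DuplicateAlgorithm(ABC):
    """Hypothetical base class: every algorithm exposes the same compare()."""

    name: str = "unnamed"

    @abstractmethod
    def compare(self, item_a: Dict[str, str], item_b: Dict[str, str]) -> float:
        """Return a similarity score between 0.0 and 1.0."""

    def checked_compare(self, item_a: Dict[str, str],
                        item_b: Dict[str, str]) -> float:
        """Call compare() and verify the return type and range at runtime."""
        score = self.compare(item_a, item_b)
        if not isinstance(score, float) or not 0.0 <= score <= 1.0:
            raise ValueError(f"{self.name} returned an invalid score: {score!r}")
        return score


class ExactAlgorithm(DuplicateAlgorithm):
    """Placeholder algorithm used only to demonstrate the contract."""

    name = "exact"

    def compare(self, item_a: Dict[str, str], item_b: Dict[str, str]) -> float:
        return 1.0 if item_a == item_b else 0.0


class BrokenAlgorithm(DuplicateAlgorithm):
    """Forgets to implement compare(), so it cannot be instantiated."""

    name = "broken"
```

With this layout, instantiating `BrokenAlgorithm()` raises a `TypeError`, so an algorithm that does not follow the contract fails early instead of at the moment a user selects it.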

Issue Category

This is a feature request.

@lduf's Task Clarification Questions

How should the system handle scenarios where the chosen algorithm might not be suitable for certain datasets or fields?
What mechanisms do you envision to ensure that new algorithms conform to the required structure?
Could you detail the scenarios or use cases where you see this dynamic algorithm selection being most beneficial?
Do you have any preference or specification for how the new column in the 'team' table should be structured, particularly regarding data integrity and indexing?
Are there any performance considerations or limitations that should be taken into account when allowing users to switch algorithms on the fly?
Have you considered any fallback mechanisms or user guidance in case an algorithm gives poor duplicate detection on their data set?
Would you like the feature to suggest the most suitable algorithm to the user based on certain data characteristics, or should it be entirely a user-driven choice?

create-issue-branch[bot] commented 3 months ago

Branch feature/issue-479-_NEW_FEATURE_Choose_which_identity_matching_algorithm_has_to_be_run created!

vqlion commented 3 months ago

🌍 #479 should implement new tests to check:
* identical items have high similarity
* two very different items have low similarity
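The two checks above could be sketched as follows. This is only an illustration in Python: the `similarity` function is a stand-in built on the standard library's `difflib.SequenceMatcher` and is not one of the project's algorithms, and the `HIGH` and `LOW` thresholds are arbitrary assumptions that would need tuning per algorithm.

```python
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical thresholds; real values depend on the algorithm under test.
HIGH = 0.9
LOW = 0.3


def similarity(item_a: Dict[str, str], item_b: Dict[str, str]) -> float:
    """Stand-in algorithm: average per-field ratio over all fields."""
    fields = sorted(set(item_a) | set(item_b))
    if not fields:
        return 1.0
    ratios = [
        SequenceMatcher(None, item_a.get(f, ""), item_b.get(f, "")).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


def test_identical_items_have_high_similarity() -> None:
    item = {"name": "Anna", "birthplace": "Lyon"}
    assert similarity(item, dict(item)) >= HIGH


def test_very_different_items_have_low_similarity() -> None:
    item_a = {"name": "Anna", "birthplace": "Lyon"}
    item_b = {"name": "Boris", "birthplace": "Graz"}
    assert similarity(item_a, item_b) <= LOW
```

Running the same two tests against every registered algorithm would give a basic sanity check that each one behaves sensibly before it is offered in the dropdown.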

org-SCAN / website