nfdi4plants / Swate

Excel Add-In for annotation of experimental data and computational workflows.
https://swate-alpha.nfdi4plants.org
MIT License

[BUG] Improve template search functionality #490

Closed shiltemann closed 2 months ago

shiltemann commented 2 months ago

When searching templates in ARCitect, the search results are sometimes suboptimal.

OS and framework information:

Describe the bug

Example:

  1. Searching by template name:
    • There are multiple templates in the primary list titled ENA - XXXX
    • Typing ENA in the search bar turns up 0 results
    • Typing ENA - in the search bar turns up just one of them

Screenshots example 1, template search

There are several templates named ENA - ... in the list of templates:

image

However when we enter ENA in the search box we get no results:

image

And when we type ENA - in the search bar we get one of them as a result:

image

Brilator commented 2 months ago

@Freymaurer can you move this to Swate?

Freymaurer commented 2 months ago

Hey! Could you pls open two issues for this? The two problems you describe are not related to each other. Feel free to keep this one for template search and open another one for term search.

shiltemann commented 2 months ago

done :+1:

Freymaurer commented 2 months ago

The reason behind this behavior is our search algorithm. We use the Sørensen-Dice coefficient on string bigrams. A lot of fancy words for "we look for similarity, and the more similar the two strings we compare, the higher the score". To filter out unfit results we apply a threshold. In your example, "ENA - " is actually more similar to "SRA - Sequencing" than to the longer ENA names. For example, "ENA - Gene promoter annotated sequence" has ~30 mismatched characters, while "SRA - Sequencing" has only 11. This very flexible calculation allows searching for semi-similar results. To avoid the issues you describe, we now adjust the score as follows:

[!NOTE] Threshold is 0.3

Image
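
For readers who want to reproduce the original (unadjusted) behavior described above, here is a minimal Python sketch of Sørensen-Dice scoring on character bigrams with a 0.3 threshold. It is not Swate's actual code: the function names `bigrams`, `dice` and `search`, the case-insensitive comparison, and the multiset treatment of repeated bigrams are all assumptions made for illustration. It also does not include the score adjustment mentioned above.

```python
from collections import Counter

THRESHOLD = 0.3  # threshold value quoted in the thread


def bigrams(text: str) -> Counter:
    """Multiset of overlapping character bigrams (lowercased; an assumption)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice(a: str, b: str) -> float:
    """Sørensen-Dice coefficient: 2 * |A ∩ B| / (|A| + |B|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


def search(query: str, names: list[str], threshold: float = THRESHOLD):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(dice(query, name), name) for name in names]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


if __name__ == "__main__":
    templates = ["ENA - Gene promoter annotated sequence", "SRA - Sequencing"]
    for q in ("ENA", "ENA - "):
        print(repr(q), search(q, templates))
```

Under these assumptions, the short query "ENA - " shares 4 of its 5 bigrams with "SRA - Sequencing" (score 8/20 = 0.4) and all 5 with the long ENA name, but the long name's 37 bigrams pull that score down to 10/42, below the threshold. That matches the pattern reported in this issue: longer names are penalized by the denominator even when the query is a prefix of them.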