jasp-stats / jasp-issues

This repository is solely meant for reporting of bugs, feature requests and other issues in JASP.

[Feature Request]: Calculate the (Damerau-)Levenshtein distance in JASP #2292

Open MenimadimAnna opened 1 year ago

MenimadimAnna commented 1 year ago

Description

Calculate the (Damerau-)Levenshtein distance in JASP

Purpose

No response

Use-case

No response

Is your feature request related to a problem?

No response

Is your feature request related to a JASP module?

No response

Describe the solution you would like

No response

Describe alternatives that you have considered

No response

Additional context

Sorry, I'm pretty new to statistics and JASP altogether, so I don't quite know how to fill in the boxes above.

What I'm after is a way of calculating a string metric in JASP: either the Levenshtein distance or the Damerau-Levenshtein distance. I was told in the JASP forum that this doesn't exist and that I could stick my request here.

I know there's an R package that has this (the stringdist() function). The stringdist() function apparently takes two strings as arguments and returns the Levenshtein distance between them.

Would be great if we could have a way of doing that in JASP too. Many thanks!

EJWagenmakers commented 1 year ago

Since there is a way to do this in the R console (see JASP forum), I would think it makes sense to implement this only in the context of a larger set of related functions; otherwise I am not sure where to put this particular measure...

tomtomme commented 8 months ago

@MenimadimAnna In what kind of analysis do you want to embed this? I read:

"The (Damerau-)Levenshtein distance is useful for various applications that involve comparing or matching strings, such as spell checking (OCR), plagiarism / fraud detection, DNA analysis, and natural language processing." It can also be applied in qualitative text analysis, to measure the similarity or difference between texts, words, or concepts.

If it is for fraud detection, it might fit in JASP's Audit module.

MenimadimAnna commented 8 months ago

Hi

Sorry for the late reply. JASP doesn't have the R package stringdist available so it doesn't seem like I can do this in the R console. And anyway, the Levenshtein distance does not include transposition of adjacent characters, like the Damerau-Levenshtein distance does.

I am doing this in a linguistics research project where I want a metric for the distance between the standardised spelling of a set of words and the way my participants spelt the same set of words. So, for instance the standardised way of spelling the word for trousers in Norwegian is , but participants have spelt it as boksa, boxa, eds, bocse, bøkse, boukse. You get the gist.

Would the JASP Audit module be relevant in this case?

I haven't checked yet whether there are any actual examples of transposition in the dataset, but theoretically, it makes sense that there would be instances of it.

Edit: There are at least two instances of transposition in the data that need to be taken into account.
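For reference, the spelling comparison described above can be sketched outside JASP in plain Python. This is only a minimal sketch under stated assumptions, not something JASP provides: `edit_distance` is a hypothetical helper, `bukse` is assumed here as the standardised spelling, and `buske` is an invented misspelling used only to show a transposition. With `transpositions=True` the function computes the optimal string alignment variant (often called the restricted Damerau-Levenshtein distance), which counts a swap of two adjacent characters as a single edit.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance between a and b.

    With transpositions=True, a swap of two adjacent characters also costs 1
    (optimal string alignment, i.e. restricted Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


standard = "bukse"  # assumed standardised form, for illustration only
spellings = ["boksa", "boxa", "bocse", "bøkse", "boukse"]

for word in spellings:
    print(word, edit_distance(standard, word, transpositions=True))

# Hypothetical transposition: "ks" written as "sk"
print(edit_distance(standard, "buske"))                       # 2
print(edit_distance(standard, "buske", transpositions=True))  # 1
```

Python strings are compared per Unicode code point, so `ø` counts as one character as long as the data use a consistent normalisation form.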

tomtomme commented 8 months ago

@MenimadimAnna This sounds like text analysis to me and would be a new stand-alone module. Text analysis in general is tracked here: https://github.com/jasp-stats/jasp-issues/issues/2398

I can give no timeline on this, sorry.

MenimadimAnna commented 8 months ago

Thank you. I see, so it's currently not possible to compute in JASP then.