sa-tre / satre-team

A project management repo for the SATRE project

Questionnaire summary: Decide process for handling free text (answers from a definable set) #12

Open edwardchalstrey1 opened 1 year ago

edwardchalstrey1 commented 1 year ago

Handling free-text questions where answers come from a definable set

- 7.a Which non-desktop interfaces are important to you?
- 8.a Which programming languages are important to you?
- 9.a Which repositories are important to you?
- 10.a Which commercially licensed software is important to you?
- 23.a Are there sensitivity systems that you think are important or that you use?

Notes

JimMadge commented 1 year ago

For free text where we expect (lists of) answers drawn from a set, it feels like we should count the responses.

Something like

responses -> join -> split -> flatten -> lowercase -> count -> sort

followed by a human inspection to pick out, for instance, programming languages from other words (and combine synonyms).

>>> from collections import Counter
>>> from itertools import chain
>>> from re import split
>>> responses = ["Python,R,Julia", "R", "Python", "Python and R"]
>>> responses
['Python,R,Julia', 'R', 'Python', 'Python and R']
>>> Counter(chain.from_iterable(split(',| ', item) for item in responses))
Counter({'Python': 3, 'R': 3, 'Julia': 1, 'and': 1})

That ranking will give us an impression of what is most popular.

We could then look at votes/total responses or similar.
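For instance, continuing the session above, the counts could be normalised by the total number of responses (the variable names here are just illustrative):

>>> counts = Counter(chain.from_iterable(split(',| ', item) for item in responses))
>>> {word: votes / len(responses) for word, votes in counts.most_common()}
{'Python': 0.75, 'R': 0.75, 'Julia': 0.25, 'and': 0.25}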

I think this should help for your categories 2 and 3.

I think word clouds will be too qualitative to make decisions on (but would be useful for sharing out results).

(What if someone says "not Python"?)
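One possible guard, sketched here as an assumption rather than anything we've agreed: drop tokens that are preceded by "not" before counting.

>>> import re
>>> def positive_mentions(text):
...     # collect words negated as "not <word>", then exclude them
...     negated = set(re.findall(r'not\s+(\w+)', text.lower()))
...     words = [w for w in re.split(r'[,\s]+', text) if w]
...     return [w for w in words if w.lower() not in negated and w.lower() != 'not']
...
>>> positive_mentions("R and Julia, not Python")
['R', 'and', 'Julia']

Filler words like "and" would still need the human pass described above.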

drchriscole commented 1 year ago

There will be some questions (open-ended comments) that will need a human interpreter; otherwise, an attempt to categorise responses makes sense.

edwardchalstrey1 commented 1 year ago

@JimMadge your idea is roughly what I've been doing for languages. I've also found that some bespoke rules, based on eyeballing the data, are needed.

So the process is like

  1. some combo of join -> split -> flatten -> lowercase -> count -> sort
  2. extra logic specific to edge cases, rather than attempting a fully automated tested pipeline

Step 2 is dataset specific, and we wouldn't want to do this if it needed to be reproducible with new survey data, but this is just being done once, so we don't care.
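As a rough sketch of the two-step process (the step 2 rule below is invented purely for illustration):

from collections import Counter
from re import split

def count_mentions(responses):
    # Step 1: split -> flatten -> lowercase -> count (most_common gives the sort)
    words = (word.lower()
             for item in responses
             for word in split(r'[,\s]+', item)
             if word)
    return Counter(words)

def apply_edge_case_rules(counts):
    # Step 2: bespoke fixes found by eyeballing the data,
    # e.g. folding a stray "r." token into "r" (invented example)
    if 'r.' in counts:
        counts['r'] += counts.pop('r.')
    return counts

counts = apply_edge_case_rules(count_mentions(["Python,R,Julia", "R", "Python", "Python and R"]))
print(counts.most_common())  # [('python', 3), ('r', 3), ('julia', 1), ('and', 1)]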

JimMadge commented 1 year ago

> Step 2 is dataset specific, and we wouldn't want to do this if it needed to be reproducible with new survey data, but this is just being done once, so we don't care.

I think that's not entirely true, and it would still be good to share what those extra steps are.

We could, for example, keep lists of synonyms (e.g. "PyPI == Python Package Index") and document how those are applied. It wouldn't be perfect, but if we got a few more responses it would probably still work, and it would be a good starting point for further work.
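For example, a synonym map could be applied before counting; the entries below are placeholders, not an agreed list:

>>> from collections import Counter
>>> synonyms = {'pypi': 'python package index'}  # placeholder entry
>>> def canonical(word):
...     # map a word to its canonical form, defaulting to its lowercase
...     return synonyms.get(word.lower(), word.lower())
...
>>> Counter(canonical(w) for w in ['PyPI', 'Python Package Index', 'CRAN'])
Counter({'python package index': 2, 'cran': 1})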

edwardchalstrey1 commented 1 year ago

from @harisood

Survey responses (free text, easily categorisable) example - this isn't actual analysis, just an example of what results could look like:

Programming language support

Summary

The percentage splits of responses to the SATRE survey question of what programming languages should be supported in a TRE.

Detail

Survey Results

Sorted from largest to smallest percentage:

| Option | % Responses |
| --- | --- |
| Python | x% |
| R | y% |
| C# | z% |
| ... | ... |

Summary blurb, e.g.: The community strongly favoured support for Python and R, with a variety of other languages less frequently called for.
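For illustration only, a table like the one above could be produced directly from the counts (the numbers below are placeholders, not survey results):

>>> from collections import Counter
>>> counts = Counter({'Python': 30, 'R': 25, 'C#': 5})  # placeholder numbers
>>> total = sum(counts.values())
>>> for option, votes in counts.most_common():
...     print(f'| {option} | {100 * votes / total:.0f}% |')
...
| Python | 50% |
| R | 42% |
| C# | 8% |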

Where

Specification features

Proposal