voicesauce / opensauce-python

Voice analysis software (Python port of VoiceSauce)
Apache License 2.0

Design of CLI: looking for "user stories" #7

Open bitdancer opened 8 years ago

bitdancer commented 8 years ago

So, I think I understand the basics of how the Python code fits together. I might as well start by refining the command line interface, since that will improve the usability of the project.

To confirm my understanding: in general someone will start with a set of WAV files. These files may contain "markup", using a format whose name I don't remember and will need more information on (I believe this isn't actually implemented yet?). (Is this the textgrid?) The user then runs an analysis, producing some output files. The current process doesn't seem to write any output files, so I don't actually know what those are supposed to look like yet, and probably need some additional information on that as well. Currently the analysis process is driven by a combination of a 'parameters' file, which indicates which measurements to compute, and a 'settings' file, which provides parameters and thresholds for the measurements.

I presume that in voicesauce these parameters and settings were adjusted in a GUI before each processing run, with some sort of persistent storage of whatever was set.

I presume the model behind the current CLI is that the user will have a set of WAV files to process, will set up parameters and settings for the goal of that analysis, and will run the process... and then what? Use the output for further analysis? Tweak the parameters and settings and run it again? Run the same analysis on different input files? Most likely all of the above at one time or another?

What I'm looking for here is what the software development community calls "user stories": a short description of various scenarios in which the project might get used. For example:

"Kristine has a set of WAV files of people saying XXXX. She wants to compare YYYY between the files, and then correlate that with ZZZZ. To do this she will select measurements A, B, and C, constrained by settings M, N, and P, run the analysis, and inspect the output file(s)."

The above is very rough; I don't care about the details I've represented by letters. What I want to know is what the various possible data, goals, and workflows are, and what the project needs to allow the user to do in order to achieve those goals. I'm thinking there are probably multiple scenarios, and that there are pieces missing from the above simple example (like marking up the WAV files). In particular, talk about what you need to do with the output of the processing run once you have it (perhaps we want multiple output formats?), or about possible sources of input, if a simple directory of WAV files is not the only way the user might want to organize the input data.

So, please supply me with some user stories, and I'll suggest improvements to the CLI to facilitate them, and then implement them as I go along.

bitdancer commented 8 years ago

I hope that didn't just send a whole barrage of emails while I figured out how to format the user story example... sorry, I forgot to use the preview button!

krismyu commented 8 years ago

> To confirm my understanding: in general someone will start with a set of WAV files. These files may contain "markup", using a format whose name I don't remember and will need more information on (I believe this isn't actually implemented yet?). (Is this the textgrid?) The user then runs an analysis, producing some output files. The current process doesn't seem to write any output files, so I don't actually know what those are supposed to look like yet, and probably need some additional information on that as well. Currently the analysis process is driven by a combination of a 'parameters' file, which indicates which measurements to compute, and a 'settings' file, which provides parameters and thresholds for the measurements.

Yes, that is correct. Someone starts with a directory of audio files, WAV files. These files aren't marked up. They can be accompanied by a text file created by a software program called Praat that is used for analysis of speech files; these are called "TextGrid" files and are usually indicated by a file extension ".TextGrid". The text file contains information about important timestamps in the audio file, which are determined by the user. Typically, the timestamps mark important intervals of time in the audio file, like a word, or a vowel, or something else of interest for analysis. These intervals are also often labeled with some string, e.g. if the interval spans the utterance of the word "man", the label might be the string "man"; the labels are also stored in the ".TextGrid" file. The information in TextGrid files is read in by a utility function; in the Octave version, that function appears to be here, and there also appears to be a Python version here.
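Concretely, the interval data the analysis needs from a TextGrid boils down to (xmin, xmax, label) triples. Here's a minimal sketch of pulling those out of Praat's long TextGrid format with a regex; this is illustrative only (not the project's actual reader, which is linked above), and a real parser would also track tiers and file encoding:

```python
import re
from typing import List, NamedTuple

class Interval(NamedTuple):
    xmin: float   # interval start time (seconds)
    xmax: float   # interval end time (seconds)
    label: str    # text label, e.g. "man"

def read_intervals(textgrid_text: str) -> List[Interval]:
    """Extract labeled intervals from a Praat long-format TextGrid.

    Matches each xmin/xmax/text triple; tier headers have xmin/xmax
    but no following 'text' line, so they are skipped automatically.
    """
    pattern = re.compile(
        r'xmin\s*=\s*([\d.]+)\s*\n'
        r'\s*xmax\s*=\s*([\d.]+)\s*\n'
        r'\s*text\s*=\s*"([^"]*)"'
    )
    return [Interval(float(a), float(b), lab)
            for a, b, lab in pattern.findall(textgrid_text)]
```

Empty-label intervals (silence) come through with `label == ""`; a caller would typically filter those out before analysis.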

More on what the TextGrids are used for in user story descriptions.

> I presume that in voicesauce these parameters and settings were adjusted in a GUI before each processing run, with some sort of persistent storage of whatever was set.

That's correct; the original Matlab version works like this.

> The current process doesn't seem to write any output files, so I don't actually know what those are supposed to look like yet, and probably need some additional information on that as well.

I've attached a sample output file from the Matlab version. vs-f1-9.txt

As you can see, this is a tab-delimited file (the user should be able to set the delimiter; here I set tab), with one row per time interval in the file analyzed; the time interval information comes from the TextGrids. Each row gives the filename, the Label (the label of the time interval, extracted from the file's corresponding TextGrid file), seg_Start (the time at which the interval starts), seg_End (the time at which the interval ends), and then a large number of measurements, for instance H1c_mean, H1c_means001, ..., H1c_means009. For this file, the user has asked the software to divide each time interval evenly into 9 slices and take the average H1c over each slice. H1c_mean is the overall mean over the entire time interval; H1c_means001 is the H1c measured in time slice 1.
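The per-slice averaging can be sketched roughly like this (illustrative only; VoiceSauce divides the time interval itself, which for a uniform frame shift amounts to dividing the per-frame measurement track):

```python
def slice_means(values, n_slices=9):
    """Split a sequence of per-frame measurements into n_slices evenly
    sized chunks and return (overall_mean, [mean of each chunk]).

    overall_mean corresponds to e.g. H1c_mean; the per-chunk means
    correspond to H1c_means001 ... H1c_means009.
    """
    overall = sum(values) / len(values)
    means = []
    for i in range(n_slices):
        lo = i * len(values) // n_slices
        hi = (i + 1) * len(values) // n_slices
        chunk = values[lo:hi]
        means.append(sum(chunk) / len(chunk) if chunk else float("nan"))
    return overall, means
```

Intervals shorter than `n_slices` frames would leave some chunks empty; this sketch emits NaN for those, but the real code would need a policy for that edge case.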

Here is a first try at some user stories---I'm not sure if this is what you're looking for, so let me know if I'm not providing what you need. I can also provide sample input, and sample output, from the working Matlab version, as an example.

  1. User story 1: user has a set of WAV files from some speaker. User wants to measure various parameters, constrained by settings (such as minimum and maximum pitch considered in estimating pitch, etc.). The user will run the analysis, and then tell voicesauce to output the results to a text file. These results then typically get read into R for statistical analysis.
  2. User story 2: user has a set of WAV files from some speaker, with an accompanying TextGrid for each WAV file. User wants to measure various parameters of their choosing, constrained by settings (such as minimum and maximum pitch considered in estimating pitch, etc.). The user will run the analysis and then tell voicesauce to output the results to a text file, averaging over each time slice. These results then typically get read into R for statistical analysis.
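A CLI serving these two stories might look roughly like the sketch below. Everything here is hypothetical — the option names and defaults are illustrative, not the project's actual interface:

```python
import argparse

def build_parser():
    """Hypothetical CLI reflecting the two user stories above:
    story 1 omits --use-textgrids; story 2 adds it plus --subsegments."""
    p = argparse.ArgumentParser(prog="opensauce")
    p.add_argument("wavfiles", nargs="+", help="input WAV files")
    p.add_argument("--measurements", default="all",
                   help="comma-separated measurements to compute (e.g. H1c,CPP)")
    p.add_argument("--settings", help="settings file with algorithm thresholds")
    p.add_argument("--use-textgrids", action="store_true",
                   help="read segmentation from accompanying .TextGrid files")
    p.add_argument("--subsegments", type=int, default=0,
                   help="number of evenly spaced sub-intervals to average over")
    p.add_argument("--delimiter", default="\t",
                   help="output field delimiter (default: tab)")
    p.add_argument("-o", "--output", help="output text file")
    return p
```

For example, user story 2 would be something like `opensauce *.wav --use-textgrids --subsegments 9 -o results.txt`, with the results file then read into R.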
krismyu commented 8 years ago

I'll make a short video today showing you the flow of input -> output in the old Matlab version, including a sample directory of files; hopefully that will make things clearer.

bitdancer commented 8 years ago

That would be great. You've provided me good information here. Sample input and output files and what settings were used to produce the output from the input (so I can confirm I'm getting the right results) would be most helpful. I'll use those in the unit tests, most likely.

krismyu commented 8 years ago

So here's a short video I made showing the VoiceSauce GUI to give a little more idea of the different possible user stories. You can find it here: https://www.dropbox.com/sh/8r91mzeu3q2ydzx/AABXPsO7lKcSZd2sJ4QQP_Oza?dl=0, along with a directory with a handful of files with accompanying TextGrids that could be played with in unit tests. I selected the files to have a range of voice qualities, "creaky" and "breathy" to try to be representative of the range of voice qualities someone might be trying to analyze.

bitdancer commented 8 years ago

That video gave me a lot of great info. It's too bad it is blurry... I could make out most of the labels, but not all of them. I'll have some questions, but I'll ask them when I get to dealing with those particular parts.

I don't see any sample output files with the sample input files. Can you run some analysis on them and provide the output files along with what measurements and parameters you used to produce them?

krismyu commented 8 years ago

Here are some screenshots from the GUI to help with blurry pictures in the video:

  1. Parameter estimation window: here's where you select the parameters you want to estimate (two screenshots, capturing parameters at the top and bottom of the window after scrolling down): [screenshots]
  2. Settings window: user inputs values for various arguments to algorithms (some specific to particular algorithms, some used for multiple/all algorithms):

[screenshot]

  3. Output to text window: in Matlab, there is a binary "m" file that is created with the results of the measurements. This window is then where you make the selections about how you want to output the data to text: whether or not to use TextGrid segmentation, how many evenly spaced intervals over a segmented interval in the TextGrid you want measurements estimated for, which parameters to include, etc.

[screenshot]

Here's a list of the parameters that appear in the output-to-text window, which you can ask for in the output text file; I grabbed it from the original Matlab file func_getoutputparameterlist.m. I think you can ignore the parameters with "Other" (that was supposed to be for when the user specified their own measurement algorithm for something, but we'll probably implement that differently). The distinction between parameters marked with a "c" or an asterisk, versus those marked with a "u", has to do with whether the parameters were "corrected" or "uncorrected" with the function func_correct_iseli_z for different vowel qualities; users might want both uncorrected and corrected values.

paramlist = {'H1* (H1c)', ...
             'H2* (H2c)', ...
             'H4* (H4c)', ...
             'A1* (A1c)', ...
             'A2* (A2c)', ...
             'A3* (A3c)', ...
             'H1*-H2* (H1H2c)', ...
             'H2*-H4* (H2H4c)', ...
             'H1*-A1* (H1A1c)', ...
             'H1*-A2* (H1A2c)', ...
             'H1*-A3* (H1A3c)', ...
             'CPP (CPP)', ...
             'Energy (Energy)', ...
             'HNR05 (HNR05)', ...
             'HNR15 (HNR15)', ...
             'HNR25 (HNR25)', ...
             'HNR35 (HNR35)', ...    
             'SHR (SHR)', ...
             'H1 (H1u)', ...
             'H2 (H2u)', ...
             'H4 (H4u)', ...
             'A1 (A1u)', ...
             'A2 (A2u)', ...
             'A3 (A3u)', ...
             'H1-H2 (H1H2u)', ...
             'H2-H4 (H2H4u)', ...
             'H1-A1 (H1A1u)', ...
             'H1-A2 (H1A2u)', ...
             'H1-A3 (H1A3u)', ...          
             'F0 - Straight (strF0)', ...
             'F0 - Snack (sF0)', ...
             'F0 - Praat (pF0)', ...
             'F0 - SHR (shrF0)', ...          
             'F0 - Other (oF0)', ...
             'F1 - Snack (sF1)', ...
             'F2 - Snack (sF2)', ...
             'F3 - Snack (sF3)', ...
             'F4 - Snack (sF4)', ...
             'F1 - Praat (pF1)', ...
             'F2 - Praat (pF2)', ...
             'F3 - Praat (pF3)', ...
             'F4 - Praat (pF4)', ...
             'F1 - Other (oF1)', ...
             'F2 - Other (oF2)', ...
             'F3 - Other (oF3)', ...
             'F4 - Other (oF4)', ...
             'B1 - Snack (sB1)', ...
             'B2 - Snack (sB2)', ...
             'B3 - Snack (sB3)', ...
             'B4 - Snack (sB4)', ...
             'B1 - Other (oB1)', ...
             'B2 - Other (oB2)', ...
             'B3 - Other (oB3)', ...
             'B4 - Other (oB4)', ...
             };
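For the Python port, the output code will presumably want the short identifiers (the part in parentheses) rather than the full display names. A small helper sketch for extracting them — hypothetical, not existing code:

```python
import re

def short_code(display_name):
    """Pull the short identifier out of a VoiceSauce display name,
    e.g. 'H1*-H2* (H1H2c)' -> 'H1H2c'.

    Falls back to the full name if no trailing parenthesized code is found.
    """
    m = re.search(r'\(([^)]+)\)\s*$', display_name)
    return m.group(1) if m else display_name
```

This would let the CLI accept the short codes (H1c, CPP, strF0, ...) as measurement names while still displaying the full names in help text.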
krismyu commented 8 years ago

And here are some output files I generated by running VoiceSauce at default settings on the sample input files; the output files are located here:

When I say "using XXX for parameter estimation", where "XXX" might be Straight f0, Snack f0, Praat f0, or SHR, I mean which measurement of fundamental frequency (f0) was used for the downstream measurement algorithms that involve f0 (this is set in "Settings"). The parameters pF5, pF6, and pF7 were not measured, so in the output text file all values for these parameters are 0 (this can also be set in "Settings", under the "Not a number" label). In some use cases, even if the user decides to use one particular f0 algorithm for the downstream measurements, they might still want to estimate f0 using the other algorithms as well (although probably not usually! Straight f0 in particular takes a long time to run, so a user using, say, Praat f0 downstream might select only Praat f0 and skip the other f0 measurements). I've included all f0 measurement algorithms in each run of the software, even though only the values calculated from one of them are used for the downstream measurement algorithms.

I always used Snack for formant estimation in these calculations (using Praat is another option; I could also generate output files for that later if desired).

I also always used TextGrid segmentation labels for the output. (Currently, if you uncheck that box in the GUI, the program crashes! We can imagine use cases where a user has no TextGrid, though, so it would be good to have that option eventually as well; the output text file would then give all the desired measurements at the specified frame shift, across the whole audio file. By default the frame shift is 1 ms, so one measurement is estimated every 1 ms.)
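For that no-TextGrid case, the measurement timestamps are just a uniform grid over the file at the frame shift. A tiny sketch (hypothetical helper, not existing code):

```python
def frame_times_ms(duration_ms, frame_shift_ms=1.0):
    """Timestamps (in ms) at which measurements would be estimated
    over a whole file, at a fixed frame shift (default 1 ms)."""
    times = []
    t = 0.0
    while t < duration_ms:
        times.append(t)
        t += frame_shift_ms
    return times
```

With TextGrid segmentation, the same grid would instead start at the first labeled interval, as described for output-strf0-1ms.txt below.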

  1. output-strf0-1ms.txt. Using Straight f0 for parameter estimation, outputting all parameters to text (except "Other" parameters), using TextGrid segmentation, but with no subsegments. This writes out all the data, one measurement for each parameter per millisecond (see column t_ms), starting with the first labeled interval in the TextGrid. The reason it's every one millisecond is because the frame shift in the "Settings" window was set to 1 ms.
  2. output-strf0-9seg.txt. Using Straight f0 for parameter estimation, using TextGrid segmentation, with 9 evenly spaced sub-intervals ("sub-segments") per labeled interval. Means are calculated over each of the 9 subsegments in a labeled interval, as well as a mean over the entire labeled interval.
  3. output-sf0-1ms.txt. Using snack f0 for parameter estimation, no subsegments.
  4. output-sf0-9seg.txt. Using snack f0 for parameter estimation, 9 subsegments.
  5. output-pf0-1ms.txt. Using praat f0 for parameter estimation, no subsegments.
  6. output-pf0-9seg.txt. Using praat f0 for parameter estimation, 9 subsegments.
  7. output-shrf0-1ms.txt. Using shr f0 for parameter estimation, no subsegments.
  8. output-shrf0-9seg.txt. Using shr f0 for parameter estimation, 9 subsegments.