nickduran / align-linguistic-alignment

Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
MIT License

Add riccardo comments #15

Closed nickduran closed 6 years ago

nickduran commented 6 years ago

The following are Riccardo's notes; the items marked with an "X" have been addressed in this feature branch. One change that is not on this list: I removed Tasa from the spell-check training corpus, so it now uses only Gutenberg texts. I have updated the notebook and comments accordingly.

  1. This function: model_store = prepare_transcripts(...
     [X] 1.1. This code "hard-codes" the example directory instead of using the parameters above (here modified to match my example).
     [ ] 1.2. The same code could also provide (in case of failure) a reminder of the specifics required for the dataset (participant, tab, content).
     [ ] 1.3. One could also allow the column names and delimiter to be specified as parameters.
     [ ] 1.4. For big corpora, it might be handy to be able to turn off the printed output (e.g., via a parameter).
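Items 1.2 to 1.4 could be sketched roughly as follows. This is only an illustration using the standard library, not the actual prepare_transcripts code: the function name read_transcript and the parameters delimiter, participant_col, content_col, and verbose are all hypothetical.

```python
import csv
import io


def read_transcript(handle, delimiter="\t", participant_col="participant",
                    content_col="content", verbose=True):
    """Read one transcript and check that the expected columns exist.

    Hypothetical helper: every name and default here is an assumption
    made for illustration, not part of the library.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [c for c in (participant_col, content_col) if c not in header]
    if missing:
        # Item 1.2: remind the user what the input is expected to look like.
        raise ValueError(
            "Missing column(s) %s. Expected a file delimited by %r with "
            "columns %r and %r, but found %r."
            % (missing, delimiter, participant_col, content_col, header))
    rows = [(row[participant_col], row[content_col]) for row in reader]
    if verbose:
        # Item 1.4: printed output can be switched off for big corpora.
        print("Read %d turns" % len(rows))
    return rows


# Small in-memory example; a real call would pass an open file handle.
sample = io.StringIO("participant\tcontent\nA\thello there\nB\thi\n")
turns = read_transcript(sample, verbose=False)
```

Passing the column names and delimiter explicitly (item 1.3) also means the error message can state exactly what was expected.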

  2. This function: [turn_real,convo_real]= calculate_alignment(..
     [X] 2.1. Same as 1.1.
     [X] 2.2. The link to the align_concatenated_dataframe.txt file is broken, since that file is saved within the example folder and not the main one.
     [ ] 2.3. I get a warning:
         /Users/au209589/anaconda3/envs/ipykernel_py2/lib/python2.7/site-packages/scipy/spatial/distance.py:644: RuntimeWarning: invalid value encountered in double_scalars
           dist = 1.0 - uv / np.sqrt(uu * vv)
     [X] 2.4. In one file, a child used only one-word sentences. Once the file is cleaned, this leads to an error in the function (it finds only one interlocutor) at this line:
         df_B = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[1]]
         We might put an exception catch there and a meaningful warning to the user. FIX: the check
         if len(dataframe) > 0:
         can be replaced with
         if len(dataframe) > 1: # or higher if one thinks it should be higher.
     [ ] 2.5. What about adding a parameter that specifies the minimum number of turns per speaker that must be present for the analysis to run on that file?
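One possible shape for the guard discussed in 2.4 and 2.5 is sketched below. It is not the fix applied in this branch: the function name check_speakers and the min_turns parameter are hypothetical, and the check works on a plain list of speaker labels rather than on the dataframe.

```python
from collections import Counter


def check_speakers(participants, min_turns=1):
    """Return (ok, message) for a list of per-turn speaker labels.

    Hypothetical helper: `min_turns` is the assumed "minimum number of
    turns per speaker" parameter suggested in item 2.5.
    """
    counts = Counter(participants)
    if len(counts) < 2:
        # The case from item 2.4: cleaning left only one interlocutor.
        return False, "found %d interlocutor(s); at least 2 are needed" % len(counts)
    too_few = sorted(str(s) for s, n in counts.items() if n < min_turns)
    if too_few:
        return False, "fewer than %d turn(s) for: %s" % (min_turns, ", ".join(too_few))
    return True, "ok"


ok, message = check_speakers(["A", "A", "A"])
if not ok:
    print("Skipping file: " + message)
```

Returning a message instead of raising lets the caller decide whether to skip the file with a warning or stop the whole run.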

  3. This function: [turn_surrogate,convo_surrogate] = calculate_baseline_alignment(...
     [X] 3.1. Same as 1.1.
     [ ] 3.2. Should we add the dyad/condition labels as parameters at the beginning?
     [X] 3.3. cond and dyad should be better mentioned in the paper as requisites for the filenames > NOW HIGHLIGHT THIS IN THE README FOR THE NOTEBOOK
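To illustrate 3.2 and 3.3, here is a minimal sketch of reading the dyad and condition from a file name when the labels are passed as parameters. The function name parse_labels, the parameter names, the example file names, and the assumption that identifiers are numeric are all hypothetical and may not match how the library parses file names.

```python
import re


def parse_labels(filename, dyad_label="dyad", condition_label="cond"):
    """Extract the dyad and condition identifiers from a file name.

    Hypothetical helper: assumes each label is immediately followed by
    a run of digits, e.g. "dyad12-cond3.txt".
    """
    dyad = re.search(re.escape(dyad_label) + r"(\d+)", filename)
    cond = re.search(re.escape(condition_label) + r"(\d+)", filename)
    if dyad is None or cond is None:
        raise ValueError(
            "File name %r must contain %r and %r, each followed by a number"
            % (filename, dyad_label, condition_label))
    return dyad.group(1), cond.group(1)


print(parse_labels("dyad12-cond3.txt"))
print(parse_labels("time191-cond1.txt", dyad_label="time"))
```

Raising a descriptive error when a label is missing would surface the file-name requirement early, instead of leaving it to the README alone.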

a-paxton commented 6 years ago

@nickduran, should we try to fix 2.3 before pushing, or is this a limited issue with known parameters for breakdown?
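For context on 2.3: the expression in the warning, dist = 1.0 - uv / np.sqrt(uu * vv), evaluates to 0/0 when either vector is all zeros, which is one plausible cause here (not confirmed in this thread). A minimal pure-Python sketch of an explicit guard follows; the function name cosine_similarity is hypothetical and this is not the SciPy implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity that returns NaN explicitly for zero vectors.

    Hypothetical helper: it mirrors the uv / sqrt(uu * vv) term from the
    warning, but checks for a zero norm before dividing.
    """
    uv = sum(a * b for a, b in zip(u, v))
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    if uu == 0 or vv == 0:
        # Similarity is undefined for an all-zero vector.
        return float("nan")
    return uv / math.sqrt(uu * vv)


score = cosine_similarity([0, 0, 0], [1, 2, 3])
```

Returning NaN keeps the undefined case visible in the output without emitting a runtime warning; whether such turns should instead be dropped is a separate decision.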