langcog / metalab2

MetaLab -- Community-augmented meta-analysis
http://metalab.stanford.edu/
MIT License

Developmental curves report not working (404 error) #98

Closed · juliacarbajal closed this issue 5 years ago

juliacarbajal commented 5 years ago

This page is not working: http://metalab.stanford.edu/reports/developmental_curves.html

amsan7 commented 5 years ago

The rendering seems to fail at this chunk: https://github.com/langcog/metalab2/blob/master/reports/developmental_curves.Rmd#L67.

with the error: all(weights >= 0) is not TRUE

The Rmd doesn't seem to have changed, so maybe the data did. FYI @juliacarbajal: to debug this locally, it's helpful to run the main_builder.R script, as I explain in this comment: https://github.com/langcog/metalab2/issues/76#issuecomment-411515481

juliacarbajal commented 5 years ago

Ok, I found the bug: row 27 of the Statistical Segmentation MA has, for some reason, a correlation corr > 1, which causes d_var_calc to be negative, which is wrong. I checked, and luckily this is the only row in all of MetaLab with this problem. It can be fixed either by discarding rows with negative variance or by restricting corr to lie in [-1; 1]. I will try to investigate how corr ended up above 1; this might be a problem with the formula for calculating corr and should be addressed.
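
(For context, a minimal sketch of why an out-of-range correlation breaks the report. The formula below is an assumed stand-in for the within-subjects variance computed in compute_es.R, not a copy of it; the point is just that the 2 * (1 - corr) term goes negative when corr > 1, and metafor then rejects the negative weights with exactly the error quoted above.)

```r
# Assumed within-subjects variance formula (illustrative, not copied
# from compute_es.R): the 2 * (1 - corr) term flips negative if corr > 1.
d_var_within <- function(d, n, corr) {
  (2 * (1 - corr)) / n + d^2 / (2 * n)
}

d_var_within(d = 0.5, n = 20, corr = 0.6)  #  0.04625 -> fine
d_var_within(d = 0.5, n = 20, corr = 1.2)  # -0.01375 -> negative variance;
                                           # metafor: all(weights >= 0) is not TRUE
```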

juliacarbajal commented 5 years ago

This seems to be the problem: correlations are imputed in cache_datasets.R before running compute_es.R: https://github.com/langcog/metalab2/blob/674af269a0ab17c5d29317be8064c6b4c909a9d4/scripts/cache_datasets.R#L106-L114

Now, the thing is that only empty values are imputed, while rows that do have a value (even if the value is incorrect) are left as they appear in the spreadsheet. In the case of row 27 from StatSegm, the correlation value above 1 was already written like that in the spreadsheet on the Drive, so, because the cell had a value, it was not imputed. The row then enters compute_es.R, where the script realises the value of corr is incorrect, but all it can do is replace it with corr_imputed, which is this same value. So it stayed the same and caused the subsequent errors (negative variance and the crash of the developmental curves report).

I will modify cache_datasets.R to check the spreadsheets for incorrect values as well (using the same criteria as defined in compute_es.R, here: https://github.com/langcog/metalab2/blob/674af269a0ab17c5d29317be8064c6b4c909a9d4/scripts/compute_es.R#L63-L67), so that these values are imputed before entering compute_es.R.
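
(A minimal sketch of what such a pre-imputation check might look like, with the valid range (0.01, 0.99) taken from the compute_es.R criteria linked above; the toy data frame and exact code are illustrative, not the actual patch.)

```r
# Toy stand-in for one cached MA dataset (illustrative values).
dataset <- data.frame(corr = c(0.45, NA, 1.06, 0.60, -0.20))

# Step 1: treat out-of-range correlations as missing, using the same
# criteria as compute_es.R (valid range taken as (0.01, 0.99)).
bad <- which(dataset$corr < 0.01 | dataset$corr > 0.99)
dataset$corr[bad] <- NA

# Step 2: the existing imputation logic then fills missing values by
# sampling from the observed correlations (seeded, per cache_datasets.R).
set.seed(111)
observed <- dataset$corr[!is.na(dataset$corr)]
missing  <- is.na(dataset$corr)
dataset$corr_imputed <- dataset$corr
dataset$corr_imputed[missing] <- sample(observed, sum(missing), replace = TRUE)
```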

juliacarbajal commented 5 years ago

I implemented the change, but I have a question. I noticed that there were other MAs with corr values outside the defined range of (0.01, 0.99), particularly negative values: for instance, there is one row with a negative corr in the AGL_soundsDL MA, and another in the Abstract Rule Learning MA. With the change I just made, these rows now get an imputed correlation instead of using the one in the spreadsheet. This causes a small problem. Since the imputation randomly (although with a seed) draws the missing values from the list of observed correlations, whenever the pool of possible values to draw from changes (e.g. when a row is removed), all of the previously imputed rows get a different value than they had before the change (see the sketch below). Since the correlation is used to compute the ES, this affects not just the ES of the one row that changed, but also that of every row in the dataset with an imputed value, and this in turn affects (although not greatly) the results of the MA. Because of the way the imputation is done (even without my latest change to the code), this is bound to happen every time a dataset changes, as long as there are imputed values in it. Is this problematic?
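
(A toy illustration of the effect described above; the numbers and the impute helper are made up, but the mechanism is the same: with a fixed seed, a changed pool reshuffles every draw.)

```r
# Toy illustration: with a fixed seed, changing the pool of observed
# correlations changes all the imputed values, not just the affected row.
impute <- function(observed, n_missing, seed = 42) {
  set.seed(seed)
  sample(observed, n_missing, replace = TRUE)
}

before <- impute(c(0.3, 0.5, 0.7, 0.9), n_missing = 3)
after  <- impute(c(0.3, 0.5, 0.7),      n_missing = 3)  # one observed row dropped
identical(before, after)  # almost certainly FALSE: every draw can shift
```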

shotsuji commented 5 years ago

Thanks for implementing the changes! I have two points to make.

First, I just looked at the negative correlations in these two MAs. I am pretty sure we could just recode them as positive: in both cases the curators seem to have gotten the values from the authors, and though corr might have been negative, that doesn't matter to our specific computation (in fact, we might want to add this detail to the codebook/specs). I suggest you check with the curators whether they agree to this change and then implement it.

Second, to your question about all ES (or rather, their weighting) changing depending on the imputed corrs: that's absolutely right. My opinion is that it would be ok for the imputed values to change each time a new row without a reported corr is added, since we work under the assumption that the imputation is random and should not qualitatively affect the results. One thing I've done in the published sound symbolism paper is to rerun the imputations 100 times to make sure the results are stable (see the sketch below). We could consider adding a little tutorial/code section on how to do this to the website (along with a disclaimer/warning that what you describe might happen).
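
(A hedged sketch of that stability check: rerun the random imputation many times and inspect the spread of a downstream summary. The toy vector and the use of mean() as a stand-in for the meta-analytic estimate are illustrative; the real check would refit the MA model on each rerun.)

```r
# Sketch: rerun the imputation 100 times and check the spread of the result.
corrs <- c(0.3, 0.5, NA, 0.7, NA, 0.9)   # toy corr column with missing values
observed <- corrs[!is.na(corrs)]

estimates <- sapply(1:100, function(seed) {
  set.seed(seed)
  filled <- corrs
  filled[is.na(filled)] <- sample(observed, sum(is.na(filled)), replace = TRUE)
  mean(filled)                            # stand-in for the real MA estimate
})

range(estimates)  # a narrow spread suggests imputation noise is negligible
```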

juliacarbajal commented 5 years ago

So, about the negative correlations: at first I thought this variable could take any value from -1 to 1, which is why I mentioned this range in an earlier comment. I finally implemented the positive-only range because that is what I found in the compute_es.R script. I was actually wondering about that, since having negative values, as long as they're not below -1, does not produce any errors, negative variances, or anything. However, I don't think changing the corr value from negative to positive would be a good idea, as the formula is not symmetric in this respect (that is, the sign does affect the result). For example, in the "t_two" ES method defined in compute_es, you have:

```r
wc <- sqrt(2 * (1 - corr))
d_calc <- (t / sqrt(n_1)) * wc
```

So if you have t = 2, n_1 = 20, and corr = -0.2, you get d_calc = 0.69, while with corr = +0.2 you get d_calc = 0.57. That's a difference of 0.1 in d, which is quite a lot in my opinion. I think it might be better to simply accept any value within [-1; 1], both in cache_datasets and in compute_es. Unless there is some reason the corr used in the ES computations is always assumed to be positive?
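
(Checking those numbers with the formula quoted above:)

```r
# Quick numerical check of the asymmetry, using the t_two formula above.
d_from_t <- function(t, n_1, corr) (t / sqrt(n_1)) * sqrt(2 * (1 - corr))

d_from_t(t = 2, n_1 = 20, corr = -0.2)  # 0.693
d_from_t(t = 2, n_1 = 20, corr =  0.2)  # 0.566
```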

About the second point, I agree: the changes are actually not big, so perhaps it's ok to accept that it will change from time to time, especially as the dataset evolves. The idea of running it 100 times sounds good. Can you add that to a separate issue? :)

shotsuji commented 5 years ago

Hi Julia, the correlation should be between 0 and 1, since we want to express how strongly the pre- and post-scores are related (no matter the directionality). (@christinabergmann, just to be sure I'm not totally getting something wrong here, can you confirm that corr should be between [0; 1]?) The correlated samples section of this paper is really good! But: good catch to point out that negative corrs ARE reported and our script does not automatically turn them into absolute values (I just checked -- guilty myself in my own DBs!!). So we might want to add this conversion, pre-impute, to the ES_calc scripts!

juliacarbajal commented 5 years ago

Oh, I see! Ok, in that case, yes: values in the corr column should be converted with abs() before running the imputation. Should I go ahead and do that, or should I wait for Christina to confirm?
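
(The conversion itself would be a one-liner before the range check and imputation; the data frame and column name here are illustrative:)

```r
# Take absolute values of reported correlations before imputing, since only
# the strength of the pre/post relation matters here (abs(NA) stays NA).
dataset <- data.frame(corr = c(0.45, -0.20, NA, 0.60))
dataset$corr <- abs(dataset$corr)   # -0.20 becomes 0.20
```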

shotsuji commented 5 years ago

If you're on it anyway, please go ahead - I'm pretty sure about it :) Thanks Julia, SO good to have you on the team for a bit!

juliacarbajal commented 5 years ago

Done!

christinabergmann commented 5 years ago

Sorry for the late reply, yes, abs is correct, well spotted and fixed!
