neuroelectro / neuroelectro_org

The NeuroElectro Project: Compiling information on neuron electrophysiology through literature text-mining.
neuroelectro.org
GNU General Public License v2.0

Additional summary stats for low sample size properties using simulated values? #290

Closed JustasB closed 8 years ago

JustasB commented 8 years ago

@stripathy,

The property means and SDs reported in the UI and the API are based on the number of papers for each property. However, many of the papers report their own n's, means, and SDs. That additional per-paper detail could be used to get narrower confidence intervals (CIs) for the property means.

I know that individual paper data can be downloaded for each property value using the API. For properties with low paper counts, the n, mean, and SD values reported in each paper could be used to generate simulated data points from a distribution with those parameters. Those points could then be pooled across papers, and summary stats of the pooled population computed for each property. The new stats would have N = sum of the per-paper n's, usually making the CI of each property mean much narrower.
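To make the idea concrete, here is a minimal sketch of the simulate-and-pool approach (the per-paper numbers are made up purely for illustration):

```python
import numpy as np

# Hypothetical per-paper (n, mean, SD) triples for one property,
# e.g. resting membrane potential in mV; illustrative numbers only.
papers = [(12, -62.0, 4.5), (8, -58.5, 6.1), (25, -60.2, 3.8)]

rng = np.random.default_rng(0)

# Simulate n points per paper from a normal with that paper's mean and SD,
# then pool them into a single sample across papers.
pooled = np.concatenate([rng.normal(mean, sd, size=n) for n, mean, sd in papers])

N = pooled.size                              # N = sum of per-paper n's
grand_mean = pooled.mean()
grand_sem = pooled.std(ddof=1) / np.sqrt(N)
print(N, grand_mean, grand_mean - 1.96 * grand_sem, grand_mean + 1.96 * grand_sem)
```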

Including such simulated values would make a really useful optional feature for the website UI and the API -- I understand that not everyone would want to use those values.

Has there been any discussion along these lines? I would use such a feature for my research, and could code up a PR to optionally include these values in the API calls and the website UI.

Thoughts?

stripathy commented 8 years ago

Thanks for bringing this up @JustasB .

In a recent commit, @dtebaykin and I added a feature to text-mine and curate SD and N info per paper when it's provided by the authors. If you look at the summary spreadsheet here: http://neuroelectro.org/static/src/article_ephys_metadata_curated.csv, you'll see 6 terms per ephys property, e.g. for rmp: (rmp, rmp_raw, rmp_err, rmp_n, rmp_sd, rmp_note). The n term is provided by the authors, and the sd term is converted from the author-provided standard error (err) term.
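As a rough sketch of how those terms relate (the rmp column names are from the example above; the exact file formatting is an assumption):

```python
import numpy as np
import pandas as pd

# The curated summary spreadsheet linked above (assuming standard
# comma-separated formatting; adjust sep= if it turns out to be tab-delimited).
url = "http://neuroelectro.org/static/src/article_ephys_metadata_curated.csv"
df = pd.read_csv(url)

# The sd term is the author-provided standard error scaled by sqrt(n):
# SD = SEM * sqrt(n). Shown here for the rmp columns named above.
rmp_sd = df["rmp_err"] * np.sqrt(df["rmp_n"])
```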

In terms of the API, we only export the n and error terms (we don't disambiguate whether the error is a standard error or an SD).

I like the idea of using these N and SD parameters to help constrain the property CIs. If you'd like to contribute that feature to the API through a pull request, that would be really great. As for getting the feature onto the website UI, I've been lax about updating it, but would definitely consider it.

JustasB commented 8 years ago

Ok, sounds good. I should have some time to work on this in a few weeks, and will check in then.

stripathy commented 8 years ago

Thanks again @JustasB . My preference is to help get you the features you need for your research first and foremost; then we can consider adding such features to the neuroelectro codebase and API later.

rgerkin commented 8 years ago

@JustasB @stripathy The easiest way to do this is not to actually simulate new values, but to use the textbook formulas for estimating the mean and variance of pooled samples. The recipe is to compute a weight w_i for each of the papers, i = 1, 2, ..., n, according to w_i = 1/sem_i^2. So the weight given to each paper is the squared reciprocal of the standard error of the mean reported in that paper. Naturally this is a function of SD and sample size, so higher weight is given to reports with lower SD or larger sample size.

Then you compute the new grand mean and SE by using the weights in the usual way, i.e.:

grand_mean = gm = (w_1*mean_1 + w_2*mean_2 + ...) / (w_1 + w_2 + ...)

grand_variance = (w_1*(mean_1 - gm)^2 + w_2*(mean_2 - gm)^2 + ...) / (w_1 + w_2 + ...)

There are then different approaches to turn that grand variance into a grand SEM (and get confidence intervals, etc.), discussed here, but the simplest is to do the usual divide by n and take the square root. Or, if you really want accurate confidence intervals, you can compute the probability distribution of grand_mean directly from the data using Bayes' theorem, which would be cool.
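In code, a minimal sketch of that recipe (numbers are illustrative):

```python
import numpy as np

# Per-paper means and SEMs (illustrative numbers only).
means = np.array([-62.0, -58.5, -60.2])
sems = np.array([1.3, 2.2, 0.8])
n_total = 45  # total sample size pooled across papers (also illustrative)

w = 1.0 / sems**2                             # w_i = 1 / sem_i^2
gm = np.sum(w * means) / np.sum(w)            # grand mean
gv = np.sum(w * (means - gm)**2) / np.sum(w)  # grand variance
grand_sem = np.sqrt(gv / n_total)             # simplest recipe: divide by n, take sqrt
print(gm, gv, grand_sem)
```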

JustasB commented 8 years ago

Alright, that would be easier. I also found these:

Can you elaborate on this a bit more: "you can compute the probability distribution of grand_mean directly from the data using Bayes theorem"?

rgerkin commented 8 years ago

@JustasB

Without a prior (i.e. just using maximum likelihood estimation of the parameters from the data): write out the log-likelihood function L(data | model) = L(data_1 | model) + L(data_2 | model) + ..., where 'model' means the distribution and parameters you are trying to estimate (e.g. a normal distribution with mean mu and variance sigma^2), and data_1, data_2, ... are all of the data points from neuroelectro. Each L() is the log of the pdf of the distribution you are using, evaluated at that data point. Then find the values of your parameters that maximize L(data | model).

With a prior: use L(data | model) + L(model) instead, where L(model) is the log pdf of a prior on your parameters; e.g. the mean input resistance could be gamma distributed with some shape and scale parameters. This prior can be estimated from the entirety of neuroelectro (instead of just the cells being used in the likelihood function), and is slightly better than not using a prior, since not using a prior implicitly assumes that all parameter values are equally likely a priori, including negative ones and ridiculously large and small ones.

I will show you in the office.
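In the meantime, here's a rough sketch of the no-prior version (normal model, made-up data):

```python
import numpy as np
from scipy import optimize, stats

# All of the individual data points pooled from neuroelectro (illustrative values).
data = np.array([-62.0, -58.5, -60.2, -61.1, -59.7])

def neg_log_likelihood(params):
    mu, log_sigma = params       # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # -L(data | model): negative sum of log pdf terms for a normal(mu, sigma) model
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Maximize L(data | model) by minimizing its negative.
result = optimize.minimize(neg_log_likelihood, x0=[data.mean(), 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# With a prior, you would instead minimize
#   neg_log_likelihood(params) - prior_logpdf(params)
# e.g. stats.gamma.logpdf(...) for a gamma prior fit to all of neuroelectro.
print(mu_hat, sigma_hat)
```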

JustasB commented 8 years ago

@stripathy

For the OB mitral cells, the website shows that there is only one paper that includes the N for each property. The other papers show only the mean+SE, but no N.

http://neuroelectro.org/neuron/129/data/

I'm seeing the same in the API and in the CSV data table.

@rgerkin mentioned that you may have the Ns for the other papers, but they might not be curated. Can you let us know?

rgerkin commented 8 years ago

@stripathy API URL for e.g. V_rest: http://neuroelectro.org/api/1/nedm/?n=129&limit=100&e=3. I'm thinking that all the data from Nathan's lab may not have had their N's added to the database.
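For example, to pull those down programmatically (assuming the usual Tastypie-style "objects" envelope; the field names match the API examples later in this thread):

```python
import requests

# V_rest measurements for OB mitral cells: n=129 is the neuron id and
# e=3 the ephys property id, per the URL above.
resp = requests.get(
    "http://neuroelectro.org/api/1/nedm/",
    params={"n": 129, "e": 3, "limit": 100},
)
for m in resp.json().get("objects", []):
    print(m.get("err"), m.get("n"))
```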

stripathy commented 8 years ago

@JustasB

You're right - for OB mitral cells, at present there's only 1 paper that includes the N for each property. The other papers don't include the N because that data was curated by hand by Shawn Burton, another grad student working in Nathan's lab (you can see this because the Content Source for that data is UserSubmission). In more recent curation pushes, my curators are instructed to try to curate the N for each paper, but we haven't come across much new data on OB mitral cells.

JustasB commented 8 years ago

@stripathy ok, I see. Well, I could try fetching the N's from each paper for the mitral cell. There are only 23 papers.

If I get the N's, is there some efficient way you could update the DB with the values? If so, what format would work best for you?

JustasB commented 8 years ago

@stripathy @rgerkin I went through all the mitral cell articles and got the N's for each property.

I copied the table seen in the following link, and added "N" and "Notes" columns. http://neuroelectro.org/neuron/129/data/

I also found a few values that were missing or did not match the papers. I noted them and also reported them via the "Report miscurated data" feature.

Take a look at the following XLS: MitralCellNs.xlsx

Please update the DB with the missing N values and let me know.

CC @scrook

rgerkin commented 8 years ago

@stripathy For now you could probably just add a column with the corresponding NEDM IDs, and then write a quick script to grab the corresponding Django object, add the N's, and save. Or are you looking for a long-term solution?
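Something along these lines would do it (a sketch only; the model name, import path, and column names are hypothetical, not the actual neuroelectro schema):

```python
import pandas as pd
from neuroelectro.models import NeuronEphysDataMap  # hypothetical import path

# The XLS from above, with an added nedm_id column mapping rows to NEDM IDs.
rows = pd.read_excel("MitralCellNs.xlsx")

for _, row in rows.iterrows():
    # Grab the corresponding Django object, add the N, and save.
    nedm = NeuronEphysDataMap.objects.get(pk=row["nedm_id"])
    nedm.n = int(row["N"])
    nedm.save()
```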

JustasB commented 8 years ago

Just talked to @stripathy. He won't be able to update the records right away, but promised to do so before the conference (I have a reminder set). In the meantime, we can use the data from this cell to build out the NeuronUnit pooling logic (then swap out the cell when the data is ready), and use the data I collected thus far in the XLS to generate our observations.

rgerkin commented 8 years ago

@JustasB https://github.com/scidash/neuronunit/issues/13

JustasB commented 8 years ago

@stripathy @rgerkin

Since apparently I have nothing more fun to do on a Saturday morning, I went through all the papers for the mitral cell again and looked closer into the issues I found earlier.

Most of the issues with the error term were cases where the paper reported mean +/- SD instead of SEM. Most of them go away once I compare the value in the "Extracted Value" column to the computed mean +/- SEM (using the N's I found and the error type in the paper).

I think the confusion for me was that "Extracted Value" sounds like "the raw mean +/- error value one would see in the paper". In reality, it means something like "extracted mean +/- extracted or computed SEM". Maybe consider changing the column name to "Extracted Mean ± SEM (N)"?

There are two remaining issues:

rgerkin commented 8 years ago

@JustasB So basically you are saying almost all of the "Extracted Value" entries are SEMs already (perhaps because Shawn converted them before providing the data)?

JustasB commented 8 years ago

@rgerkin I didn't want to make that claim across the board, but I later checked the API data returned for the mitral cell, and for the MC the err term (if available) was always the SEM.

I don't know if this is by design or just coincidence for MCs.

JustasB commented 8 years ago

@stripathy This is that reminder email we talked about last time we spoke on the phone. You mentioned that you would update the DB with the N's for the mitral cell articles.

The following XLS contains the Ns for each property in the articles: MitralCellNs.xlsx

stripathy commented 8 years ago

Hi @JustasB - I just updated the DB to add the info on N's per mitral cell articles.

I also updated the API to return an 'error_type' field indicating whether errors stored in the database were computationally imputed to be SD or SEM. As you're going through a few articles and checking for SD or SEM info, can you let me know if the imputed error types are off? If so, I can ask my curators to start curating for SD type if that seems necessary.

Is there anything else you need on my end for this feature request?

JustasB commented 8 years ago

Thank you @stripathy . I'm seeing the N's -- they match what I have. I'm also seeing the error_type field.

I went through all the papers again to see what error type the papers reported. I see a few differences between what I found in the papers and what I get from the api.

Just to clarify, if the API returns, for example:

```
"err": 4.9,
"err_norm": 4.9,
"error_type": "sd",
"n": 10
```

Does "sd" mean that: 1) 4.9 is the standard deviation - OR - 2) 4.9 is the SEM, but the paper reported the standard deviation, whose value would be 4.9 * sqrt(10)?

stripathy commented 8 years ago

The "sd" means that the error value of 4.9 encoded in the database is imputed to be a standard deviation. If you find systematic discrepancies with what the API reports please let me know.

JustasB commented 8 years ago

@stripathy Ok, with that clarification the values mostly match what I found in the papers. There are just two more issues that I found:

With these last two guys left, it appears that all the other issues have been resolved.

stripathy commented 8 years ago

Ok - I've updated the data and errors for those two articles. I'm going to close this issue, but feel free to reopen it if anything else comes up. Thanks!

JustasB commented 8 years ago

Everything looks great. Thank you @stripathy for all your help!