trendscenter / coinstac

Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation
MIT License
42 stars 19 forks source link

Computation output #108

Closed swashcap closed 7 years ago

swashcap commented 7 years ago

Support basic computation output that’s a little nicer than JSON in a pre:

computation-output

This could be a simple table or key-value list.

(Related to #12.)

swashcap commented 7 years ago

Feedback on this from our 2016-02-03 demo:

From single-shot (and potentially multi-shot)?

swashcap commented 7 years ago

cea22e43800609eb85f754028266022ae83a157b takes care of the labeling:

updated-table

everner-ds commented 7 years ago

Looks great. Remember to add a Beta vector line to each site so that the user can see the values of the weights at each site.

swashcap commented 7 years ago

Okay, here’s the single-shot results table:

single-result

…and here’s the associated single-shot computation’s data:

{
   "betaVectorLocal": [
       [
           261.7073126832788,
           0.14687494407149815,
           19.685582697995343
       ]
   ],
   "averageBetaVector": [
       261.7073126832788,
       0.14687494407149815,
       19.685582697995343
   ],
   "rSquaredLocalOriginal": [
       0.01605226498320167
   ],
   "tValueLocalOriginal": [
       [
           4.424540787810683,
           0.10741026556247937,
           1.061973364925653
       ]
   ],
   "pValueLocalOriginal": [
       [
           0.00003524515412323126,
           0.9147752278265422,
           0.2919506542159831
       ]
   ],
   "rSquaredLocal": [
       0.01605226498320167
   ],
   "tValueLocal": [
       [
           4.424540787810683,
           0.10741026556247937,
           1.061973364925653
       ]
   ],
   "pValueLocal": [
       [
           0.00003524515412323126,
           0.9147752278265422,
           0.2919506542159831
       ]
   ],
   "rSquaredGlobal": 0.7629053334438827,
   "tValueGlobal": [
       4.424540787810683,
       0.10741026556247937,
       1.061973364925653
   ],
   "pValueGlobal": [
       0.00003524515412323126,
       0.9147752278265422,
       0.2919506542159831
   ],
   "xLabel": [
       "age",
       "isControl"
   ],
   "complete": true
}

Here’s the multi-shot results table:

multi-results

…and the multi-shot computation data:

{
   "betaVectorLocal": [
       [
           376.7895665495135,
           15698.014223014823,
           610652.954774585
       ],
       [
           453.1091445205883,
           9424.602361166473,
           592595.7843176883
       ]
   ],
   "currW": [
       155.48927623020555,
       154.94252698528325,
       155.39848849955436
   ],
   "rSquaredLocalOriginal": [
       0.07059482142962326,
       0.0396424116937778
   ],
   "tValueLocalOriginal": [
       [
           0.6699176055249315,
           2.0588964975034507,
           25.099857426122632
       ],
       [
           0.939335883163565,
           1.4079137258345895,
           28.245665781758714
       ]
   ],
   "pValueLocalOriginal": [
       [
           0.5051461401566619,
           0.04328119776872974,
           0
       ],
       [
           0.3507884026875574,
           0.1635822472584425,
           0
       ]
   ],
   "rSquaredLocal": [
       -92.16542336957824,
       -113.33224649601522
   ],
   "tValueLocal": [
       [
           0.027612016729228778,
           0.0020297174913335163,
           0.0006379677791797525
       ],
       [
           0.02954273137604082,
           0.002121366760148516,
           0.0006788474114907067
       ]
   ],
   "pValueLocal": [
       [
           0.9780513235240982,
           0.9983864020082835,
           0.9994928408205808
       ],
       [
           0.9765158297350187,
           0.9983134573317571,
           0.999460317205159
       ]
   ],
   "rSquaredGlobal": -99.86808991223616,
   "tValueGlobal": [
       0.040945852625448274,
       0.0029742635006051606,
       0.0009433193185826931
   ],
   "pValueGlobal": [
       0.9673965559436446,
       0.9976310291164383,
       0.9992486338235937
   ],
   "complete": true
}

@everner-ds @JingMing623 @pliz Do the tables look correct? Are we missing anything (per-site beta vectors)? Can we remove unneeded data from the computations’ results?

everner-ds commented 7 years ago

-For R^2 and the p-values, would it be possible to only use scientific notation if their values are below a certain threshold (say 0.01)? The use of scientific notation for values >0.01 makes the values harder to read for me.

-For single-shot, why is global R^2 != local R^2 if there is only one site? It looks like all the other parameters are equal.

-For both single-shot and multi-shot, add another line to the table for each site that is called "B Vector", as in the global table. Populate it with the data from betaVectorLocal.

-The multi-shot table also needs a "B vector" line on the global table.

-For multi-shot, why are the R^2 values so bad? Is this a problem computing R^2 or are the fits really this bad? I've never seen an R^2 of around -100.

-For multi-shot, what are all the *LocalOriginal variables? What do they mean? If there is no need to include them in the table, then I don't see the need to include them in the data.

swashcap commented 7 years ago

Let’s look at each property:

@JingMing623 Can we add averageBetaVector to multi-shot? Can we drop some of these unused properties?

JingMing623 commented 7 years ago

Of course we should drop some properties:

Let’s look at each property:

  • betaVectorLocal Array[]: Collection of beta vectors for each site ( this is betaVector from each site's local data, may drop off)
  • currW number[] (only multi-shot): let's change currW into averageBetaVector for multi-shot too
  • averageBetaVector number[] (only single-shot): Site-wide beta vector
  • rSquaredLocalOriginal Array[]: r^2 for site local betaVector, can drop
  • tValueLocalOriginal Array[]: t for site local betaVector, can drop
  • pValueLocalOriginal Array[]: p for local betaVector, can drop
  • rSquaredLocal number[]: Collection of R-squared values for each site
  • tValueLocal Array[]: Collection of T values for each site?
  • pValueLocal Array[]: Collection of P values for each site?
  • rSquaredGlobal number: Site-wide R-squared
  • tValueGlobal number[]: Site-wide T value
  • pValueGlobal number[]: Site-wide P value
  • xLabel Object (only single-shot): Collection of camelCase covariates. We can safely drop this as the UI.
  • complete boolean: Required by COINSTAC internals to mark the computation as finished

@JingMing623 https://github.com/JingMing623 Can we add averageBetaVector to multi-shot? Can we drop some of these unused properties?

Yes, I think we can make averageBetaVector as a consistent name for both regression.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278798833, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WODzb7owshnqGlLarpp_bqHvACftCks5ra5ZOgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

JingMing623 commented 7 years ago

And I have a question due to single-shot. It seems there is only 1 site, but why we get two set of betaVectorLocal? it is for two freesurfer ROIs?

And most of the stats output here, for example the negative r^2 are expected. I think you are using the thalamus analysis data, correct? these data have very bad fits.

Jing

On Thu, Feb 9, 2017 at 3:50 PM, Jing Ming jming@mrn.org wrote:

Of course we should drop some properties:

Let’s look at each property:

  • betaVectorLocal Array[]: Collection of beta vectors for each site ( this is betaVector from each site's local data, may drop off)
  • currW number[] (only multi-shot): let's change currW into averageBetaVector for multi-shot too
  • averageBetaVector number[] (only single-shot): Site-wide beta vector
  • rSquaredLocalOriginal Array[]: r^2 for site local betaVector, can drop
  • tValueLocalOriginal Array[]: t for site local betaVector, can drop
  • pValueLocalOriginal Array[]: p for local betaVector, can drop
  • rSquaredLocal number[]: Collection of R-squared values for each site
  • tValueLocal Array[]: Collection of T values for each site?
  • pValueLocal Array[]: Collection of P values for each site?
  • rSquaredGlobal number: Site-wide R-squared
  • tValueGlobal number[]: Site-wide T value
  • pValueGlobal number[]: Site-wide P value
  • xLabel Object (only single-shot): Collection of camelCase covariates. We can safely drop this as the UI.
  • complete boolean: Required by COINSTAC internals to mark the computation as finished

@JingMing623 https://github.com/JingMing623 Can we add averageBetaVector to multi-shot? Can we drop some of these unused properties?

Yes, I think we can make averageBetaVector as a consistent name for both regression.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278798833, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WODzb7owshnqGlLarpp_bqHvACftCks5ra5ZOgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

everner-ds commented 7 years ago

Should we call it betaVectorGlobal instead of averageBetaVector? Is there a possibility that the global beta vector is not an average of the local beta vectors? Even if it will always be the average, betaVectorGlobal seems like a better name because it matches the naming scheme used for the other variables (Local vs. Global), making it easier to understand what it means and where it should go on the table.

everner-ds commented 7 years ago

Jing, you might be seeing the old page with the bad screenshot. Try to refresh the page.

JingMing623 commented 7 years ago

Good idea, betaVectorGlobal sounds good. So far, for these two regression, the global beta vector are all generated somehow from averaging the local betaVectors.

On Thu, Feb 9, 2017 at 3:55 PM, everner-ds notifications@github.com wrote:

Should we call it betaVectorGlobal instead of averageBetaVector? Is there a possibility that the global beta vector is not an average of the local beta vectors?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278801871, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOFYLf1qIK_UQOiS6AQbrIHUVCIrRks5ra5lJgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

swashcap commented 7 years ago

It seems there is only 1 site, but why we get two set of betaVectorLocal?

There’s only one set for single-shot:

{
   "betaVectorLocal": [
       [
           261.7073126832788,
           0.14687494407149815,
           19.685582697995343
       ]
   ],
   /* ... */
}

These results come from the thalamus-analysis dataset, which seems to produce undesirable results.

everner-ds commented 7 years ago

We need to keep this one because it stores the local beta vectors for each local site, right?

JingMing623 commented 7 years ago

Okay. Thanks. Yeah, I am looking at the old page....

On Thu, Feb 9, 2017 at 3:58 PM, Cory Reed notifications@github.com wrote:

It seems there is only 1 site, but why we get two set of betaVectorLocal?

There’s only one set for single-shot:

{ "betaVectorLocal": [ [ 261.7073126832788, 0.14687494407149815, 19.685582697995343 ] ], / ... / }

These results come from the thalamus-analysis dataset, which seems to produce undesirable results.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278802589, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOCiOb19y4A504O31wIbiOHsHCpRJks5ra5oSgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

JingMing623 commented 7 years ago

I thought we should keep betaVectorLocal for comparison purpose. However, based on my talk with Sergey and Vince on the demo day, it seems only betaVectorGlobal is desired result to show, and only T values and P values associated with betaVectorGlobal for each site's local data should be displayed.

On Thu, Feb 9, 2017 at 4:00 PM, everner-ds notifications@github.com wrote:

  • betaVectorLocal Array[]: Collection of beta vectors for each site ( this is betaVector from each site's local data, may drop off)

We need to keep this one because it stores the local beta vectors for each local site, right?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278803039, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOKEnx5mTbDBqJD1iWZPgXcVZFoJUks5ra5qQgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

everner-ds commented 7 years ago

Really? That wasn't my understanding. Let's ask the Slack group.

JingMing623 commented 7 years ago

Yeah, that's a little confusing, and the reason for not showing the local fit stats and betaVector is that these local fits stats most probably will be better than the global fits. And the meaning of using mult-site data may not be finding a better fit, but make the fit more general, and avoid over fitting, if my understands right. But sure, Eric, you can ask on slack group for confirmation.

On Thu, Feb 9, 2017 at 4:06 PM, everner-ds notifications@github.com wrote:

Really? That wasn't my understanding. Let's ask the Slack group.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278804422, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOHa10pFWEzw_pugnq5eOz4S4Zpbpks5ra5wPgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

JingMing623 commented 7 years ago

Okay, I see the updated single-shot page, and yes, I find the local r^2 and global r^2 are not consistent, I need to check the code.

sergeyplis commented 7 years ago

Vince wants a different approach, but here is my opinion and its justification

I believe in factoring and incapsulation on conceptual level. They improve perception and interpretation. Let's go unix-way: do one thing at a time and do it well. Guided by this principle and based on the fact that both multi-shot and single-shot regressions are there to compute the betas for the overall virtual dataset, I request only doing this one thing.

Let's not confuse the user with simultaneous computation and display of 2 completely unrelated results: regression on virtually pooled dataset and local regressions. Instead let's have 2 algorithms in the toolbox: one for each. The local regression report will not display anything global, but will:

  1. display betas and their corresponding Ts and ps at each site.

While the multi- and single-shot regressions will display:

  1. Global values of betas, their goodness of fit to the overall dataset (R^2), Ts and ps.
  2. For each site they will display Ts and ps for the global model fit to the local results
JingMing623 commented 7 years ago

Sergey, I do agree displaying two sets of Ws, Ps and Ts for each site a little bit confusing. But if I am the researcher, I probably will also want the local regression results too, to see how far the global data distort local regression : D

everner-ds commented 7 years ago

Now that I think about it, I don't understand the meaning of betaVectorLocal for multi-shot. According to this algorithm, there are no local beta vectors computed in multi-shot. Where do the betaVectorLocal values come from?

JingMing623 commented 7 years ago

betaVectorLocal is just betaVector. But when I pass all variables to the remote node on the last run, I change the name betaVector to betaVectorLocal for clarify

On Thu, Feb 9, 2017 at 4:59 PM, everner-ds notifications@github.com wrote:

Now that I think about it, I don't understand the meaning of betaVectorLocal for multi-shot. According to this algorithm http://www.frontiersin.org/files/Articles/204805/fnins-10-00365-HTML/image_m/fnins-10-00365-t003.jpg, there are no local beta vectors computed in multi-shot. Where do the betaVectorLocal values come from?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-278816054, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOAh_2ldltDCVWCjJvPhbCbsOo-Exks5ra6hMgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

everner-ds commented 7 years ago

But currW is the betaVectorGlobal, right? Why are there multiple values of betaVectorLocal if they are just copies of betaVectorGlobal?

JingMing623 commented 7 years ago

Eric, betaVectorLocal is not for multi-shot regression calculation, it is from a completely separate local ridge regression

everner-ds commented 7 years ago

Okay, thanks for clarifying. So it should not exist in the multi-shot data, which is what you said earlier.

JingMing623 commented 7 years ago

just like run oneshot for the local data and get the betaVector. The reason to calculate betaVectorLocal is for comparison purpose

everner-ds commented 7 years ago

Also, this is kind of late in the game, but why not include the Standard Error (SE) of the beta vector in the table? It is usually included in regression results in statistics packages.

everner-ds commented 7 years ago

OK, it's now my understanding that the results for each site are the goodness of fit of the global model to the local data. If that is the case, then I do not think there should be any beta vector shown on the local site. Sorry for the confusion. However, if this is the case, then there should be some short explanation on the results table of what these values mean.

I don't really understand what the meaning of the p-values and t-values is for the individual sites though. Maybe someone can explain that to me later.

JingMing623 commented 7 years ago

Eric, it is very natural to get confused by the results table :) And I think it is good to discuss it out. The p-value and t-value for each site are also calculated by global beta and local data, indicating the local covariate significance based on the global fit ( averageBetaVector).

swashcap commented 7 years ago

Guided by this principle and based on the fact that both multi-shot and single-shot regressions are there to compute the betas for the overall virtual dataset, I request only doing this one thing.

So, only show “global” betas, Ps and Ts for the run? Don’t bother with per-site values?

While the multi- and single-shot regressions will display:

  1. Global values of betas, their goodness of fit to the overall dataset (R^2), Ts and ps.
  2. For each site they will display Ts and ps for the global model fit to the local results

This is exactly what we’re doing right now, correct? I’m struggling to understand your proposition.

Why are there multiple values of betaVectorLocal if they are just copies of betaVectorGlobal?

I understood betaVectorLocal contained an array of arrays. betaVectorLocal’s nested arrays corresponded with sites’ beta vectors by index, as just like tValueLocal’s nested arrays. Is this incorrect?

We should take a step back and consider what our users want to see. What computation run data is important to them? It may also be nice to add descriptions to the table explaining what each value actually means.

JingMing623 commented 7 years ago

I think the agreement so far is to displaying:

  1. Global values of betas, their goodness of fit to the overall dataset (R^2), Ts and ps.
  2. For each site they will display Ts and ps for the global model fit to the local results

    Just as you did in the previous demo.

And the possible adding will be:

  1. Add the Global values of betas for each site as well, on top of Ts and ps, for a quick refreshing of betas which associated with these Ts and ps

Frankly, I am not feel very comfortable with this output format. But maybe we should use it for next week's Demo. Let's discuss it again during the meeting.

And for betaVectorLocal, you are correct, it is a two dimensional matrix, with dimension 1 represent to site index and dimension 2 represent to intercept+covariate order.

Also, I have fix the bug for r^2, now the global r^2 and local r^2 show same result for one site. It turns out to be a typo when calculate globalmean. Should I open a new branch and put code to?

Jing

On Fri, Feb 10, 2017 at 11:16 AM, Cory Reed notifications@github.com wrote:

Guided by this principle and based on the fact that both multi-shot and single-shot regressions are there to compute the betas for the overall virtual dataset, I request only doing this one thing.

So, only show “global” betas, Ps and Ts for the run? Don’t bother with per-site values?

While the multi- and single-shot regressions will display:

  1. Global values of betas, their goodness of fit to the overall dataset (R^2), Ts and ps.
  2. For each site they will display Ts and ps for the global model fit to the local results

This is exactly what we’re doing right now, correct? I’m struggling to understand your proposition.

Why are there multiple values of betaVectorLocal if they are just copies of betaVectorGlobal?

I understood betaVectorLocal contained an array of arrays. betaVectorLocal’s nested arrays corresponded with sites’ beta vectors by index, as just like tValueLocal’s nested arrays. Is this incorrect?

We should take a step back and consider what our users want to see. What computation run data is important to them? It may also be nice to add descriptions to the table explaining what each value actually means.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-279022269, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOM_gBqrOli2Y2xTgHwLlIOem4opdks5rbKlngaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

swashcap commented 7 years ago

@JingMing623 Yes, please open a branch. I’ll merge and bump the computation’s version within COINSTAC.

JingMing623 commented 7 years ago

@swashcap Okay, I have pushed the change to rSquaredfix, please have a check. Only add four letter: .data. Thanks.

JingMing623 commented 7 years ago

Hi All,

 To wrap up our discussion yesterday, I modified our previous

statistic_output_format table.

 [image: Inline image 1]

 Please have a look and let me know if you have any question about it.

I also attach the file here FYI. Thanks.

Jing

On Fri, Feb 10, 2017 at 12:17 PM, Cory Reed notifications@github.com wrote:

@JingMing623 https://github.com/JingMing623 Yes, please open a branch. I’ll merge and bump the computation’s version within COINSTAC.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-279039016, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WOBcEIBTHJwyGlLuw-XLpw4Tj9pW7ks5rbLetgaJpZM4K6Q0s .

-- Jing Ming Postdoctoral Research Fellow The Mind Research Network Lovelace Respiratory Research Institute

swashcap commented 7 years ago

Uploading @JingMing623’s table as the email attachment didn’t make it to GitHub’s UI.

image

Based on @everner-ds’s and @pliz’s comments I’m not convinced this table meets the requirements. Thoughts?

swashcap commented 7 years ago

For posterity, here’s @JingMing623’s explanation of *LocalOriginal versus *Local variables from Slack:

tValueLocal is t value using global fit beta to local site data, tValueLocalOriginal is t value using local fit beta to local site data.

JingMing623 commented 7 years ago

Actually, Sergey has different opinion on this issue, he would like to show tValueLocal on local site, just like you did now. But Vince and Eric both think showing tValueLocalOriginal maybe more simpler and not confuse users.

On Tuesday, February 14, 2017, Cory Reed notifications@github.com wrote:

For posterity, here’s @JingMing623 https://github.com/JingMing623’s explanation of LocalOriginal versus Local variables from Slack:

tValueLocal is t value using global fit beta to local site data, tValueLocalOriginal is t value using local fit beta to local site data.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MRN-Code/coinstac/issues/108#issuecomment-279871077, or mute the thread https://github.com/notifications/unsubscribe-auth/AV1WONwNyfrMu5kwzDguoGGgt5-cL3naks5rcjnigaJpZM4K6Q0s .

swashcap commented 7 years ago

0a4cfa1b48f7c499922f88d96dd96fd1385420a7 added beta value rows for each site. 3f55df52bb61d42949b3ae59a9372db8df79135b addressed care of the scientific notation formatting.

swashcap commented 7 years ago

147 addressed the incorrect local R-squared problem we discovered during demos on 2017-02-15.

swashcap commented 7 years ago

Our demo runs on 2017-03-22 revealed the “Not a Number” problem was likely due to a bad dataset. Closing the issue.