hansenlab / mpra

Spike-in controls #1

Closed ndtippens closed 2 months ago

ndtippens commented 6 years ago

This isn't really a package issue, so much as a technical question.

First of all, thanks for the great package! I've got it working pretty smoothly on some datasets we've generated recently. However, I've noticed that the model seems to assume the average element activity defines the background distribution, and we have some experiments where this is not the case. To address this, we included spike-in sequences as a control, and I would like to know if it's possible to calibrate the linear model using these spike-ins before considering the larger dataset?

Thanks in advance, Nate

lmyint commented 6 years ago

Hi Nate, thanks for your message!

There are element-specific linear models, so there are element-specific background activities (these are the mean activity measures for the control/reference group). We certainly expect different baseline/background activity for the different elements, and this is indeed accounted for in the model since the inferences obtained at the end are for log-fold changes between groups.
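To make the idea of an element-specific model concrete, here is a minimal sketch in plain Python rather than the package's R code. The function name and the toy numbers are hypothetical. For a single element, regressing log-ratio activity on a 0/1 group indicator gives an intercept equal to the reference-group mean (the "background" for that element) and a slope equal to the log-fold change between groups. It leaves out the weighting and variance modeling that a real analysis would need.

```python
from statistics import mean

def fit_two_group_model(activity, group):
    """Least-squares fit of activity ~ intercept + slope * group for one element.

    With a 0/1 group indicator the closed-form solution is simple:
    the intercept is the mean activity of the reference group (group == 0)
    and the slope is the difference in group means, i.e. the log-fold
    change when activity is on a log scale.
    """
    ref = [a for a, g in zip(activity, group) if g == 0]
    alt = [a for a, g in zip(activity, group) if g == 1]
    intercept = mean(ref)
    slope = mean(alt) - mean(ref)
    return intercept, slope

# Hypothetical log-ratio activities: three reference and three comparison samples.
group = [0, 0, 0, 1, 1, 1]
activities = {
    "elem_A": [1.0, 1.2, 1.1, 2.0, 2.2, 2.1],      # higher baseline, clear change
    "elem_B": [-0.5, -0.4, -0.6, -0.5, -0.4, -0.6],  # lower baseline, no change
}

# Each element gets its own fit, hence its own baseline and log-fold change.
fits = {name: fit_two_group_model(vals, group) for name, vals in activities.items()}
for name, (intercept, slope) in fits.items():
    print(f"{name}: baseline={intercept:.2f}, log-fold change={slope:.2f}")
```

The two elements end up with different baselines, and only the slope is tested, which is why differing background activity between elements does not by itself distort the comparison.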

Are you using spike-in sequences to calibrate a threshold for deciding which elements are active at all? That would involve filtering out rows of the count matrices whose activity measures fall below a threshold defined by your spike-ins.
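A rough sketch of that kind of threshold filtering, in plain Python rather than with the package itself. Everything here is an assumption for illustration: the helper names, the pseudocount of 1, the choice of the 95th percentile of spike-in activity as the cutoff, and the toy counts (one aggregated RNA and DNA count per element).

```python
import math

def log_ratio(rna, dna, pseudocount=1.0):
    """Activity as log2 of RNA over DNA counts (pseudocount is an assumption)."""
    return math.log2((rna + pseudocount) / (dna + pseudocount))

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def filter_active(rna_counts, dna_counts, spike_ids, q=95):
    """Keep non-spike-in elements whose activity exceeds the spike-in cutoff."""
    activity = {e: log_ratio(rna_counts[e], dna_counts[e]) for e in rna_counts}
    threshold = percentile([activity[s] for s in spike_ids], q)
    active = {
        e: a for e, a in activity.items()
        if e not in spike_ids and a > threshold
    }
    return active, threshold

# Hypothetical counts: three spike-ins and two test elements.
rna = {"spike1": 10, "spike2": 12, "spike3": 8, "elemA": 100, "elemB": 9}
dna = {"spike1": 15, "spike2": 15, "spike3": 15, "elemA": 15, "elemB": 15}
spikes = {"spike1", "spike2", "spike3"}

active, threshold = filter_active(rna, dna, spikes)
print(sorted(active), round(threshold, 3))
```

The rows that survive the filter are the ones you would keep in the count matrices before running the usual pipeline. The cutoff percentile is a tuning choice, not something the thread specifies.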

-- Leslie Myint PhD candidate - Biostatistics Johns Hopkins Bloomberg School of Public Health

ndtippens commented 6 years ago

Thanks for the quick reply!

Yes, I should have been a little clearer. We are not comparing treatment conditions, but rather trying to screen inactive vs. active sequences. We have a population of "negative sequences" that we would like to use as an estimate of expression and variation among inactive elements. Then we'd like to decide whether each element in the test set is significantly more or less active than these negative sequences.

Nate

lmyint commented 6 years ago

Ok I understand. The methods in this package were developed with the intention to compare the activity of each element in one or more conditions, a little different than what you're describing.

One way to do what you're describing within our framework would be to create an artificial comparison condition by appending the counts for the spike-ins as extra columns in the DNA and RNA count matrices. These extra columns would be the same for every row (element). Then you would proceed with the pipeline as normal, as if you were doing a standard differential analysis. This might be computationally intensive: if you have, say, C spike-ins in each of S samples, this adds C × S columns to the count matrices, and you would have to use "corr_groups" for the model_type in mpralm(). There might also be strange behavior in the variance-stabilization part of the algorithm. I haven't tried this, though, so I can't say for sure.
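The matrix construction described above can be sketched as follows, using plain Python lists in place of the package's data structures. The function name, the group labels, and the per-column sample index are all hypothetical. The sample index is only my guess at the kind of grouping information a correlated-groups model would need, and the sketch stops before any model fitting.

```python
def append_spike_columns(element_counts, spike_counts):
    """Append spike-in counts as extra columns shared by every element row.

    element_counts: E rows (one per element), each with S per-sample counts.
    spike_counts:   C rows (one per spike-in), each with S per-sample counts.
    Returns the augmented matrix plus, for each column, a group label and
    the index of the sample that column came from.
    """
    n_samples = len(element_counts[0])
    # Flatten the C x S spike-in block into C * S extra columns.
    flat_spikes = [count for row in spike_counts for count in row]
    augmented = [list(row) + flat_spikes for row in element_counts]
    groups = ["element"] * n_samples + ["spike_in"] * len(flat_spikes)
    samples = list(range(n_samples)) * (1 + len(spike_counts))
    return augmented, groups, samples

# Hypothetical RNA counts: E = 3 elements, S = 2 samples, C = 3 spike-ins.
rna_elements = [[5, 6], [7, 8], [1, 2]]
rna_spikes = [[10, 11], [12, 13], [14, 15]]

rna_aug, groups, samples = append_spike_columns(rna_elements, rna_spikes)
print(len(rna_aug[0]))  # S + C * S columns per row
print(groups)
print(samples)
```

The same function would be applied to the DNA counts so both matrices keep matching columns. Even in this toy case the width grows from 2 to 8 columns, which shows how quickly the C × S term dominates.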

Another way to do what you're describing would be to do manual filtering by running your own comparison tests. You could use compute_logratio() to compute the activity measures for all elements and then iterate over your elements to test each one for a difference against the spike-ins.
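One possible shape for such a manual test, sketched in plain Python. The `log_ratio` helper is a stand-in for whatever activity measure you compute and is not the package's compute_logratio(). The two-sided empirical p-value against the spike-in distribution is my own choice of test, and the counts are made up.

```python
import math
from statistics import mean

def log_ratio(rna, dna, pseudocount=1.0):
    """Stand-in activity measure: log2 RNA/DNA with an assumed pseudocount."""
    return math.log2((rna + pseudocount) / (dna + pseudocount))

def mean_activity(rna_row, dna_row):
    """Average the per-sample log ratios for one element."""
    return mean(log_ratio(r, d) for r, d in zip(rna_row, dna_row))

def empirical_p(value, null_values):
    """Two-sided empirical p-value of `value` against the spike-in activities.

    The +1 terms keep the p-value away from zero; with C spike-ins the
    smallest attainable value is 2 / (C + 1).
    """
    n = len(null_values)
    p_high = (1 + sum(v >= value for v in null_values)) / (n + 1)
    p_low = (1 + sum(v <= value for v in null_values)) / (n + 1)
    return min(1.0, 2 * min(p_high, p_low))

# Hypothetical spike-in activities (already averaged over samples).
spike_activity = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15, -0.15]

# Hypothetical test elements: per-sample (RNA, DNA) counts.
elements = {
    "strong": ([127, 127], [15, 15]),   # activity log2(128/16) = 3
    "neutral": ([15, 15], [15, 15]),    # activity log2(16/16) = 0
}

p_values = {
    name: empirical_p(mean_activity(rna, dna), spike_activity)
    for name, (rna, dna) in elements.items()
}
print(p_values)
```

Because the resolution of an empirical p-value is limited by the number of spike-ins, this only becomes useful with a reasonably large negative-control set, and the resulting p-values would still need a multiple-testing correction across elements.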
