Rachel66666 opened this issue 2 years ago
Sorry for reviving an old thread, but I'd also like to know where these correlation numbers come from with respect to the original paper. It looks like @jstjohn did the correlation analysis here. Would you be able to shed some light on the question?
Let's assume the data are as follows:

batch1: input1, target1
batch2: input2, target2
batch3: input3, target3

The Pearson correlation coefficient in the original TensorFlow version of Enformer is computed as

cor(c(input1, input2, input3), c(target1, target2, target3))

whereas the PyTorch version computes

mean(cor(input1, target1), cor(input2, target2), cor(input3, target3))

I think the second option is reasonable.
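To make the difference concrete, here is a minimal NumPy sketch of the two metrics (my own illustration with made-up data, not the actual Enformer evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)

# three hypothetical batches of (prediction, target) pairs
batches = []
for _ in range(3):
    p = rng.normal(size=100)
    batches.append((p, p + rng.normal(scale=0.5, size=100)))

# TensorFlow-style: one global Pearson r over the concatenated data
preds = np.concatenate([p for p, _ in batches])
targets = np.concatenate([t for _, t in batches])
global_r = np.corrcoef(preds, targets)[0, 1]

# PyTorch-port-style: Pearson r within each batch, then averaged
mean_r = np.mean([np.corrcoef(p, t)[0, 1] for p, t in batches])

print(global_r, mean_r)  # the two estimates generally differ
```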
Why is the second better? Here each input and target is a different batch along the sequence axis. Taking the global correlation of points is a correlation metric, while the mean of subsequence correlations (with arbitrary cut points, even) is something else that needs more justification in my opinion.
Sorry, I think you're right. Calculating the correlation per batch and then taking the average is not a good idea, as it ignores the global distribution of the data.
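As a toy illustration of what gets lost (made-up numbers, not Enformer data): below, predictions and targets are uncorrelated within every batch, yet the batch-level means move together, so the global correlation is high while the per-batch average sits near zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# both variables share a batch-level offset (0, 5, 10) but are
# otherwise independent noise, i.e. no within-batch correlation
batches = [(off + rng.normal(size=50), off + rng.normal(size=50))
           for off in (0.0, 5.0, 10.0)]

per_batch_r = np.mean([np.corrcoef(p, t)[0, 1] for p, t in batches])
global_r = np.corrcoef(np.concatenate([p for p, _ in batches]),
                       np.concatenate([t for _, t in batches]))[0, 1]

print(per_batch_r, global_r)  # roughly 0 vs roughly 0.9
```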
Interesting. The thing that seemed off was that your proposal split over arbitrary cut points. Taking the mean after splitting on a nuisance variable, on the other hand, could make a ton of sense; that could help control for confounding, for example. You could cut the data up by something you don't want included in your correlation measurement: chromosome boundaries, say, or GC-percent buckets, or some other feature you think is a nuisance variable that is not biologically meaningful. Then you could calculate the correlation within each group and average. Smaller groups, though, would be noisier, which would make real signals harder to detect. Just some other thoughts!
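If someone wanted to try the grouped metric described above, a sketch might look like the following; the chromosome labels and data are made up, and it only illustrates the "split on a nuisance variable, correlate within groups, then average" idea:

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical per-bin predictions/targets with a chromosome label per bin
chroms = np.repeat(["chr1", "chr2", "chr3"], 200)
preds = rng.normal(size=chroms.size)
targets = preds + rng.normal(scale=0.7, size=chroms.size)

# correlate within each chromosome, then average across chromosomes,
# so chromosome-level offsets can't inflate (or deflate) the metric
within = [np.corrcoef(preds[chroms == c], targets[chroms == c])[0, 1]
          for c in np.unique(chroms)]
grouped_r = np.mean(within)

print(grouped_r)
```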
Thanks! That helps a lot!
Hello, can I ask how you found that the human Pearson R is 0.625 for validation and 0.65 for test? I couldn't find this information in the paper. Is it recorded anywhere else?