lucidrains / enformer-pytorch

Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch

metric for enformer #9

Open Rachel66666 opened 2 years ago

Rachel66666 commented 2 years ago

Hello, can I ask how you found that the human Pearson R is 0.625 for validation and 0.65 for test? I couldn't find this information in the paper. Is it recorded anywhere else?

rnsherpa commented 1 year ago

Sorry for reviving an old thread, but I'd also like to know where these correlation numbers come from with respect to the original paper. It looks like @jstjohn did the correlation analysis here. Would you be able to shed some light on the question?

biginfor commented 2 months ago

Let's assume the data are as follows:

```
batch1 : input1, target1
batch2 : input2, target2
batch3 : input3, target3
```

The original TensorFlow version of Enformer computes the Pearson correlation coefficient globally, over all batches at once:

```
cor(c(input1, input2, input3), c(target1, target2, target3))
```

This PyTorch version computes it per batch and then averages:

```
mean(cor(input1, target1), cor(input2, target2), cor(input3, target3))
```

I think the second option is reasonable.
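
For concreteness, here is a minimal PyTorch sketch of the two metrics (the names `preds`, `targets`, and the function names are illustrative, not the repo's actual API):

```python
import torch

def global_pearson(preds, targets):
    # TF-style metric: pool every point across all batches,
    # then compute a single Pearson R.
    x = torch.cat([p.flatten() for p in preds])
    y = torch.cat([t.flatten() for t in targets])
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm())

def mean_batch_pearson(preds, targets):
    # Per-batch metric: compute Pearson R within each batch,
    # then average the coefficients.
    rs = []
    for p, t in zip(preds, targets):
        x = p.flatten() - p.flatten().mean()
        y = t.flatten() - t.flatten().mean()
        rs.append((x * y).sum() / (x.norm() * y.norm()))
    return torch.stack(rs).mean()
```

The two only agree when every batch shares the same mean and scale; otherwise pooling and averaging can diverge substantially.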

jstjohn commented 2 months ago

Why is the second better? Here each input and target is a different batch along the sequence axis. Taking the global correlation of points is a correlation metric, while the mean of subsequence correlations (with arbitrary cut points, even) is something else that needs more justification, in my opinion.

biginfor commented 2 months ago

Sorry, I think you're right. Calculating the correlation per batch and then taking the average is not a good idea, as it ignores the global distribution of the data.


jstjohn commented 2 months ago

Interesting. The thing that seemed off to me was that your proposal split over arbitrary cut points. Taking the mean after splitting on a nuisance variable, on the other hand, could make a ton of sense; that could help control for confounding, for example. You could cut the data up by something you don't want included in your correlation measurement: chromosome boundaries, for example, or GC-percent bucket, or some other feature you think is a nuisance variable that is not biologically meaningful. Then you could calculate the correlation within each group and average. Smaller groups might be noisier, though, which would make real signals harder to detect. Just some other thoughts!
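
A minimal sketch of that grouped variant, assuming flattened 1D `preds`/`targets` tensors and a per-position integer `groups` id such as a chromosome index (all names here are hypothetical):

```python
import torch

def grouped_pearson(preds, targets, groups):
    # Split on a nuisance variable (e.g. chromosome), compute Pearson R
    # within each group, then average the per-group coefficients.
    rs = []
    for g in groups.unique():
        mask = groups == g
        x = preds[mask] - preds[mask].mean()
        y = targets[mask] - targets[mask].mean()
        rs.append((x * y).sum() / (x.norm() * y.norm()))
    # Caveat from above: small groups yield noisy per-group estimates.
    return torch.stack(rs).mean()
```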


biginfor commented 2 months ago

Thanks! That helps a lot!