sameerkhurana10 / DSOL_rv0.2

deep protein solubility prediction
MIT License

Number of test examples #1

Closed svgsponer closed 5 years ago

svgsponer commented 5 years ago

Hi,

Thanks a lot for the great work. I would like to reuse the data you provide to compare various techniques on the same task. While preparing the dataset I found a small inconsistency, and it would be great if you could briefly clarify it for me.

In your paper, you mention 2001 test examples, which matches the number of examples in test_src_bio, but test_src and test_tgt each contain only 1999 examples. It seems two negative examples are missing, as only 999 are available. Not a big deal, but depending on which examples are missing, the alignment between the biological data and the sequences provided will be skewed.
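A count like the one above can be reproduced with a few lines of Python. This is only a sketch: it assumes each file holds one example per non-empty line and that test_tgt holds one label per line, neither of which is confirmed in this thread. The helper names `read_lines` and `summarize_split` are hypothetical, and the actual file paths and extensions in the repository may differ.

```python
from collections import Counter


def read_lines(path):
    """Read non-empty lines from a file (assumes one example per line)."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def summarize_split(src_lines, tgt_lines, bio_lines):
    """Return the example count per file and the label distribution in tgt."""
    counts = {
        "test_src": len(src_lines),
        "test_tgt": len(tgt_lines),
        "test_src_bio": len(bio_lines),
    }
    # Assumption: every line of test_tgt is a single class label.
    labels = Counter(line.strip() for line in tgt_lines)
    return counts, labels
```

Calling `summarize_split(read_lines(...), read_lines(...), read_lines(...))` on the three test files would show whether the counts agree and how many examples each class has.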

Thanks a lot for your clarification.

sameerkhurana10 commented 5 years ago

thanks.

we can find the two missing sequences for you.

I don't think I understand your comment about "alignment of the biological data and sequences provided will be skewed". Can you clarify?


svgsponer commented 5 years ago

Great, thanks a lot for the fast reaction!

By alignment of the biological data, I mean that if a missing sequence is somewhere in the middle of the test_src file, all following sequences will be off by one from the corresponding line in test_src_bio. Consequently, when I combine the two files, the wrong SCRATCH features are assigned to a sequence.

Just to make sure: does the last column in test_src_bio correspond to the target variable?
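The off-by-one effect can be illustrated with a small sketch. It assumes that the labels in test_tgt and the last column of test_src_bio are meant to agree row by row, which is exactly the point being asked about here. The function name `first_label_mismatch` is hypothetical.

```python
def first_label_mismatch(tgt_labels, bio_labels):
    """Return the first index where the two label columns disagree, or None.

    Rows are compared by position over the shorter of the two inputs.
    """
    for index, (tgt, bio) in enumerate(zip(tgt_labels, bio_labels)):
        if tgt != bio:
            return index
    return None


# Toy data: the feature file has five rows, the sequence file lost row 1.
bio_labels = ["1", "1", "0", "0", "1"]
tgt_labels = ["1", "0", "0", "1"]
print(first_label_mismatch(tgt_labels, bio_labels))
```

Rows before the missing one still line up, so a missing row sits at or before the first mismatch. A shifted label can match by coincidence, so the index is an upper bound on the position, not an exact location.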

sameerkhurana10 commented 5 years ago

Right. I will try to find it. It's been a year. Maybe it's just the first 1999 sequences from src_bio.

Why don't you try running it? The code won't throw an error, because it just takes the first 1999 sequences.
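How the DSOL code pairs sequences with features is not shown in this thread, so the following is only a generic illustration of how positional pairing in Python can truncate silently, next to a variant that fails loudly. Both function names are hypothetical.

```python
def pair_truncating(sequences, features):
    """Pair by position; zip stops at the shorter input without any error."""
    return list(zip(sequences, features))


def pair_checked(sequences, features):
    """Pair by position, but refuse inputs of different lengths."""
    if len(sequences) != len(features):
        raise ValueError(
            f"length mismatch: {len(sequences)} sequences vs {len(features)} feature rows"
        )
    return list(zip(sequences, features))
```

With 1999 sequences and 2001 feature rows, the first function would return 1999 pairs and drop the last two feature rows, while the second would raise a `ValueError`.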

@raghvendra5688 pinging Raghavendra in case he remembers.

raghvendra5688 commented 5 years ago

Hi, I have added the two missing samples and double-checked that there is no problem with the alignment of the biological data.

@svgsponer Which methods do you plan to run? We are writing a continuation paper comparing the latest deep learning methods on the same dataset.

svgsponer commented 5 years ago

Hi,

Great, thanks a lot!

@raghvendra5688 I am currently working on various methods that learn linear models in the unlimited-length k-mer feature space, based on work done for https://github.com/svgsponer/SqLoss. A continuation paper sounds interesting, and I'm curious to see new improvements. What architectures are you planning to try out?

raghvendra5688 commented 5 years ago

We are planning to use GANs and VAEs for the same problem. I will update you on the results when we have a draft ready.