jiaqingxie / Machine-Learning-in-Genomes

MIT Summer Research. Supervised by Manolis Kellis
MIT License

definition of features #3

Open rgsatish opened 2 years ago

rgsatish commented 2 years ago

Hello,

I am trying to use your algorithm. Could you expand on the features:

ELL_LINE_NAME DRUG_ID LN_IC50 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 ge-1 ge0 ge1 ge2 ge3 ge4 ge5 ge6 ge7 ge8 ge9 ge10 ge11 ge12 ge13 ge14 ge15 ge16 ge17 ge18 ge19 ge20 ge21 ge22 ge23 ge24 ge25 ge26 ge27 ge28 ge29 ge30 ge31 ge32 ge33 ge34 ge35 ge36 ge37 ge38 ge39 ge40 ge41 ge42 ge43 ge44 ge45 ge46 ge47 ge48 ge49 ge50 ge51 ge52 ge53 ge54 ge55 ge56 ge57 ge58 ge59 ge60 ge61 ge62 ge63 ge64 ge65 ge66 ge67 ge68 ge69 ge70 ge71 ge72 ge73 ge74 ge75 ge76 ge77 ge78 ge79 ge80 ge81 ge82 ge83 ge84 ge85 ge86 ge87 ge88 ge89 ge90 ge91 ge92 ge93 ge94 ge95 ge96 ge97 ge98 ge99 ge100 ge101 ge102 ge103 ge104 ge105 ge106 ge107 ge108 ge109 ge110 ge111 ge112 ge113 ge114 ge115 ge116 ge117 ge118 ge119 ge120 ge121 ge122 ge123 ge124 ge125 ge126 ge127 ge128 ge129 ge130 ge131 ge132 ge133 ge134 ge135 ge136 ge137 ge138 ge139 ge140 ge141 ge142 ge143 ge144 ge145 ge146 ge147 ge148 ge149 ge150 ge151 ge152 ge153 ge154 ge155 ge156 ge157 ge158 ge159 ge160 ge161 ge162 ge163 ge164 ge165 ge166 ge167 ge168 ge169 ge170 ge171 ge172 ge173 ge174 ge175 ge176 ge177 ge178 ge179 ge180 ge181 ge182 ge183 ge184 ge185 ge186 ge187 ge188 ge189 ge190 ge191 ge192 ge193 ge194 ge195 ge196 ge197 ge198 ge199 ge200 ge201 ge202 ge203 ge204 ge205 ge206 ge207 ge208 ge209 ge210 ge211 ge212 ge213 ge214 ge215 ge216 ge217 ge218 ge219 ge220 ge221 ge222 ge223 ge224 ge225 ge226 ge227 ge228 ge229 ge230 ge231 ge232 ge233 ge234 ge235 ge236 ge237 ge238 ge239 ge240 ge241 ge242 ge243 ge244 ge245 ge246 ge247 ge248 ge249 ge250 ge251 ge252 ge253 ge254 ge255

I wanted to know what d[0-55], ge-1, and ge[0-255] stand for, so that we can prepare the data better. Is there a key file for these columns?

Satish
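For anyone checking the same header, here is a minimal sketch that groups the columns purely by name prefix. It assumes nothing about what the groups mean; the function name `group_columns` and the group labels are hypothetical, and the header is rebuilt from the column names pasted above.

```python
import re

def group_columns(columns):
    """Split a header into 'd' columns, 'ge' columns, and everything else."""
    groups = {"other": [], "d": [], "ge": []}
    for name in columns:
        if re.fullmatch(r"d\d+", name):
            groups["d"].append(name)
        elif re.fullmatch(r"ge-?\d+", name):  # allows the ge-1 column
            groups["ge"].append(name)
        else:
            groups["other"].append(name)
    return groups

# Header rebuilt from the names pasted in this issue (first name kept as pasted).
header = (["ELL_LINE_NAME", "DRUG_ID", "LN_IC50"]
          + [f"d{i}" for i in range(56)]
          + [f"ge{i}" for i in range(-1, 256)])

groups = group_columns(header)
counts = {key: len(value) for key, value in groups.items()}
print(counts)  # {'other': 3, 'd': 56, 'ge': 257}
```

With this header the split comes out to 3 non-numbered columns, 56 `d` columns, and 257 `ge` columns (including `ge-1`).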

jiaqingxie commented 2 years ago

Hi, rgsatish. That's a very good question. We extracted the embeddings (the gene embedding and the drug embedding you mentioned) from our pretrained VAE model. We used the combined embedding to train the downstream linear regressor directly, instead of loading the pretrained model parameters and fine-tuning them. I now think that was not a good idea, since fine-tuning might be important for predicting IC50. I have decided to rewrite the training files and will post an update when I have finished (in several weeks, I expect). I was new to deep learning at the time, so I did not consider this problem.
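The setup described here (frozen embeddings concatenated and fed to a linear regressor) can be sketched roughly as follows. This is only an illustration on synthetic random data, not the repository's training code: the widths 56 and 257 are taken from the header in this issue, NumPy least squares stands in for whichever regressor the scripts actually use, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Widths taken from the header in this issue: d0..d55 and ge-1..ge255.
n_samples, drug_dim, gene_dim = 500, 56, 257
drug_emb = rng.normal(size=(n_samples, drug_dim))  # stand-in for the drug embedding
gene_emb = rng.normal(size=(n_samples, gene_dim))  # stand-in for the gene embedding

# The embeddings are fixed inputs here; no encoder weights are updated.
X = np.hstack([drug_emb, gene_emb])

# Synthetic target standing in for LN_IC50 (assumption: linear plus small noise).
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

# Ordinary least squares with an intercept column appended.
X1 = np.hstack([X, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ coef

print(X.shape, coef.shape)
```

Fine-tuning would differ in that the gradient of the regression loss would also flow back into the encoders that produce `drug_emb` and `gene_emb`, rather than treating them as constants.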

rgsatish commented 2 years ago

Hi Jiaqing-Xie,

Thank you for your prompt reply. Yes, it would be good to do that, because it will help us evaluate your model better. Keeping our project timeline in mind, I wanted to check again: roughly which week can I expect this data file?

Satish

jiaqingxie commented 2 years ago

Hi. Sorry to keep you waiting for 10 days. We've decided to finish the work in a new branch by August 14th.

rgsatish commented 2 years ago

Hello Jiaqing-Xie,

Just wanted to check with you on the status of the dataset preparation.

Satish

rgsatish commented 2 years ago

Hello Jiaqing Xie,

Just wanted to check with you on the status of the data. It would also be great to get some information on, or a script for, generating the embedding data that is used as input for the training dataset.

We are planning to apply this model to a different dataset, so it would be good to have some insight into this.

Regards, Satish

Satishkumar Ranganathan Ganakammal, Ph.D. | Bioinformatics Analyst IV, Cancer Data Science Initiatives (CDSI), The Frederick National Laboratory [Contractor]

The Frederick National Laboratory for Cancer Research is operated by Leidos Biomedical Research, Inc. for the National Cancer Institute.

jiaqingxie commented 2 years ago

Hi. Sorry for the delay. I've just uploaded the JTVAE part (drug embedding & generation). I am busy with another project that is about to be published, but I'll try to upload the rest in the next few days. I'll let you know when it is finished. Sorry again for the delay.

rgsatish commented 2 years ago

Hello Jiaqing-Xie,

Do you have any updates on the script to embed the gene expression data? (Are you using TPM or FPKM-UQ values?) I also wanted to check one result: the paper reports r2 = 0.845 for the pan-cancer dataset, but when I ran the train_pan.py script I got an r2 of only 0.581 and RMSE_test = 1.790. Is this expected, or should I use a different dataset to reach that r2?

Please let me know.

Satish
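When comparing a rerun against a reported number, it may help to confirm that both sides compute the metrics the same way and on the same split. Below is a minimal sketch of the usual definitions of RMSE and r2 (coefficient of determination) on toy values; it is an assumption that these match what train_pan.py reports, and the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values only; not data from the repository.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(round(rmse(y_true, y_pred), 4), round(r2_score(y_true, y_pred), 4))  # 0.3536 0.9
```

Note that r2 computed on a test split can differ noticeably from r2 on the training split, and a squared Pearson correlation is not the same quantity as this r2, so either difference could contribute to a gap like the one above.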