Ask our help about number of features in training and held-out test datasets

itmoon7 / onconpc

Clinical sequencing-based primary site classifier

GNU General Public License v2.0


Ask our help about number of features in training and held-out test datasets #3

Closed by ttyywyy 11 months ago

ttyywyy commented 12 months ago

Dear Intae Moon, Your study is very interesting and has helped us a lot in our field. One question: the training and test datasets come from different institutions, so the panels differ and they target different numbers of genes. How did you train and test your XGBoost model on two datasets that have different numbers of features?

Thank you in advance for your help.

Best regards, Yangyang Wang ttyywyy@126.com

itmoon7 commented 11 months ago

Hi Yangyang,

Thanks for the question. As shown in the paper, we created features based on genes commonly targeted across three centers; we'd like to note that these centers, DFCI, MSK, and VICC, have similar gene coverage. In Extended Data Fig. 2 of our paper, we showed this approach achieved robust performance across the three centers. We're currently investigating ways to incorporate more centers (potentially with much smaller gene coverage).

All the best, Intae
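
For readers following the thread, the idea in the reply above (building features only from genes targeted by every center) can be sketched roughly as follows. This is an illustrative sketch, not code from the onconpc repository: the panel contents, the function names `common_genes` and `featurize`, and the binary mutated/not-mutated encoding are all assumptions made for the example.

```python
# Hypothetical gene panels per center; real panels are far larger.
PANELS = {
    "DFCI": {"TP53", "KRAS", "EGFR", "BRAF", "PIK3CA"},
    "MSK": {"TP53", "KRAS", "EGFR", "BRAF", "ALK"},
    "VICC": {"TP53", "KRAS", "EGFR", "PIK3CA", "ALK"},
}


def common_genes(panels):
    """Return the sorted list of genes targeted by every panel."""
    return sorted(set.intersection(*(set(genes) for genes in panels.values())))


def featurize(mutated_genes, feature_genes):
    """Encode one sample as a 0/1 vector over the shared gene list.

    Genes outside the shared list are ignored, so samples from any
    center end up with the same number of features in the same order.
    """
    mutated = set(mutated_genes)
    return [1 if gene in mutated else 0 for gene in feature_genes]


# Fixed feature order shared by training and held-out test data.
FEATURE_GENES = common_genes(PANELS)

# Hypothetical samples from two different centers.
dfci_sample = featurize({"TP53", "PIK3CA"}, FEATURE_GENES)
msk_sample = featurize({"KRAS", "ALK"}, FEATURE_GENES)

print(FEATURE_GENES)
print(dfci_sample, msk_sample)
```

Because every sample is projected onto the same ordered gene list, a model trained on data from one center can be applied to data from another without a feature-count mismatch. The trade-off is that any gene missing from even one panel is dropped, which is presumably why adding centers with much smaller gene coverage is harder.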