parklab / MosaicForecast

A mosaic detecting software based on phasing and random forest
MIT License

Question on Training RF Models with Refined Genotypes #42

Closed JunhuiLi1017 closed 7 months ago

JunhuiLi1017 commented 7 months ago

Hi Yanmei,

I have a quick question about training Random Forest (RF) models on my own dataset with the refined model method. You mentioned that "it's ok to manually-check ~100 hap=3 sites with igv". Should variants from all of the refined genotypes ("mosaic", "het", "refhom", and "repeat") be included in the training set? Additionally, do you have any recommendations for the number of variants needed for each refined genotype?

--Junhui

douym commented 7 months ago

Hi Junhui,

Sorry for the confusion. Based on my experience, only the "hap=3" category needs to be checked further with IGV, since these are the variants most likely to be split further into "mosaic" and "repeat". "hap=2" sites are most probably "het", and "hap>3" sites are most probably "repeat", so an IGV check is not necessary for those sites. You could use ~200-300 variants in total to train the refine model (these include hap2->het, hap3->repeat/mosaic, hap>3->repeat). Hope this solves your problem.

Best,

Yanmei
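
For readers assembling such a training set, the labelling rule described in the reply above can be sketched roughly as follows. This is only an illustration, not code from MosaicForecast: the function name `refined_label`, the tuple layout of `sites`, and the idea of storing the manual IGV call as a string are all assumptions, and `hap` is assumed to be available as an integer haplotype count.

```python
from collections import Counter

def refined_label(hap, igv_call=None):
    """Hypothetical helper: map a phasing category to a refined genotype label.

    hap      -- assumed integer number of haplotypes observed at the site
    igv_call -- manual IGV decision, only consulted for hap=3 sites
    Returns None when the site cannot be labelled yet.
    """
    if hap == 2:
        return "het"      # hap=2 -> het, no IGV check needed
    if hap > 3:
        return "repeat"   # hap>3 -> repeat, no IGV check needed
    if hap == 3:
        # hap=3 sites are the only ones that need a manual IGV call
        return igv_call if igv_call in ("mosaic", "repeat") else None
    return None           # anything else is left out of this sketch

# Made-up example sites: (site id, hap, manual IGV call or None)
sites = [
    ("chr1:1000", 2, None),
    ("chr1:2000", 3, "mosaic"),
    ("chr2:3000", 3, "repeat"),
    ("chr2:4000", 4, None),
    ("chr3:5000", 3, None),   # hap=3 but not yet checked in IGV
]

labelled = [(site_id, refined_label(hap, call)) for site_id, hap, call in sites]
# Keep only sites that received a label
training = [(site_id, label) for site_id, label in labelled if label is not None]
counts = Counter(label for _, label in training)
```

Inspecting `counts` gives a quick check that the total lands near the suggested ~200-300 variants and that every label is represented before training.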

JunhuiLi1017 commented 7 months ago

Hi Yanmei,

Thanks for the clarification; this is very helpful.

Best, Junhui