szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

Request: option to allow missing data. #17

Open magicDGS opened 8 years ago

magicDGS commented 8 years ago

In release 1.0.1, you removed support for missing data. Would it be possible to provide an option for datasets with a lot of missing data? In some datasets, every SNP has at least one haplotype with missing data.

Thank you very much.

szpiech commented 8 years ago

selscan has never supported missing data, although an early version may not have appropriately warned and exited when it was found. I would be interested in adding support for it; however, it isn't a trivial problem. The rehh program supports missing data, although it isn't clear to me how that implementation affects the statistics when there is a lot of missing data.

A program like SHAPEIT2 offers an option to impute missing genotypes without a reference during phasing. If you have regions that are overwhelmed with missing data, I'd likely recommend deleting the SNPs in that region and providing a --max-gap value that will ensure skipping it.
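As a rough illustration of the "delete the SNPs, then pick a --max-gap" suggestion, here is a minimal Python sketch. It is not part of selscan: the function names, the in-memory layout (one list of alleles per haplotype), and the use of `None` for a missing allele are all assumptions made for the example. It drops SNPs whose missing fraction exceeds a threshold and reports the largest physical gap left behind, which may help when choosing a `--max-gap` value smaller than that gap.

```python
from typing import List, Optional, Sequence, Tuple

Allele = Optional[int]  # assumption: None marks a missing allele


def drop_high_missing_snps(
    haps: Sequence[Sequence[Allele]],
    positions: Sequence[int],
    max_missing_frac: float,
) -> Tuple[List[List[Allele]], List[int]]:
    """Keep only SNPs whose fraction of missing alleles is <= max_missing_frac."""
    n_hap = len(haps)
    keep = []
    for j in range(len(positions)):
        n_missing = sum(1 for h in haps if h[j] is None)
        if n_missing / n_hap <= max_missing_frac:
            keep.append(j)
    kept_haps = [[h[j] for j in keep] for h in haps]
    kept_pos = [positions[j] for j in keep]
    return kept_haps, kept_pos


def largest_gap(positions: Sequence[int]) -> int:
    """Largest distance between consecutive (sorted) SNP positions."""
    return max((b - a for a, b in zip(positions, positions[1:])), default=0)


if __name__ == "__main__":
    # Toy data: 4 haplotypes x 5 SNPs, positions in bp (made up for illustration).
    haps = [
        [0, 1, None, 0, 1],
        [0, None, None, 1, 1],
        [1, 1, None, 0, 0],
        [0, 1, 0, 1, None],
    ]
    positions = [100, 200, 300, 5000, 5100]
    kept_haps, kept_pos = drop_high_missing_snps(haps, positions, 0.25)
    print(kept_pos)
    print(largest_gap(kept_pos))
```

Note that the filtered SNPs can still contain missing alleles (here, up to 25% per SNP), so this step alone would not produce input that selscan accepts; the remaining gaps would still need imputation or a stricter threshold.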

magicDGS commented 8 years ago

Thank you very much for your answer. I thought that selscan supported missing data because of the changelog info (10APR2014), but I was wrong. It is true that it is not a trivial problem, because haplotypes start to disappear when missing data is introduced.
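To make the "haplotypes start to disappear" point concrete, here is a small Python sketch. It assumes one naive way of handling missing data, namely discarding any haplotype that has a missing allele between the core SNP and the current SNP; this is not how selscan or rehh is documented to behave, and the function name and `None` coding are hypothetical. It counts how many haplotypes remain fully observed as the window extends away from the core.

```python
from typing import List, Optional, Sequence

Allele = Optional[int]  # assumption: None marks a missing allele


def complete_haplotype_counts(
    haps: Sequence[Sequence[Allele]], core: int
) -> List[int]:
    """For each SNP index from `core` rightwards, count haplotypes with no
    missing allele anywhere in the window [core, end]."""
    n_snps = len(haps[0])
    counts = []
    for end in range(core, n_snps):
        n_complete = sum(
            1 for h in haps if all(a is not None for a in h[core:end + 1])
        )
        counts.append(n_complete)
    return counts


if __name__ == "__main__":
    # Toy data: 4 haplotypes x 5 SNPs (made up for illustration).
    haps = [
        [0, 1, None, 0, 1],
        [0, None, None, 1, 1],
        [1, 1, None, 0, 0],
        [0, 1, 0, 1, None],
    ]
    print(complete_haplotype_counts(haps, core=0))
```

Because a single missing allele removes a haplotype from every longer window, the usable sample can only shrink with distance from the core, which is why scattered missing data erodes an EHH-style statistic so quickly under this kind of handling.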

Regarding using SHAPEIT2 to impute missing genotypes, my main problem is that there is no program to impute data for haploids (nor to compute any EHH statistic in haploids). Of course, I was planning to remove SNPs with a lot of missing data, but there is almost always at least one haplotype missing the SNP.

Anyway, thank you very much for your feedback. I will first try rehh with the missing data, and if that does not work, I will impute the data somehow.