lemieuxl / pyGenClean

Automated genetic data clean up procedure in Python.
GNU General Public License v3.0

Genome file should be computed using autosomal markers only #16

Closed lemieuxl closed 8 years ago

lemieuxl commented 8 years ago

When creating the genome file, only the autosomal markers should be present in the dataset (for computing the IBS and MDS files).

This should only affect the `find_related_samples` script, since the same input data will be used for creating the MDS file in the `check_ethnicity` script.

lemieuxl commented 8 years ago

This can be done between the pruning and the extraction of the pruned markers.

We can simply do a set comparison between the pruned markers (in the *.prune.in file) and the autosomal markers (from the bim file), keeping only the markers present in both.
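A minimal sketch of that set comparison, assuming a whitespace-delimited bim file with the chromosome in the first column and the marker name in the second, numeric chromosome codes where 1 to 22 are the autosomes, and a prune.in file with one marker name per line. The function names and the output file are hypothetical and are not taken from pyGenClean:

```python
# Assumption: numeric chromosome codes, where 1 to 22 are the autosomes.
AUTOSOMES = frozenset(str(c) for c in range(1, 23))


def read_autosomal_markers(bim_path):
    """Return the names of the markers located on chromosomes 1 to 22."""
    markers = set()
    with open(bim_path) as bim:
        for line in bim:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            chrom, name = fields[0], fields[1]
            if chrom in AUTOSOMES:
                markers.add(name)
    return markers


def read_pruned_markers(prune_in_path):
    """Return the marker names listed in a prune.in file (one per line)."""
    with open(prune_in_path) as prune_in:
        return {line.strip() for line in prune_in if line.strip()}


def keep_autosomal_pruned(bim_path, prune_in_path, out_path):
    """Write the pruned markers that are also autosomal, one per line."""
    kept = read_pruned_markers(prune_in_path) & read_autosomal_markers(bim_path)
    with open(out_path, "w") as out:
        for name in sorted(kept):
            out.write(name + "\n")
    return kept
```

The resulting file could then be used in place of the original prune.in list when extracting the pruned markers.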

lemieuxl commented 8 years ago

It looks like the --indep-pairwise option already selects autosomal markers (it processes only autosomal markers). As described in Plink's manual:

[...] it is probably best to apply this analysis to a subset that are pruned to be in approximate linkage equilibrium, say on the order of 50,000 autosomal SNPs. Use the --indep-pairwise and --indep commands to achieve this, described here.

lemieuxl commented 8 years ago

After more scrutiny, it appears that markers on the sex chromosomes are in fact included in the pruning output (contrary to what is written in Plink's documentation). We need to address this.
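One way to check this on a given dataset is to count the pruned markers per chromosome. The sketch below is only an illustration: the function name is hypothetical, and it assumes the same bim layout (chromosome in the first column, marker name in the second) and a prune.in file with one marker name per line. With Plink's numeric codes, a non-zero count for 23 (X) or 24 (Y) would mean that sex chromosome markers are present in the pruned list:

```python
from collections import Counter


def count_pruned_by_chromosome(bim_path, prune_in_path):
    """Count the markers listed in prune.in for each chromosome code."""
    with open(prune_in_path) as prune_in:
        pruned = {line.strip() for line in prune_in if line.strip()}

    counts = Counter()
    with open(bim_path) as bim:
        for line in bim:
            fields = line.split()
            # fields[0] is the chromosome code, fields[1] the marker name
            if len(fields) >= 2 and fields[1] in pruned:
                counts[fields[0]] += 1
    return counts
```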