thierrygosselin / grur

grur: an R package tailored for RADseq data imputations
https://thierrygosselin.github.io/grur/
7 stars 0 forks source link

imputations_accuracy() lists all markers as not in common and drops all from analysis #9

Open pbpearman opened 5 years ago

pbpearman commented 5 years ago

Hi Thierry, I have filtered a .vcf dataset so to create a version with low missingness of sites and samples. I then run the following lines to impute genotypes and output .rad and .tped files, with and without rf imputation:

test2 <- genomic_converter(data = "UMBELLA_Erumb1_samples_gt_90pct_snps_gt_40pct_covered.recode.vcf",output="plink",filename = "test_2",strata="strata_eu_40.tsv", imputation.method = "rf")

The run finishes with the following results section:

############################### RESULTS ############################### Data format of input: vcf.file Biallelic data Number of common markers: 156 Number of chromosome/contig/scaffold: 113 Number of individuals 551 Number of populations 56

I then check the imputation accuracy, but all 156 markers are dropped, evidently for not being shared:

> imp_acc <- imputations_accuracy("test_2.rad","test_2_imputed.rad")

Data provided still contains missing genotypes, accuracy will be mesured on common non-missing genotypes Removing 156 markers not in common between datasets $Information dropped from the analysis $Information dropped from the analysis$markers.dropped [1] "Erumb1_s0109580385621" "Erumb1_s00047668134936" [3] "Erumb1_s0064804568686" "Erumb1_s0130817588428"
[5] "Erumb1_s0228196711151" "Erumb1_s0228970911265" .....

Misclassification Error: by populations A tibble: 0 x 2 ... with 2 variables: POP_ID , ME

Misclassification Error: by individuals A tibble: 0 x 2 ... with 2 variables: INDIVIDUALS , ME

Misclassification Error: by markers A tibble: 0 x 2 ... with 2 variables: MARKERS , ME

Misclassification Error: overall [1] NaN

I think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know. Best, Peter

pbpearman commented 5 years ago

Hi again Thierry, I ran the same procedure on the stickleback data from the tutorial. I imputed the missing genotypes again in genomic converter, obtaining the saved imputed and unimputed datasets. I then used them both with imputation_accuracy().

stickle <- genomic_converter("example_vcf2dadi_ferchaud_2015.vcf",
filename = "stickle1", 
strata="strata.stickleback.tsv", 
imputation.method = "rf")

I got:

####################################################################### ###################### grur: imputations_accuracy ######################### ####################################################################### Data provided still contains missing genotypes, accuracy will be mesured on common non-missing genotypes Removing 31790 markers not in common between datasets Information dropped from the analysis Information dropped from the analysis$markers.dropped [1] "groupI2836223825290" "groupI2836624653314" "groupI2836624653352"
[4] "groupI2836624653353" "groupI2838225273727" "groupI2845028141141"
[7] "groupI2845028141145" "groupI345256537935" "groupI345256537939"
[10] "groupI345256537953" "groupI345256537980" "groupI567412268369"
[13] "groupI567412268370" "groupI603417836877" "groupII163217623945" ...

The original number of markers was 31802. I checked the values in the resulting object, 'stickle', and all classification errors for individuals, populations etc, were all listed as zero. I also tried using the imputed and unimputed datasets that are in the object produced by genomic_converter(). Again, about 99% of the markers were dropped. I examined the marker names of the two datasets. They are identical, as far as I can see.

Is this right? Why are almost all the markers showing as not in common among the two datasets? This is not the behavior I expected. Best, Peter

thierrygosselin commented 5 years ago

Hi Peter, definitely a bug. This function was not tested with lots of datasets and it's in the treatment of markers names before and after imputations the problem.