Open pbpearman opened 5 years ago
Hi again Thierry, I ran the same procedure on the stickleback data from the tutorial. I imputed the missing genotypes again in genomic converter, obtaining the saved imputed and unimputed datasets. I then used them both with imputation_accuracy().
stickle <- genomic_converter("example_vcf2dadi_ferchaud_2015.vcf",
filename = "stickle1",
strata="strata.stickleback.tsv",
imputation.method = "rf")
I got:
#######################################################################
###################### grur: imputations_accuracy #########################
#######################################################################
Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 31790 markers not in common between datasets
Information dropped from the analysis
Information dropped from the analysis
$markers.dropped
[1] "groupI2836223825290" "groupI2836624653314" "groupI2836624653352"
[4] "groupI2836624653353" "groupI2838225273727" "groupI2845028141141"
[7] "groupI2845028141145" "groupI345256537935" "groupI345256537939"
[10] "groupI345256537953" "groupI345256537980" "groupI567412268369"
[13] "groupI567412268370" "groupI603417836877" "groupII163217623945" ...
The original number of markers was 31802. I checked the values in the resulting object, 'stickle', and all classification errors for individuals, populations etc, were all listed as zero. I also tried using the imputed and unimputed datasets that are in the object produced by genomic_converter(). Again, about 99% of the markers were dropped. I examined the marker names of the two datasets. They are identical, as far as I can see.
Is this right? Why are almost all the markers showing as not in common among the two datasets? This is not the behavior I expected. Best, Peter
Hi Peter, definitely a bug. This function was not tested with lots of datasets and it's in the treatment of markers names before and after imputations the problem.
Hi Thierry, I have filtered a .vcf dataset so to create a version with low missingness of sites and samples. I then run the following lines to impute genotypes and output .rad and .tped files, with and without rf imputation:
The run finishes with the following results section:
############################### RESULTS ############################### Data format of input: vcf.file Biallelic data Number of common markers: 156 Number of chromosome/contig/scaffold: 113 Number of individuals 551 Number of populations 56
I then check the imputation accuracy, but all 156 markers are dropped, evidently for not being shared:
Data provided still contains missing genotypes, accuracy will be mesured on common non-missing genotypes Removing 156 markers not in common between datasets $
Information dropped from the analysis
$Information dropped from the analysis
$markers.dropped [1] "Erumb1_s0109580385621" "Erumb1_s00047668134936" [3] "Erumb1_s0064804568686" "Erumb1_s0130817588428"[5] "Erumb1_s0228196711151" "Erumb1_s0228970911265" .....
Misclassification Error: by populations
A tibble: 0 x 2 ... with 2 variables: POP_IDMisclassification Error: by individuals
A tibble: 0 x 2 ... with 2 variables: INDIVIDUALSMisclassification Error: by markers
A tibble: 0 x 2 ... with 2 variables: MARKERSMisclassification Error: overall
[1] NaNI think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know. Best, Peter