knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

extract.haps() does not handle all missing data #121

Open fdchevalier opened 5 years ago

fdchevalier commented 5 years ago

Hello,

I have an issue related to missing data with the function extract.haps(). I have diploid data that was processed by different tools before being loaded in R. One of the tools seems to recode the filtered out data with a single . and not ./.. This has some consequences when this coded missing data is the first value of the dataset: the ploidy is not recognized and the function returns an unexpected result.

Here is a reproducible example:

data(vcfR_test)

extract.haps(vcfR_test, unphased_as_NA=F)
#Variant 3 processed
#           NA00001_0 NA00001_1 NA00002_0 NA00002_1 NA00003_0 NA00003_1
#rs6054257  "G"       "G"       "A"       "G"       "A"       "A"
#20_17330   "T"       "T"       "T"       "A"       "T"       "T"
#rs6040355  "G"       "T"       "T"       "G"       "T"       "T"
#20_1230237 "T"       "T"       "T"       "T"       "T"       "T"
#microsat1  "GTC"     "G"       "GTC"     "GTCT"    "G"       "G"

# Replacing first cell of sample data with missing data coded by a single dot
vcfR_test@gt[1,2] <- ".:.:.:.,."

extract.haps(vcfR_test, unphased_as_NA=F)
#           NA00001 NA00002 NA00003
#rs6054257  NA      "1|0"   "1/1"
#20_17330   NA      "0|1"   "0/0"
#rs6040355  "1|2"   "2|1"   "2/2"
#20_1230237 "0|0"   "0|0"   "0/0"
#microsat1  "0/1"   "0/2"   "1/1"

Here is the code that leads to this behavior: https://github.com/knausb/vcfR/blob/c76017b6bb66db44b4d84a3a61a1dc27fa53a8ec/R/extract_gt.R#L174-L177

A solution would be to not search for the first non NA value but to actually search for the genotype separator (/ or | or : if haploid data). Here is the piece of code that fixes it:

first.gt <- unlist(strsplit(x@gt[,-1][grep("^[0-9]*[\\|/:]",x@gt[,-1])][1], ":"))[1] 

This is different from #29 which, if I am right, is the C function that will actually extract the haplotypes.

Anyway, thank you for fixing this.

Fred

knausb commented 5 years ago

Hi @fdchevalier,

Thanks for bringing this to my attention! I currently do not have any projects using haploids, so I wan't likely to discover this. And I feel your post is timely. In this issue https://github.com/knausb/vcfR/issues/117#issue-381217071 that I'm currently working on I should make sure addresses your issue as well. This comment is here to remind me they're related.

Thanks!