Issues using tidy_vcf function

laurabenestan commented 4 years ago

Describe the bug I have tried to calculate Fst on a vcf file obtained through a "home-made" ddocent pipeline. To do so, I first need to import and transform the vcf in a tidy data frame in R.

I used the following command: data <- radiator::tidy_vcf( data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf", strata = "pop_map_Amp_a.txt", gt = TRUE, tidy.check = FALSE# skip the test for number of markers, otherwise it ask during tidying... )

It seems that it recognised my pop strata and the number of individuals in the VCF. Yet I am getting the following error message at the end (see screenshot)

Thanks in advance for your help,

Laura

thierrygosselin commented 4 years ago

Salut Laura,

looks like a radiator problem, not assigner. But from the look of it it's a problem with haplotypes and VCF...

Send the strata file and the vcf (or parts of it) by email, I'll have a look at it.

thierrygosselin commented 4 years ago

I usually have no problem detecting dDocent VCF, but since you seems to have modified it, it my be just that that interfere with my detection. Let's check it out with your vcf...

thierrygosselin commented 4 years ago

A couple of problems that I see with the data you sent

Here is the details

check1: the VCF format

radiator::detect_genomic_format(data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf")
[1] "vcf.file" # good

check2: The id in the VCF

ids <- radiator::extract_individuals_vcf(data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf")
ids$INDIVIDUALS <- radiator::clean_ind_names(ids$INDIVIDUALS) # to match strata that does the cleaning
# so far so good

check3: The id in the strata

strata <- radiator::read_strata(strata = "pop_map_Amp_a.txt", verbose = TRUE) %$% strata

check4: Compare IDs in the strata and VCF...

identical(strata$INDIVIDUALS, ids$INDIVIDUALS)
# FALSE
setdiff(strata$INDIVIDUALS, ids$INDIVIDUALS) 
# [1] "Amp-a-17-MF-192" # This id is not in the VCF
setdiff(ids$INDIVIDUALS, strata$INDIVIDUALS)
# [1] "Amp-a-16-MY-793" # This sample is in the VCF but not in the strata file...

First problem I see, this might be intentional or not, usually having a strata with less sample is ok and during the import process it's used as a blacklist. But having samples in the strata not present in the VCF is not good practice.

I'm not sure I want to invest time to make an additional check...

Generate a new strata file with sample actually IN THE VCF...

strata <- ids %>% 
dplyr::mutate(STRATA = stringi::stri_sub(str = INDIVIDUALS, from = 10, to = 11)) %>% 
readr::write_tsv(x = ., path = "strata_laura_20200124.tsv")

Let's test both strata on the dataset for reading the vcf

With Laura's strata:

check5 <- radiator::read_vcf(
  data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf", 
  strata = "pop_map_Amp_a.txt", 
  verbose = TRUE
)
# Error: Argument 2 must be length 168015, not 169671

With new strata:

check6 <- radiator::read_vcf(
  data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf", 
  strata = "strata_laura_20200124.tsv", 
  verbose = TRUE
)
# Still an error...
# Error: Argument 2 must be length 175320, not 177048

I ran the code by hand and the problem is when generating the summary per individuals. The AD field in the VCF as a wrong length.

Now before I invest more time in checking to fix this I must ask:

The vcf is from dDocent that uses freeBayes (version : freeBayes v1.3.1-dirty). It's the first time, in a long time, I've had problem with freeBayes or dDocent/freeBayes combo.

Question 1: So what did you do exactly to the original vcf that came out of dDocent ?

Question 2: Could you send me the original vcf that came out of dDocent?

This would allow me to check if the problem is:

from dDocent...
from you manipulating the VCF...
from me (using SeqArray) using this version of freeBayes VCF...

Cheers T

thierrygosselin commented 4 years ago

In the VCF header it says that in the FORMAT field you should find 10 genotypes metadata:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">

When in reality there's just 8: GT:DP:AD:RO:QR:AO:QA:GL

thierrygosselin commented 4 years ago

The genotype field (GT) as the appropriate length in the SeqArray GDS: 2x24x7305

DP (the read depth) is ok: 24x7305 (sample x markers) AD (alleles depth) are not ok: 24x14754 (sample x markers x 2) it should be 24x14610

thierrygosselin commented 4 years ago

So the freeBayes VCF parsing by SeqArray is not ok... unless I'm missing something here

thierrygosselin commented 4 years ago

Do you mind if I share the VCF file with SeqArray dev ?

thierrygosselin commented 4 years ago

Since this is just for assignment analysis, you could also generate a VCF with the GT field and use it inside assigner... it might be faster for you.

laurabenestan commented 4 years ago

Hi Thierry,

Thanks a lot ! I do not mind if you share the vcf with them. I can provide you another vcf (we have a total of 21 species with vcf like that!). Maybe it will help to figure out where the problem are coming from. Laura

On January 24, 2020 at 8:33 AM, Thierry Gosselin notifications@github.com wrote:

Do you mind if I share the VCF file with SeqArray dev ? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

thierrygosselin commented 4 years ago

Ok so strategically let’s look if it not you during VCF modification: send me a vcf straight from dDocent (unfiltered besides dDocent)
If this still doesn’t work or if I cannot identify the cause and fix it i’le sens the vcf. No worry I make it anonymous first by replacing markers and sample IDs

thierrygosselin commented 4 years ago

Definitely a SeqArray problem. I've sent the VCF to Xiuwen.

thierrygosselin commented 4 years ago

Ok so not a SeqArray problem.

SeArray import what it reads. The problem is that it's a weird haplotype VCF.

e.g. with a VCF where all the contains info are removed (this is causing too much trouble with SeqArray: remove manually). One of the line causing a parsing failure, line 75

dDocent_Contig_15   178 .   C   T   2932.18 .   AB=0.46;ABP=3.70517;AC=4;AF=0.0909091;AN=44;AO=23;CIGAR=2X2M2X;DP=329;DPB=329;DPRA=0.877193;EPP=52.9542;EPPR=454.677;GTI=1;LEN=1;MEANALT=1.75;MQM=60;MQMR=60;NS=22;NUMALT=3;ODDS=0;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=836;QR=7585;RO=208;RPL=23;RPP=52.9542;RPPR=454.677;RPR=0;RUN=1;SAF=0;SAP=52.9542;SAR=23;SRF=0;SRP=454.677;SRR=208;TYPE=snp;technology.Illumina=1   GT:DP:AD:RO:QR:AO:QA:GL 0/0:12:12,0,0,0:12:432:0,0,0:0,0,0:0,-3.61236,-39.2188,-3.61236,-39.2188,-39.2188,-3.61236,-39.2188,-39.2188,-39.2188   0/0:12:12,0,0,0:12:444:0,0,0:0,0,0:0,-3.61236,-40.3063,-3.61236,-40.3063,-40.3063,-3.61236,-40.3063,-40.3063,-40.3063   0/0:26:26,0,0,0:26:955:0,0,0:0,0,0:0,-7.82678,-86.2682,-7.82678,-86.2682,-86.2682,-7.82678,-86.2682,-86.2682,-86.2682   0/0:16:0,10,0,6:0:0:10,0,6:366,0,208:-47.1754,-17.2518,-14.2415,-47.1754,-17.2518,-47.1754,-30.2773,0,-30.2773,-28.4712 0/0:14:12,0,0,2:12:444:0,0,2:0,0,74:-2.81146,-6.42382,-42.748,-6.42382,-42.748,-42.748,0,-36.694,-36.694,-36.0919   0/1:12:0,6,6,0:0:0:6,6,0:218,218,0:-35.9697,-18.1664,-16.3602,-18.1664,0,-16.3602,-35.9697,-18.1664,-18.1664,-35.9697   0/0:8:8,0,0,0:8:280:0,0,0:0,0,0:0,-2.40824,-25.5375,-2.40824,-25.5375,-25.5375,-2.40824,-25.5375,-25.5375,-25.5375  0/1:9:4,0,5,0:4:148:0,5,0:0,185,0:-14.3007,-15.5049,-27.6129,0,-12.4778,-10.9727,-15.5049,-27.6129,-12.4778,-27.6129    0/0:15:15,0,0,0:15:551:0,0,0:0,0,0:0,-4.51545,-49.929,-4.51545,-49.929,-49.929,-4.51545,-49.929,-49.929,-49.929 0/0:14:0,14,0,0:0:0:14,0,0:518,0,0:-46.9624,-4.21442,0,-46.9624,-4.21442,-46.9624,-46.9624,-4.21442,-46.9624,-46.9624   1/0:21:0,0,10,11:0:0:0,10,11:0,367,400:-63.0351,-63.0351,-63.0351,-33.0322,-33.0322,-30.0219,-30.3681,-30.3681,0,-27.0567   0/0:8:8,0,0,0:8:284:0,0,0:0,0,0:0,-2.40824,-25.9017,-2.40824,-25.9017,-25.9017,-2.40824,-25.9017,-25.9017,-25.9017  0/0:13:5,8,0,0:5:183:8,0,0:296,0,0:-23.0808,0,-12.9134,-24.5859,-15.3216,-39.5402,-24.5859,-15.3216,-39.5402,-39.5402   0/0:26:25,0,0,1:25:914:0,0,1:0,0,37:0,-7.52575,-82.2096,-7.52575,-82.2096,-82.2096,-4.12895,-79.1824,-79.1824,-78.8813  0/0:26:26,0,0,0:26:939:0,0,0:0,0,0:0,-7.82678,-84.8248,-7.82678,-84.8248,-84.8248,-7.82678,-84.8248,-84.8248,-84.8248   0/0:6:0,6,0,0:0:0:6,0,0:213,0,0:-19.5151,-1.80618,0,-19.5151,-1.80618,-19.5151,-19.5151,-1.80618,-19.5151,-19.5151  0/0:35:21,14,0,0:21:766:14,0,0:515,0,0:-36.1554,0,-58.7306,-42.477,-62.9451,-105.056,-42.477,-62.9451,-105.056,-105.056 0/0:16:10,0,0,6:10:370:0,0,6:0,0,222:-15.5216,-18.5319,-48.802,-18.5319,-48.802,-48.802,0,-30.6399,-30.6399,-28.8338    0/0:14:6,8,0,0:6:216:8,0,0:295,0,0:-22.6889,0,-15.5751,-24.4951,-17.9834,-42.1149,-24.4951,-17.9834,-42.1149,-42.1149   0/0:16:16,0,0,0:16:585:0,0,0:0,0,0:0,-4.81648,-52.9862,-4.81648,-52.9862,-52.9862,-4.81648,-52.9862,-52.9862,-52.9862   0/1:8:0,6,2,0:0:0:6,2,0:222,66,0:-23.8579,-5.66589,-3.85971,-18.5319,0,-17.9298,-23.8579,-5.66589,-18.5319,-23.8579

shows only 2 alleles: C and T but the AD field as 4 values. Usually I would expect 4 alleles.

Although it says haplotype VCF it looks like a biallelic hybrid.

Anything to explain this ?

thierrygosselin commented 4 years ago

Because the AD info doesn't look reliable here, I'm going to write something that detects freeBayes and dirty in the VCF header and will not import this info in the GDS.

Unless you have something else to propose?

thierrygosselin commented 4 years ago

I've never seen those problem with dDocent before so until further information on this version, when freeBayes and dirty is detected in the header it keeps only the GT and DP fields. This is with the latest radiator release (you'll have to reinstall radiator)...

I've tested your 2 datasets and it works. Below example of what I tested with the first dataset sent:

test1 <- radiator::read_vcf(
  data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf",
  strata = "strata_laura_20200124.tsv"
)

test2 <- radiator::tidy_genomic_data(
  data = "Amp_a_TotalSNPs_filtred_without_contigs.haplo.vcf",
  strata = "strata_laura_20200124.tsv",
  gt = TRUE,
  tidy.check = FALSE
)

test3 <- assigner::assignment_ngs(
  data = test2,
  assignment.analysis = "gsi_sim",
  marker.number = c(100, 200, "all"),
  markers.sampling = "ranked",
  thl = 0.3
  )

Note that for unbiased assignment, you should get rid of these samples:

Amp-a-16-MY-103
Amp-a-16-MY-759
Aca-n-17-SC-216

They are either another specie or something went wrong during the wet lab with those 3...

Feel free to reopen the issue (in radiator this time :) if you're still experiencing a problem

thierrygosselin / assigner

Issues using tidy_vcf function #18