thierrygosselin / assigner

Population assignment analysis using R
http://thierrygosselin.github.io/assigner
GNU General Public License v3.0
17 stars 6 forks source link

Error in cbind_all(x) #13

Closed ydorant closed 5 years ago

ydorant commented 5 years ago

Hi Thierry,

I'm running assigner to test assignment success between two large groups of samples and I found a weird error message occurring in the marker.number option.

What is the error ?: when I define a marker list > 3 elements (last element corresponding to all markers) it crash.

VCF file detailed version 4.2 from stacks v.1.48 N individuals = 4066 N snps = 7151

R Script example

library(assigner)

message("perform assignment...")
#Two Regions - north & south
assign1 <- assigner::assignment_ngs(
  data = "batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf",
  marker.number = c(100, 500, 1000, 2500, 6000, "all"),
  sampling.method = "ranked",
  thl = 0.5, iteration.method = 10,
  assignment.analysis = "gsi_sim",
  subsample = 1000,
  iteration.subsample=5,
  strata = "strata_NS.txt",
  parallel.core = 5,
  #common.markers = FALSE,
  #snp.ld=FALSE,
  folder = "THL_0.5_iteration_10_regions-sampling_NOR_SOU_test")

Shell error: radiator is working on file ... VCF is biallelic updating markers metadata and stats [====...] 100% Error in cbind_all(x) : Argument 3 must be length 7151, not 63

So 7151 correspond to the nSNPs number however I don't understand the 63

Bioinformatic resources: Unix server calcul (SLURM) CPU allowed from 2 to 8 (same error) RAM allowed 20G R v.3.5.1

Something weird is that when I fix the marker.number to only 3 element as c(100,1000,7151) it works...

Do you have an idea about this ? I think that the pb is located in radiator vcf importation.

Best, Yann.

thierrygosselin commented 5 years ago

Hi Yann, yes it looks like a vcf problem. Could you send by email:

"batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf",
"strata_NS.txt"

or subset of the vcf so that I can reproduce the error...

Best Thierry

thierrygosselin commented 5 years ago

you can generate a subset with this command:

#TERMINAL
head -n 200 batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf > test.assigner.vcf
ydorant commented 5 years ago

Thanks Thierry, I prepared a Repo for you with three data files

You can find the repository at : https://github.com/ydorant/assigner_bugs Yann.

thierrygosselin commented 5 years ago

Yann everything should work now (it was a radiator problem not assigner)

devtools::install_github("thierrygosselin/radiator")
devtools::install_github("thierrygosselin/assigner")
assigner::install_gsi_sim(fromSource = TRUE)

test1

test1 <- assigner::assignment_ngs(
  data = "lobster_subset_snps.vcf",
  marker.number = c(100, 200, "all"),
  sampling.method = "ranked",
  thl = 0.9,
  iteration.method = 10,
  assignment.analysis = "gsi_sim",
  strata = "strata_NS.txt")
fig <- test1$plot.assignment  + 
  ggplot2::facet_grid(SUBSAMPLE~CURRENT) + 
  ggplot2::scale_y_continuous(limits = c(0,100))
fig

test2 with subsample

test2 <- assigner::assignment_ngs(
  data = "lobster_subset_snps.vcf",
  marker.number = c(100, 200, "all"),#, 1000, 2500, 6000, "all"),
  sampling.method = "ranked",
  thl = 0.9,
  iteration.method = 10,
  assignment.analysis = "gsi_sim",
  subsample = 1000,
  iteration.subsample = 5,
  strata = "strata_NS.txt")
fig <- test2$plot.assignment  + 
  ggplot2::facet_grid(SUBSAMPLE~CURRENT) + 
  ggplot2::scale_y_continuous(limits = c(0,100))
fig

Friendly reminder before running assignment with assigner

Individuals with very high coverage will have very high heterozygosity, most of the time. Heterozygous individuals will generate artifacts, they are close to everybody because they have both copy of the SNP alleles, reducing assignment and Fst.

  1. run this:
vcf.qc <- radiator::tidy_vcf(
data = "batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf", 
strata = "strata_NS.txt",
verbose = TRUE,
path.folder = "lobster.vcf.qc")

For more info : check out ??radiator::tidy_vcf

  1. using the tidy dataset you could run these 3 functions:
    dup <- radiator::detect_duplicate_genomes(data = vcf.qc$tidy.data)
    mix <- radiator::detect_mixed_genomes(data = vcf.qc$tidy.data)
    err <- radiator::detect_het_outliers(data = vcf.qc$tidy.data)

Duplicates are usually samples within the range 0-0.25 Closekin are usually samples within the range 0.25-0.75

missing data can create: