Error in cbind_all(x) - Githubissues

ydorant commented 5 years ago

Hi Thierry,

I'm running assigner to test assignment success between two large groups of samples and I found a weird error message occurring in the marker.number option.

What is the error ?: when I define a marker list > 3 elements (last element corresponding to all markers) it crash.

VCF file detailed version 4.2 from stacks v.1.48 N individuals = 4066 N snps = 7151

R Script example

library(assigner)

message("perform assignment...")
#Two Regions - north & south
assign1 <- assigner::assignment_ngs(
  data = "batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf",
  marker.number = c(100, 500, 1000, 2500, 6000, "all"),
  sampling.method = "ranked",
  thl = 0.5, iteration.method = 10,
  assignment.analysis = "gsi_sim",
  subsample = 1000,
  iteration.subsample=5,
  strata = "strata_NS.txt",
  parallel.core = 5,
  #common.markers = FALSE,
  #snp.ld=FALSE,
  folder = "THL_0.5_iteration_10_regions-sampling_NOR_SOU_test")

Shell error: radiator is working on file ... VCF is biallelic updating markers metadata and stats [====...] 100% Error in cbind_all(x) : Argument 3 must be length 7151, not 63

So 7151 correspond to the nSNPs number however I don't understand the 63

Bioinformatic resources: Unix server calcul (SLURM) CPU allowed from 2 to 8 (same error) RAM allowed 20G R v.3.5.1

Something weird is that when I fix the marker.number to only 3 element as c(100,1000,7151) it works...

Do you have an idea about this ? I think that the pb is located in radiator vcf importation.

Best, Yann.

thierrygosselin commented 5 years ago

Hi Yann, yes it looks like a vcf problem. Could you send by email:

"batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf",
"strata_NS.txt"

or subset of the vcf so that I can reproduce the error...

Best Thierry

thierrygosselin commented 5 years ago

you can generate a subset with this command:

#TERMINAL
head -n 200 batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf > test.assigner.vcf

ydorant commented 5 years ago

Thanks Thierry, I prepared a Repo for you with three data files

VCF (subset of the 500 first lines)
Strata_NS.txt --> all samples
Strata_NS_subset.txt --> 1000 samples per group (North vs. South)

You can find the repository at : https://github.com/ydorant/assigner_bugs Yann.

thierrygosselin commented 5 years ago

Yann everything should work now (it was a radiator problem not assigner)

devtools::install_github("thierrygosselin/radiator")
devtools::install_github("thierrygosselin/assigner")
assigner::install_gsi_sim(fromSource = TRUE)

test1

test1 <- assigner::assignment_ngs(
  data = "lobster_subset_snps.vcf",
  marker.number = c(100, 200, "all"),
  sampling.method = "ranked",
  thl = 0.9,
  iteration.method = 10,
  assignment.analysis = "gsi_sim",
  strata = "strata_NS.txt")

fig <- test1$plot.assignment  + 
  ggplot2::facet_grid(SUBSAMPLE~CURRENT) + 
  ggplot2::scale_y_continuous(limits = c(0,100))
fig

test2 with subsample

test2 <- assigner::assignment_ngs(
  data = "lobster_subset_snps.vcf",
  marker.number = c(100, 200, "all"),#, 1000, 2500, 6000, "all"),
  sampling.method = "ranked",
  thl = 0.9,
  iteration.method = 10,
  assignment.analysis = "gsi_sim",
  subsample = 1000,
  iteration.subsample = 5,
  strata = "strata_NS.txt")

fig <- test2$plot.assignment  + 
  ggplot2::facet_grid(SUBSAMPLE~CURRENT) + 
  ggplot2::scale_y_continuous(limits = c(0,100))
fig

Friendly reminder before running assignment with assigner

assignment analysis should be conducted on a filtered dataset
minimize artifacts
remove patterns of missingness or heterozygosity
remove duplicates/replicates
removing closekin helps with low fst species/pops
discard of individuals with outlier ind. heterozygosity
minimize error rate and heterozygote miscall rate

Individuals with very high coverage will have very high heterozygosity, most of the time. Heterozygous individuals will generate artifacts, they are close to everybody because they have both copy of the SNP alleles, reducing assignment and Fst.

run this:

vcf.qc <- radiator::tidy_vcf(
data = "batch_3_p70_x9_H0.6_y9_no_singletons_4066indv.vcf", 
strata = "strata_NS.txt",
verbose = TRUE,
path.folder = "lobster.vcf.qc")

For more info : check out ??radiator::tidy_vcf

using the tidy dataset you could run these 3 functions:

dup <- radiator::detect_duplicate_genomes(data = vcf.qc$tidy.data)
mix <- radiator::detect_mixed_genomes(data = vcf.qc$tidy.data)
err <- radiator::detect_het_outliers(data = vcf.qc$tidy.data)

Duplicates are usually samples within the range 0-0.25 Closekin are usually samples within the range 0.25-0.75

missing data can create:

false duplicates, but the outcome are the same, one of the 2 in the pair as to go.
lower ind. heterozygosity: if a pattern exist between the groups, this can generate artifactual structure.

thierrygosselin / assigner

Error in cbind_all(x) #13