pbpearman commented 5 years ago

I try to convert a .vcf.gz file to tidy genomic data in Rstudio: eu_snps_1 <- genomic_converter( data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", vcf.metadata = TRUE, common.markers = FALSE, strata = "strata_eu.tsv" ) Resulting in this error message and traceback:

` ####################################################################### ##################### radiator::genomic_converter ##################### ####################################################################### Function arguments and values: Working directory: /Volumes/pearman-1/lud11_docs/upv_research/projects/eriogonoideae/eriogonoideae/data/GBS/560_samples_20181029/Erumb1 Input file: UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz Strata: strata_eu.tsv Population levels: no Population labels: no Output format(s): tidy Filename prefix: no Filters: Blacklist of individuals: no Blacklist of genotypes: no Whitelist of markers: no monomorphic.out: TRUE snp.ld: no common.markers: FALSE max.marker: no pop.select: no maf.thresholds: no

Imputations options: imputation.method: no

parallel.core: 15

#######################################################################

Importing data

Show Traceback Error in stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", : object 'input' not found`

stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", ":", " "), replacement = c("-", "-", ""), vectorize_all = FALSE)
radiator::clean_ind_names(input$INDIVIDUALS)
radiator::tidy_genomic_data(data = data, vcf.metadata = vcf.metadata, blacklist.id = blacklist.id, blacklist.genotype = blacklist.genotype, whitelist.markers = whitelist.markers, monomorphic.out = monomorphic.out, max.marker = max.marker, snp.ld = snp.ld, common.markers = common.markers, ...
genomic_converter(data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", vcf.metadata = TRUE, common.markers = FALSE, strata = "strata_eu.tsv")

The files exist: > file.exists("strata_eu.tsv") [1] TRUE

> file.exists("UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz") [1] TRUE

The top of the unzipped .vcf.gz looks like this:

fileformat=VCFv4.0

fileDate=Thu Oct 25 12:00:44 2018

source=GBS-SNP-CROP

phasing=partial

INFO=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AC3894 AC3895 AC3896...

Erumb1_s00000011 30280 . C T 40 PASS . GT:DP:AD ./.:0:.,. ./.:0:.,. ./.:0:.,. .....

The top of the tab separated, 561-line strata file looks like this: INDIVIDUALS STRATA FLOWCELL variety2 MACHINE year AC3894 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3895 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3896 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3897 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3898 PBP1 C9RB9ACXX(2*101) munzii 26B 2014.....

The FLOWCELL names all have an "*" even though only the last one appears here.

It looks to me like everything is there that should be. Maybe I am just doing something wrong. I am on this R:

version _
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02) nickname Feather Spray

And RStudio version 1.0.153

pbpearman commented 5 years ago

Hi Thierry, Here is some more info and some more results. The .vcf.gz files I have were produced with GBS-SNP-CROP. The original files is large, 560 samples and 261,000 sites with coverage >5. I have two versions, of these data, one from alignment to a draft genome and one de novo. The file above was the one from the genome. I tried the de novo version to see if the outcome would be different from what I described above. To give the story away, it wasn't. What I did was to run the .vcf.gz through vcftools with --minDP 6, just to make sure I had what I wanted.

Using out.lmiss from vcftools, I got a table of SNPs for which at least 50% of the samples are covered, and I used that in the vcftools option --positions to prune the data set by sample coverage. I bgzipped the resulting .vcf and this .vcf.gz file works in missing_visualization(). Using it in genomic_converter(), I get the same error message as above: Error in stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", : object 'input' not found

So, I added a print statement to genomic_converter() to retrieve the variable data.type during the run of the function. This produced: [1] "the data type is \037\x8b\b\004" So, something isn't being read correctly. I then tried to run the modified function on the decompressed file (.vcf). This produced a lot of temporary files, followed by RStudio crash. The output that was recorded before the crash was as follows, and I note that the file type was recognized correctly:

####################################################################### ##################### radiator::genomic_converter ##################### ####################################################################### Function arguments and values: Working directory: /Volumes/pearman-1/lud11_docs/upv_research/projects/eriogonoideae/eriogonoideae/data/GBS/560_samples_20181029/mock/test Input file: UMBELLA_Mock_samples_gt_50pct_covered.recode.vcf Strata: strata_eu.tsv Population levels: no Population labels: no Output format(s): tidy Filename prefix: no Filters: Blacklist of individuals: no Blacklist of genotypes: no Whitelist of markers: no monomorphic.out: TRUE snp.ld: no common.markers: FALSE max.marker: no pop.select: no maf.thresholds: no

Imputations options: imputation.method: no

parallel.core: 6

####################################################################### [1] "the data type is vcf.file"

Importing data

Reading VCF... Generated a filters parameters file: filters_parameters_20181129@1800.tsv Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Loading required package: gdsfmt

Attaching package: ‘SeqArray’