thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

Error upon name cleaning, gemomic_converter() in R 3.5.1 #34

Closed pbpearman closed 5 years ago

pbpearman commented 5 years ago

I try to convert a .vcf.gz file to tidy genomic data in Rstudio: eu_snps_1 <- genomic_converter( data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", vcf.metadata = TRUE, common.markers = FALSE, strata = "strata_eu.tsv" ) Resulting in this error message and traceback:

` ####################################################################### ##################### radiator::genomic_converter ##################### ####################################################################### Function arguments and values: Working directory: /Volumes/pearman-1/lud11_docs/upv_research/projects/eriogonoideae/eriogonoideae/data/GBS/560_samples_20181029/Erumb1 Input file: UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz Strata: strata_eu.tsv Population levels: no Population labels: no Output format(s): tidy Filename prefix: no Filters: Blacklist of individuals: no Blacklist of genotypes: no Whitelist of markers: no monomorphic.out: TRUE snp.ld: no common.markers: FALSE max.marker: no pop.select: no maf.thresholds: no

Imputations options: imputation.method: no

parallel.core: 15

#######################################################################

Importing data

Show Traceback Error in stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", : object 'input' not found`

  1. stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", ":", " "), replacement = c("-", "-", ""), vectorize_all = FALSE)

  2. radiator::clean_ind_names(input$INDIVIDUALS)

  3. radiator::tidy_genomic_data(data = data, vcf.metadata = vcf.metadata, blacklist.id = blacklist.id, blacklist.genotype = blacklist.genotype, whitelist.markers = whitelist.markers, monomorphic.out = monomorphic.out, max.marker = max.marker, snp.ld = snp.ld, common.markers = common.markers, ...

  4. genomic_converter(data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", vcf.metadata = TRUE, common.markers = FALSE, strata = "strata_eu.tsv")

The files exist: > file.exists("strata_eu.tsv") [1] TRUE

> file.exists("UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz") [1] TRUE

The top of the unzipped .vcf.gz looks like this:

fileformat=VCFv4.0

fileDate=Thu Oct 25 12:00:44 2018

source=GBS-SNP-CROP

phasing=partial

INFO=

INFO=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AC3894 AC3895 AC3896...

Erumb1_s00000011 30280 . C T 40 PASS . GT:DP:AD ./.:0:.,. ./.:0:.,. ./.:0:.,. .....

The top of the tab separated, 561-line strata file looks like this: INDIVIDUALS STRATA FLOWCELL variety2 MACHINE year AC3894 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3895 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3896 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3897 PBP1 C9RB9ACXX(2101) munzii 26B 2014 AC3898 PBP1 C9RB9ACXX(2*101) munzii 26B 2014.....

The FLOWCELL names all have an "*" even though only the last one appears here.

It looks to me like everything is there that should be. Maybe I am just doing something wrong. I am on this R:

version _
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02) nickname Feather Spray

And RStudio version 1.0.153

pbpearman commented 5 years ago

Hi Thierry, Here is some more info and some more results. The .vcf.gz files I have were produced with GBS-SNP-CROP. The original files is large, 560 samples and 261,000 sites with coverage >5. I have two versions, of these data, one from alignment to a draft genome and one de novo. The file above was the one from the genome. I tried the de novo version to see if the outcome would be different from what I described above. To give the story away, it wasn't. What I did was to run the .vcf.gz through vcftools with --minDP 6, just to make sure I had what I wanted.

Using out.lmiss from vcftools, I got a table of SNPs for which at least 50% of the samples are covered, and I used that in the vcftools option --positions to prune the data set by sample coverage. I bgzipped the resulting .vcf and this .vcf.gz file works in missing_visualization(). Using it in genomic_converter(), I get the same error message as above: Error in stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", : object 'input' not found

So, I added a print statement to genomic_converter() to retrieve the variable data.type during the run of the function. This produced: [1] "the data type is \037\x8b\b\004" So, something isn't being read correctly. I then tried to run the modified function on the decompressed file (.vcf). This produced a lot of temporary files, followed by RStudio crash. The output that was recorded before the crash was as follows, and I note that the file type was recognized correctly:

####################################################################### ##################### radiator::genomic_converter ##################### ####################################################################### Function arguments and values: Working directory: /Volumes/pearman-1/lud11_docs/upv_research/projects/eriogonoideae/eriogonoideae/data/GBS/560_samples_20181029/mock/test Input file: UMBELLA_Mock_samples_gt_50pct_covered.recode.vcf Strata: strata_eu.tsv Population levels: no Population labels: no Output format(s): tidy Filename prefix: no Filters: Blacklist of individuals: no Blacklist of genotypes: no Whitelist of markers: no monomorphic.out: TRUE snp.ld: no common.markers: FALSE max.marker: no pop.select: no maf.thresholds: no

Imputations options: imputation.method: no

parallel.core: 6

####################################################################### [1] "the data type is vcf.file"

Importing data

Reading VCF... Generated a filters parameters file: filters_parameters_20181129@1800.tsv Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML' Also defined by ‘S4Vectors’ Loading required package: gdsfmt

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt Loading required package: gdsfmt

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

The following object is masked from ‘package:stringr’:

fixed

Attaching package: ‘SeqArray’

The following object is masked from ‘package:stringr’:

fixed

Number of SNPs: 13311 Number of samples: 560

conversion timing: 9 sec ##################################################### Again, at this point RStudio crashed. Any ideas?

pbpearman commented 5 years ago

I wondered if converting to plink and importing that into genomic_converter would produce a different result. Using the genome-aligned version, I made a chromosome map file with bcftools view and awk and then used --plink and the map to produce .ped and .bed files. I converted these to .bed, .bim, and .fam files, which are all in the same folder. I modified the .fam file to show the strata, and converted to .tped and .tfam files with: plink --bed data_b.bed --bim data_b.bim --fam data_b.fam --allow-extra-chr --out data --recode transpose The first line of the resulting .tped file starts out: Erumb1_s00000011 Erumb1_s00000011:30280 0 30280 0 0 0 0 0 0 0 0 T T T T T T T T 0 0 C... Then: eu_snps_2 <- converter2( data = "data.tped", common.markers = FALSE, parallel=6 ) produces: ####################################################################### ##################### radiator::genomic_converter ##################### ####################################################################### Function arguments and values: Working directory: /Volumes/pearman-1/lud11_docs/upv_research/projects/eriogonoideae/eriogonoideae/data/GBS/560_samples_20181029/mock/test Input file: data.tped Strata: no Population levels: no Population labels: no Output format(s): tidy Filename prefix: no Filters: Blacklist of individuals: no Blacklist of genotypes: no Whitelist of markers: no monomorphic.out: TRUE snp.ld: no common.markers: FALSE max.marker: no pop.select: no maf.thresholds: no

Imputations options: imputation.method: no

parallel.core: 6

####################################################################### [1] "the data type is Erumb1_s00000011"

Importing data

Error in stringi::stri_replace_allfixed(str = as.character(x), pattern = c("", : object 'input' not found

The value for data.type is the value of the entry in the first field of the first line of the .tped file.

thierrygosselin commented 5 years ago

Hi Peter, I've been doing lots of changes recently and I think the problem stem from a chunk of code that was sync that shouldn't be there. Anyway, I'm about to push a new commit and from there will see if you get the error.

Thanks for your patience Thierry

pbpearman commented 5 years ago

Hi Thierry, Here is an update. I went back and tried to repeat the earlier steps in my R Markdown file to the point of running again the initial Grur::missingness_visualization() that worked previously with the .vcf.gz file. It doesn't complete and I get the same "object 'input' not found error" when running missingness_visualization() as I do with genomic_converter(). In the intervening time, I installed the imputation tools in the rad_genomics_computer_setup guide. Previously, I had not installed those tools. Just before installing the tools, I updated all R packages in need of it. During the setup, the only problem that arose was with installation of LightGBM, which was resolved after submitting an issue.

I also tried running those functions with a different version of the .vcf.gz file that I had not touched since the initial successful run of missingness_visualization(). I get the same error message. So I don't think that the .vcf.gz file that I have been working with is corrupted.

Of course, I don't know much about the internal workings of either function, but the messages produced when trying the use the .vcf file, in which SeqArray and Stringr are listed, suggests that there is a compatibility issue with some other function. You'll know better than I do. I would be happy to provide you with any info or files you could find helpful.

pbpearman commented 5 years ago

The problem of crashes when reading .vcf files, both mine and the tutorial file, continued on the command line and in a terminal R session. Upgrade of RStudio didn't help. The problem disappeared with removal of R and RStudio from the system and new installation. No idea what that was about, but it seems to be resolved.