thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

Stacks vcf to rubias #161

Closed Imogen-D closed 1 year ago

Imogen-D commented 2 years ago

Hi,

Apologies if I miss anything out of this issue you need - I am still a relative newbee to coding!

I am trying to convert a vcf file to a rubias file. I have produced two vcfs from stacks, using the same population map, which have then been concatenated (with vcf tools).

The read vcf appears to run correctly, however write rubias throws an error. At the moment I am giving it a strata file for read vcf which has INDIVIDUALS and STRATA (stratain) as well as a different strata file for write rubias which has SAMPLE_TYPE, REPUNIT, COLLECTION, INDIVIDUALS (stratarub).

My log output file, script and strata files are attached. However, the vcf is very large so I am unsure how to attach it and takes a long time to subset, let me know what I can do about this.

Main Commands (with parallel core required as per some previous issues, which fixed an initial issue) - the rest in script. ` gdsframe <- read_vcf(vcf, strata=stratain, filename="rubias90filt", parallel.core=1L)

write_rubias(gdsframe, strata=stratarub, filename="rubias90filt", parallel.core=1L)`

Error below:

[############################## completed read_vcf ##############################
Error in if (length(markers) != ncol(data)) { :
  argument is of length zero
Calls: write_rubias
Execution halted](url)

devtools::sessioninfo()


─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Paris
 date     2022-06-17
 pandoc   NA

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 brio          1.1.3   2021-11-30 [1] CRAN (R 4.1.2)
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.2)
 callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
 cli           3.3.0   2022-04-25 [1] CRAN (R 4.1.2)
 crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.2)
 desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
 devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.2)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.1.2)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.2)
 pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.2)
 processx      3.5.3   2022-03-25 [1] CRAN (R 4.1.2)
 ps            1.7.0   2022-04-23 [1] CRAN (R 4.1.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
 rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.2)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.1.2)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
 testthat      3.1.4   2022-04-26 [1] CRAN (R 4.1.2)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
`

Thank you so much!

Imogen-D commented 2 years ago

Update:

Running on a smaller subset of data in an interactive session gives a clearer error that appears to be with the stacks input. Help would be much appreciated.

`################################################################################
######################### radiator::genomic_converter ##########################
################################################################################
Execution date@time: 20220623@1130
Folder created: 12_radiator_genomic_converter_20220623@1130
Function call and arguments stored in: radiator_genomic_converter_args_20220623@1130.tsv
Filters parameters file generated: filters_parameters_20220623@1130.tsv

Importing data: vcf.file

Reading VCF...

Data summary:
    number of samples: 17
    number of markers: 4

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 0 / 1 / 1

Strata with low sample size detected: fig <- FALSE

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 0 / 2 / 2

Generating individual stats...
Generating markers stats...
[==================================================] 100%, completed, 0s
[==================================================] 100%, completed, 0s

Number of chromosome/contig/scaffold: 1
Number of locus: 1
Number of markers: 1
Number of strata: 3
Number of individuals: 17

Number of ind/strata:
Y = 15
S = 1
D = 1

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20220623@1130.gds

Genotypes formats generated with 1 SNPs:
    GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE
    GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE
    GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE
    GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
LOCUS field empty... adding unique id instead
HQ values are all missing: removing column

Tidy data file written: radiator_20220623@1130.rad

Computation time, overall: 6 sec

Writing tidy data set:
practising.rad

Preparing data for output

Data is bi-allelic
Generating rubias output
Error in data.table::melt.data.table(data = ., id.vars = cols, measure.vars = measure_vars,  :
  One or more values in 'id.vars' is invalid.
Calls: genomic_converter ... %<>% -> <Anonymous> -> %>% -> <Anonymous> -> <Anonymous>
In addition: Warning messages:
1: Unknown or uninitialised column: `LOCUS`.
2: Unknown or uninitialised column: `VARIANT_ID`.

Computation time, overall: 8 sec
######################### completed genomic_converter ##########################
Execution halted`
thierrygosselin commented 1 year ago

Sorry for the long delay Do you mind sharing the vsf so that I can fix the bug ? I'll reopen the issue when I have something to work with.

quinn-ca commented 1 year ago

I'm getting the same error trying to convert to rubias; my vcf (generated in Stacks) and strata files are attached! vcf_autosomal_genstr_filtered_maxMeanDP49_rmRelated_snps.vcf.zip sampleIDs_strata_rmRelated_maxMeanDP49.txt

I've tried the following recommendations in the question: https://github.com/thierrygosselin/radiator/issues/177 (substituting rubias format). The errors I get depending on the conversion code are:

Error in data.table::melt.data.table(data = ., id.vars = cols, measure.vars = measure_vars, : One or more values in 'id.vars' is invalid.

Error in if (length(markers) != ncol(data)) { : argument is of length zero

quinn-ca commented 1 year ago

I realized my strata file wasn't formatted correctly for rubias, so the revised file is attached. This doesn't change the error and I tried the conversion without a strata file (to no avail), but thought the correct file would be helpful if the conversion gets up and running. Thanks! rubias_strata_file.txt