frederikvand commented 3 years ago

Dear radiator admins,

currently I am unable to use any of my snp formats in radiator. Other outlier analyses from other packages work fine on my vcf, plink (.bed), genlight and genind formats. However, I would like to perform a bayescan in radiator.

Would you have any clue on why the radiator package is unable to read or transform my data formats?

from vcf

     Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx,  : One of the nodes produced an error: 
     FORMAT ID 'DPR' (Number=A) should have 1 value(s) but receives 2, please consider revising it to 'Number=.'.
                                                    FILE: 08_GMCF5_subset1_2020_11_23.vcf
                                                    LINE: 630, COLUMN: 10, 0/1:99.4743

These errors are the same when I use read_vcf but I never have a problem when I ready my data with the vcfR package.

from genind

     Error: New columns must be compatible with `.data`.
    x New column has 0 rows.
    i `.data` has 285 rows.

My genind and genlight formats work fine in the adegenet package.

from genlight

   Error: Can't rename columns that don't exist.
   x Column `ALT` doesn't exist.

from .bed

   Error in SeqArray::seqGetData(gdsfile = data, var.name = "$ref") : The GDS node "$ref" does not exist.

My .bed files work fine in the PCAdapt package.

Would you have any insights on how to transform my data to the bayescan format? I would prefer to stay within R.

With kind regards, Frederik Van Daele

thierrygosselin commented 3 years ago

Dear Frederik Sorry for the bug you're experiencing

I'll have a look in a couple of hours

Thierry

frederikvand commented 3 years ago

Thank you! Included you can find a subset of my vcf and .bed dataset to reproduce the issue.

05_filter_subset_5_RD5_7pop_50percind.zip

frederikvand commented 3 years ago

As I was unable to get my files converted to a bayescan format, I wrote a script to extract the neccesary information from my genlight objects. As radiator is also unable to read the text file as input I made a .bat script for windows to run BayeScan command line in parallel. The .bat script can then run on powershell in R with system("powershell -command E:/02_R/11_windows_shell/B_02_bayescan_powershell.bat").

thierrygosselin commented 3 years ago

On it today...

thierrygosselin commented 3 years ago

So the problem was from SeqArray not reading correctly the VCF file produced by freeBayes, it's now fixed and will be in the next release today (v. 1.1.9)

thierrygosselin commented 3 years ago

Re-open if after testing you still have an issue

Keep in mind that to use for bayescan you need to supply a strata file that highlight the grouping of the data (populations, etc.)

thierrygosselin commented 3 years ago

I tested this with the data you sent:

data <- radiator::read_vcf(data = "08_GMCF5_subset5_2020_11_23.vcf")
test1 <- radiator::detect_duplicate_genomes(data)
test2 <- radiator::detect_mixed_genomes(data)

I couldn't test the bed file because I didn't have the corresponding fam and bim files.

You might want to do the same and look at the figures produced by test1 and test2... don't waist your time with BayeScan with this current dataset, very basically:

you have duplicates (or technical replicates)
you have way too many missing data per individuals (not markers)

This will bias everything in downstream analysis...

Look at this function to filter your dataset

Best Thierry

thierrygosselin / radiator

Unable to convert or import any format to tidy_genomic_data or bayescan format. #100

from vcf

from genind

from genlight

from .bed