Closed IdoBar closed 4 years ago
Hi Ido, let me have a look...
1. the strata ... worked:
dart_strata <- readr::read_csv(file = "DArT_metadata.csv") %>%
dplyr::select(TARGET_ID = id, STRATA = State) %>%
dplyr::mutate(INDIVIDUALS = TARGET_ID)
You can verify that it's actually those in the DArT files using:
target.id <- radiator::extract_dart_target_id(data = "Report_DAsc19-4353_1_moreOrders_SNP_2.csv")
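Once you have both sets of IDs, a quick set comparison shows any mismatch. This is only a sketch in base R: the two vectors below are made-up stand-ins for the TARGET_ID column of the strata and the IDs extracted from the DArT file, not real data.

```r
# Hypothetical IDs standing in for dart_strata$TARGET_ID and the IDs found in the DArT file
strata_ids <- c("S1", "S2", "S3", "S4")
dart_ids   <- c("S1", "S2", "S4", "S5")

# IDs listed in the strata but absent from the DArT file, and the reverse
missing_in_dart   <- setdiff(strata_ids, dart_ids)
missing_in_strata <- setdiff(dart_ids, strata_ids)

missing_in_dart
missing_in_strata
```

Both results should be empty when the strata and the DArT file agree.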
Note: Having / in your sample names might present some challenges in some packages/software. I don't remember testing it in radiator, so you might want to keep an eye on this...
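One way to keep an eye on this is to flag the affected IDs before importing. A minimal sketch in base R, with made-up sample IDs; replacing / with _ is just one possible choice, not something radiator requires.

```r
# Hypothetical sample IDs; in practice these would come from the strata's INDIVIDUALS column
ids <- c("AR0039", "F17191/1", "TR9571-B", "15CUR002/2")

# fixed = TRUE treats "/" as a literal character rather than a regular expression
has_slash <- grepl("/", ids, fixed = TRUE)
ids[has_slash]

# Optional: swap "/" for "_" and confirm the cleaned IDs are still unique
clean_ids <- gsub("/", "_", ids, fixed = TRUE)
stopifnot(!anyDuplicated(clean_ids))
```

If you do rename, TARGET_ID presumably still has to match the DArT file, so only INDIVIDUALS would change.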
2. silico dart data
The command below works
dir.create("output")
Arab_tidy_dart <- radiator::read_dart(
data = "Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv",
strata = dart_strata,
path.folder = "output",
tidy.dart = TRUE
)
Note: Not sure why you're using the output of that function as the input of radiator::filter_rad, but it won't work.
3. filter_rad
Not sure what your intention is in using the silico dart data inside filter_rad, but it won't work. It was not designed for this; I'll update the doc.
Ideally with DArT data you want the count file (it's not automatically given by DArT, you have to ask). It's a file with coverage for both alleles, similar to a VCF file with read depth info.
What you have in the folder is a genotype file with presence/absence in 2 rows and no count info. It works, but it's limited for QC filtering.
Below works for me:
Arab_dart <- radiator::filter_rad(
data = "Report_DAsc19-4353_1_moreOrders_SNP_2.csv",
strata = dart_strata,
output = c("genind"),
path.folder = "output"
)
Using radiator v.1.1.3 (I've just pushed a new version with new function detect_microsatellites)
Cheers Thierry
Thanks for the prompt reply.
It actually didn't work with DArT SNPs either, but I tested it now with the most recent build (and removed all the previous directories created by radiator) and it worked.
I also tried to convert the silico-dart table to a genind object and it failed.
Trying to convert straight from the silico-dart file:
Arab_genind <- genomic_converter("data/Report-DAsc19-4353_ArabieiPlate3/Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv",
strata = dart_strata, output = c("genind"))
################################################################################
######################### radiator::genomic_converter ##########################
################################################################################
Execution date@time: 20200122@1324
Folder created: -118_radiator_genomic_converter_20200122@1324
Function call and arguments stored in: radiator_genomic_converter_args_20200122@1324.tsv
Filters parameters file generated: filters_parameters_20200122@1324.tsv
Importing data
Error in generate_strata(input, pop.id = TRUE) : object 'input' not found
Computation time, overall: 0 sec
Computation time, overall: 0 sec
######################### completed genomic_converter ##########################
Trying to read it in as Tidy format first and then convert to genind
Arab_tidy_dart <- read_dart("data/Report-DAsc19-4353_ArabieiPlate3/Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv",
strata = dart_strata, output = c("genind"),
tidy.dart = TRUE, path.folder = "output")
Reading DArT file...
Number of blacklisted samples: 0
DArT SNP format: silico DArT
Synchronizing data and strata...
Number of strata: 6
Number of individuals: 281
Number of clones: 1841
Number of populations: 6
Number of individuals: 281
Number of ind/pop:
SA = 86
VIC = 58
WA = 48
SPAIN = 2
NSW = 65
QLD = 22
Number of duplicate id: 0
Computation time, overall: 2 sec
Arab_genind <- genomic_converter(Arab_tidy_dart,
strata = dart_strata, output = c("genind"))
################################################################################
######################### radiator::genomic_converter ##########################
################################################################################
Execution date@time: 20200122@0956
Folder created: -119_radiator_genomic_converter_20200122@0956
Function call and arguments stored in: radiator_genomic_converter_args_20200122@0956.tsv
Filters parameters file generated: filters_parameters_20200122@0956.tsv
Importing data
Synchronizing data and strata...
Number of strata: 6
Number of individuals: 281
Calibrating REF/ALT alleles...
The separator specified is not valid
Computation time, overall: 2 sec
Computation time, overall: 2 sec
Writing tidy data set:
radiator_data_20200122@0956.rad
Computation time, overall: 304 sec
Preparing data for output
Scanning for number of alleles per marker...
Data is multi-allelic
Generating adegenet genind object
Error in .local(.Object, ...) :
more than one '.' in column names; please name column as [LOCUS].[ALLELE]
In addition: There were 28 warnings (use warnings() to see them)
Computation time, overall: 315 sec
######################### completed genomic_converter ##########################
Thanks for looking into it.
Hi Ido, the silico dart data won't work in those functions, and it's certainly not a format you can transform into a genind object. I suggest you read the DArT doc about the different formats.
Currently, besides reading and tidying the silico dart data, the only function in radiator that accepts silico dart as input (I'll double check that) is the sex markers function.
Why can't you transform them into a genind object?
They can fit the PA (present/absent) type of markers (look at the relevant section in the adegenet basics tutorial).
I'll write a function for converting them from a tidy format to genind and will share it if you're interested.
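For reference, the tutorial's approach boils down to something like the sketch below. It assumes the data have already been reshaped into an individuals-by-clones 0/1 matrix; the matrix, the population labels and the object names are all made up, and the reshaping from radiator's tidy format is not shown.

```r
# Hypothetical presence/absence matrix: rows are individuals, columns are clones
pa <- matrix(
  c(1, 0, 1,
    0, 0, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), c("clone1", "clone2", "clone3"))
)
pops <- factor(c("SA", "SA", "WA"))

# Only attempted when adegenet is installed
if (requireNamespace("adegenet", quietly = TRUE)) {
  # type = "PA" declares presence/absence markers; ploidy = 1 follows the adegenet tutorial
  pa_genind <- adegenet::genind(tab = pa, pop = pops, ploidy = 1, type = "PA")
  print(pa_genind)
}
```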
adegenet has very limited use for them, and I have no idea what to do with presence/absence data besides the sex markers function...
radiator::filter_rad requires alleles and genotypes to do the filtering.
At one point I was developing the filtering functions to use genotype likelihoods (GL/PL fields in VCF) and stopped, because most (~99%) of the datasets I saw using these were from low-coverage experiments and were of very low quality. When you're doing this, you must have 100% control of the wet lab steps and be experienced in bioinformatics; at that point radiator is not really relevant.
Go for counts DArT data if you can for filter_rad, and if it's impossible for you to get it, genotypes in 2 rows is ok.
Below is just an example of the limit of presence/absence:
However, it is clear that the usual Euclidean distance (used in PCA and sPCA), as well as many other distances, is not as accurate to measure genetic dissimilarity using presence/absence data as it is when using allele frequencies. The reason for this is that in presence/absence data, a part of the information is simply hidden. For instance, two individuals possessing the same allele will be considered at the same distance, whether they possess one or more copies of the allele. This might be especially problematic in organisms having a high degree of ploidy.
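The point in that quote can be reproduced with a toy example. This is a sketch in base R with invented allele counts for three diploid individuals, not data from the dataset discussed here.

```r
# Hypothetical allele counts (copies of one allele per locus) for three diploid individuals
counts <- rbind(
  ind1 = c(2, 2, 0),
  ind2 = c(1, 1, 0),
  ind3 = c(0, 0, 2)
)

# Collapse to presence/absence: any number of copies becomes 1
pa <- (counts > 0) * 1

# Euclidean distances on counts: ind1 (homozygous) and ind2 (heterozygous) are separated
d_counts <- as.matrix(dist(counts))

# Euclidean distances on presence/absence: ind1 and ind2 become indistinguishable
d_pa <- as.matrix(dist(pa))

d_counts["ind1", "ind2"]
d_pa["ind1", "ind2"]
```

With counts the two individuals are a non-zero distance apart, while with presence/absence their distance is 0.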
Hi Thierry,
I've come back to radiator to process some DArT data and I'm glad to see how it has matured over the last couple of years.
I did come across a potential bug when filtering the data (both SNP and silico-dart). I could read in the data without a problem with read_dart(), but filtering raises an error:
This gives the following error:
The same error occurs if I use the SNPs rather than the silicoDArT markers. This is the output of session_info():
Dataset included below:
DArT_set.zip