thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

tidy_genind() sample bug #153

Closed ntakebay closed 1 year ago

ntakebay commented 2 years ago

With radiator 1.1.9 2020-12-12, I noticed a small bug in tidy_genind() function in genind.R https://github.com/thierrygosselin/radiator/blob/master/R/genind.R.

Line 146-147 is like this:

A2 <- TRUE %in% unique(stringi::stri_detect_fixed(str = sample(colnames(data@tab), 100), pattern = ".A2")) if (isTRUE(TRUE %in% unique(stringi::stri_detect_fixed(str = sample(colnames(data@tab), 100), pattern = ".A2__")))) {

The problem occurs when the number of columns in tab of genInd object is less than 100;it causes the following error:

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

It is better to replace "str = sample(..)" with something like this:

str = sample(colnames(data@tab), min(100, ncol(data@tab)))

I was playing with a small test data, and noticed this bug.

I'm running it on R version 4.1.2 on x86_64 linux.

Thank you, Naoki

ntakebay commented 2 years ago

Here is another bug when @pop of genInd is NULL.

l. 68-72 is like this:

if (is.null(data@pop)) { if (verbose) message(" strata: no") if (verbose) message(" 'pop' will be added") strata %<>% dplyr::mutate(STRATA = "pop") }

Before the closing curly braces, you can add:

pop(data) <- rep("pop",nInd(data))

And it works.

Here is the reproducible code to trip these two bugs:

library(adegenet) library(radiator) test <- data.frame(loc1=c("a/a","a/t", "g/t"),loc2=c("g/c",NA,"c/c"),loc3=c("g/c","g/g","c/c")) rownames(test)<-c("ind1","ind2","ind3")

tgi <- df2genind(test,sep="/")

tidy_genind(tgi)

ntakebay commented 2 years ago

Finally, there is an additional bug, when data is biallelic (the example used above isn't biallelic) and gds=T.

Here is the reproducible code:

library(adegenet) library(radiator) test <- data.frame(loc1=c("a/a","a/t", "t/t"),loc2=c("g/c",NA,"c/c"),loc3=c("g/c","g/g","c/c")) rownames(test)<-c("ind1","ind2","ind3")

tgi <- df2genind(test,sep="/")

tidy_genind(tgi)

end of code

It core dumped with the following error message:

No sample in '/scratch/test/radiator_temp_20220301@1454.vcf' /usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = std::cxx11::basic_string; _Alloc = std::allocator<std::cxx11::basic_string >; std::vector<_Tp, _Alloc>::reference = std::cxx11::basic_string&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion 'n < this->size()' failed. Aborted (core dumped)

Line 127 of genind.R causes the problem:

gds.filename <- radiator_gds(
  data.source = "genind",
  genotypes = geno,
  strata = strata,
  biallelic = TRUE,
  markers.meta = markers.meta,
  filename = NULL,
  verbose = verbose
)

When I look at gds.R, line 53 is like this:

SeqArray::seqVCF2GDS( vcf.fn = paste0(temp.file, ".vcf"), out.fn = filename, parallel = 1L, storage.option = "ZIP_RA", verbose = FALSE )

And this is the place where it core dumped. SeqArray package is version 1.34.0.

thierrygosselin commented 1 year ago

Hi Naoki,

For now I'll keep things working for biallielic data with genind. There are so many flavours in multi allelic that are beyond the scope of radiator.

As for the sample size, I've modified the code to account for this, although I'm not sure it's going to work well in the other function. Again, radiator was design for large RAD data, ideally VCFs with read depth info. genind or genlight are not the ideal format, but I understand the need to be able to run DACP etc...

I'll see what I can do for the next push

Best Thierry

thierrygosselin commented 1 year ago

I'm also surprised that you would have a genind object without populations... usually adegenet is used for population analysis...