thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

write_fineradstructure function #76

Closed flojopi closed 4 years ago

flojopi commented 4 years ago

Hi Thierry,

I have DArT data, which I have filtered using the dartR package and now I am using the filtered, final genlight object to try and export it to fineRAD structure input format, but I keep getting the error:

> write_fineradstructure(tidy_rat8rec_reassign, rat8rec_reassign_fineRAD)
Error: Wrong genotype format: nucleotide information is required
try running the function radiator::tidy_genomic_data
Run `rlang::last_error()` to see where the error occurred.

and that was after I applied the tidy_genomic_data function:

> tidy_rat8rec_reassign <- tidy_genomic_data(data =rat8rec_reassign)
################################################################################
######################### radiator::tidy_genomic_data ##########################
################################################################################
Execution date@time: 20200307@1200
Folder created: -386_radiator_tidy_genomic_20200307@1200
Function call and arguments stored in: radiator_tidy_genomic_data_args_20200307@1200.tsv
Tidying the genlight object ...
Calibrating REF/ALT alleles...
Erasing genotype: no
Filters parameters file generated: filters_parameters_20200307@1200.tsv
Filters parameters file: initiated
################################################################################
######################### radiator::filter_monomorphic #########################
################################################################################
Execution date@time: 20200307@1201
Scanning for monomorphic markers...

Computation time, overall: 3 sec
######################### completed filter_monomorphic #########################
################################### RESULTS ####################################
Tidy data written in global environment
Data format: genlight
Biallelic data

Tidy genomic data:
    Number of markers: 17628
    Number of chromosome/contig/scaffold: 1
    Number of individuals: 257

Computation time, overall: 65 sec
######################### completed tidy_genomic_data ##########################

which appears to have worked fine.

> head(tidy_rat8rec_reassign)
# A tibble: 6 x 11
  MARKERS                      CHROM  LOCUS            POS   POP_ID INDIVIDUALS GT     REF   ALT   GT_VCF GT_BIN
  <chr>                        <chr>  <chr>            <chr> <fct>  <chr>       <chr>  <chr> <chr> <chr>   <int>
1 CHROM1__100101837_45_A_T__45 CHROM1 100101837_45_A_T 45    Goat_I 1           001001 A     C     0/0         0
2 CHROM1__100101860_54_A_G__54 CHROM1 100101860_54_A_G 54    Goat_I 1           001001 A     C     0/0         0
3 CHROM1__100102372_25_G_A__25 CHROM1 100102372_25_G_A 25    Goat_I 1           001001 A     C     0/0         0
4 CHROM1__100102406_63_C_T__63 CHROM1 100102406_63_C_T 63    Goat_I 1           001001 A     C     0/0         0
5 CHROM1__100102486_24_G_A__24 CHROM1 100102486_24_G_A 24    Goat_I 1           001001 A     C     0/0         0
6 CHROM1__100102536_55_T_C__55 CHROM1 100102536_55_T_C 55    Goat_I 1           001001 A     C     0/0         0

Is there anything missing in that tidy_rat8rec_assign that the function needs to work? I am not sure about the "nucleotide information is required". Does that mean it wants the whole SNP sequence, because REF and ALT are in that table.

I was also thinking that maybe something is wrong with that genlight object, so I used

raw_dart<- read_dart(data = "dartR_infiles/Report_DRa18-3782_SNP_2.csv",strata="dartR_infiles/dart.strata.tsv")
tidy_rawdart<-tidy_genomic_data(raw_dart)
write_fineradstructure(tidy_rawdart, rawdart_fineRAD)

and I get the same error, so its not like any information got lost during the filtering process.

Thanks, Flo

thierrygosselin commented 4 years ago

Hi Floriaan, you mind sending the genlight object rat8rec_reassign by email ?

thierrygosselin commented 4 years ago

There's a problem with the genlight, not the tidying process, at the step where the GDS file is produced...

looking into it...

thierrygosselin commented 4 years ago
  1. generating the GDS: there's a bug, now it's fix temporarily(It will need to be optimized, because now it requires the tidy.data, instead of just using the genlight(which is faster).

  2. the conversion to fineradstructure...

The function looks for a column in the tidy data: GT_VCF_NUC it's not generated by the tidying function because it can't find the information. The nucleotides in the genlight from dartR is stored in the LOCUS name...

thierrygosselin commented 4 years ago

it's also in slot @loc.all...

thierrygosselin commented 4 years ago

by default it will re-calibrate the alleles REF/ALT

thierrygosselin commented 4 years ago

The problem with your data and the conversion to fineradstructure are:

  1. you don't have haplotypes: so read again the manuscript, it's been a long time, not sure if you gain much compared to stockR...
  2. You seems to have more populations than letters in the alphabet ... The pop are converted to letters...

will have to sleep on this one and think about PLAN B.

Do you have the manual with input format detailed ? I could find one with those details about the letters for the pop... 🧐🤔

thierrygosselin commented 4 years ago

Ok so back on it and found a solution...

It become increasingly difficult for me to follow all the different naming schemes researcher uses, if they're is any strategy... Consequently, I have abandoned the idea of formating with letters the populations to generate the fineRADstrucure file.

Now, a couple of checks will generate errors if bad population ids are detected.

Adding the argument strata in the function will allow people to rename easily the strata, so that I no longer spend time on this ;)

In your case, the problem was: 1) more population than alphabet letters 2) and even if we skip that part and use your pop names, you have 1 pop that start with: _ or -: _Marlborough_Snd

thierrygosselin commented 4 years ago

Should work now with radiator v.1.1.5

Below different test you can make and see if normal you get an error or should pass. It also shows you the different ways to generate the output you want:

data <- radiator::tidy_genlight(
  data = "rat8rec_reassign.rds",
  gds = FALSE, # faster if you only want the fineradstructure file
  verbose = TRUE
  )
# works
test1 <- radiator::write_fineradstructure(data = data)
# error
test2 <- radiator::write_fineradstructure(data = data, strata = "floriaan.new.strata.tsv")
# works
test3 <- radiator::tidy_genlight(
  data = "rat8rec_reassign.rds",
  gds = FALSE
  ) %>%
  radiator::write_fineradstructure(data = ., strata = "floriaan.new.strata.tsv")
# works
test4 <- radiator::genomic_converter(data = "rat8rec_reassign.rds", output = "fineradstructure")
# error
test5 <- radiator::genomic_converter(
  data = "rat8rec_reassign.rds",
  strata = "floriaan.new.strata.tsv",
  output = "fineradstructure"
  )
thierrygosselin commented 4 years ago

FYI

Since it's dartR filtered data I went to have a look, out of curiosity:

  1. weird data, what is it ?
  2. you have duplicate samples
  3. you have sample with weird heterozygosity, remove them or you'll have false close kin generated (easy to see in the dup figures)...
  4. if the strata are the ones you really want, markers are really not in common, generating patterns of missing data = generating false structure...

Bottom line, I think it's premature to run population structure on the dataset as it is...

good luck Re-open the issue if you're still having problem

I'm emailing the strata that I used for the code to work properly Thierry