thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

Radiator DArT Counts error #146

Closed lukelloydjones closed 2 years ago

lukelloydjones commented 2 years ago

Dear Thierry,

We are currently trying to QC a DArT 'counts' file. We are getting the following error at the radiator::filter_individuals step. We think we have a problem with our setup re. package version etc. as Pierre Feutry (not sure you know him but I think so) can run it on his machine. We have spent a bit of time tracing the error but at point where think we may need some help.

Any help would be most appreciated.

Kind regards,

Luke

https://www.dropbox.com/sh/lfxjn3ki4i8e8f1/AADjy1s8ik0oR9X70tGEe2RCa?dl=0 dart.pth <- "~/Dropbox/CSIRO/ckmr/crocodiles/summer_project/analysis/data/raw_dart/counts_format/SEQ_SNPs_counts_0_Target-two_part_id.csv" sta4 <- "~/Dropbox/CSIRO/ckmr/crocodiles/summer_project/analysis/data/radiator/strata/croc_strata_counts.tsv" croc <- filter_rad(data = dart.pth, strata = sta4, output = "genind", filename = "crocs") Step 1. Visualization of samples QC Error: Problem with `filter()` input `..1`. Input `..1` is `MARKERS %in% markers`. x object 'MARKERS' not found Run `rlang::last_error()` to see where the error occurred. > rlang::last_error() Problem with `filter()` input `..1`. Input `..1` is `MARKERS %in% markers`. x object 'MARKERS' not found Backtrace: 1. radiator::filter_rad(...) 13. base::.handleSimpleError(...) 14. dplyr:::h(simpleError(msg, call)) Run `rlang::last_trace()` to see the full context. > rlang::last_trace() Problem with `filter()` input `..1`. Input `..1` is `MARKERS %in% markers`. x object 'MARKERS' not found Backtrace: █ 1. ├─radiator::filter_rad(...) 2. │ └─radiator::filter_individuals(...) 3. │ └─radiator::generate_id_stats(...) 4. │ └─radiator::extract_coverage(gds, markers = FALSE) 5. │ └─radiator::extract_genotypes_metadata(...) 6. │ └─`%<>%`(...) 7. ├─dplyr::filter(., MARKERS %in% markers, INDIVIDUALS %in% individuals) 8. ├─dplyr:::filter.data.frame(...) 9. │ └─dplyr:::filter_rows(.data, ..., caller_env = caller_env()) 10. │ ├─base::withCallingHandlers(...) 11. │ └─mask$eval_all_filter(dots, env_filter) 12. ├─MARKERS %in% markers 13. └─base::.handleSimpleError(...) 14. └─dplyr:::h(simpleError(msg, call)) devtools::session_info() ─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 4.1.2 (2021-11-01) os macOS Big Sur 11.6.2 system x86_64, darwin17.0 ui RStudio language (EN) collate en_AU.UTF-8 ctype en_AU.UTF-8 tz Australia/Brisbane date 2021-12-22 rstudio 2021.09.1+372 Ghost Orchid (desktop) pandoc NA ─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date (UTC) lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0) backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.0) BiocGenerics 0.40.0 2021-10-26 [1] Bioconductor BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.0) Biostrings 2.62.0 2021-10-26 [1] Bioconductor bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0) bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0) bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0) broom 0.7.10 2021-10-31 [1] CRAN (R 4.1.0) cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0) callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0) cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.0) colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0) crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.0) data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.0) DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.2) desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.0) devtools 2.4.3 2021-11-30 [1] CRAN (R 4.1.0) digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0) dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0) ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0) fst 0.9.4 2020-08-27 [1] CRAN (R 4.1.0) gdsfmt 1.30.0 2021-10-26 [1] Bioconductor generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.0) GenomeInfoDb 1.30.0 2021-10-26 [1] Bioconductor GenomeInfoDbData 1.2.7 2021-12-20 [1] Bioconductor GenomicRanges 1.46.1 2021-11-18 [1] Bioconductor ggplot2 3.3.5 2021-06-25 [1] CRAN (R 4.1.0) glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.0) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0) HardyWeinberg 1.7.4 2021-11-26 [1] CRAN (R 4.1.0) hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.0) IRanges 2.28.0 2021-10-26 [1] Bioconductor labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0) lattice 0.20-45 2021-09-22 [1] CRAN (R 4.1.2) lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0) mice 3.14.0 2021-11-24 [1] CRAN (R 4.1.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0) nnet 7.3-16 2021-05-03 [1] CRAN (R 4.1.2) pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.0) pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.2) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0) processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0) ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0) radiator * 1.2.2 2021-12-20 [1] Github (thierrygosselin/radiator@20679cc) Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0) RCurl 1.98-1.5 2021-09-17 [1] CRAN (R 4.1.0) readr 2.1.1 2021-11-30 [1] CRAN (R 4.1.0) remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0) rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.0) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0) Rsolnp 1.16 2015-12-28 [1] CRAN (R 4.1.0) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) S4Vectors 0.32.3 2021-11-21 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0) SeqArray 1.34.0 2021-10-26 [1] Bioconductor sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0) SNPRelate 1.28.0 2021-10-26 [1] Bioconductor stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.0) testthat 3.1.1 2021-12-03 [1] CRAN (R 4.1.0) tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.0) tidyr 1.1.4 2021-09-27 [1] CRAN (R 4.1.0) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) truncnorm 1.0-8 2018-02-27 [1] CRAN (R 4.1.0) tzdb 0.2.0 2021-10-27 [1] CRAN (R 4.1.0) UpSetR 1.4.0 2019-05-22 [1] CRAN (R 4.1.0) usethis 2.1.5 2021-12-09 [1] CRAN (R 4.1.0) utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.0) withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.0) XVector 0.34.0 2021-10-26 [1] Bioconductor zlibbioc 1.40.0 2021-10-26 [1] Bioconductor [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
thierrygosselin commented 2 years ago

Dear Luke Sorry for the bug you're experiencing

I'll have a look today Thierry

lukelloydjones commented 2 years ago

Hi Thierry,

No problem. Probably something wrong on our end. Thanks very much for looking into it so quickly.

Cheers,

Luke

From: Thierry Gosselin @.> Reply to: thierrygosselin/radiator @.> Date: Wednesday, 22 December 2021 at 9:37 am To: thierrygosselin/radiator @.> Cc: Mr Luke Lloyd-Jones @.>, Author @.***> Subject: Re: [thierrygosselin/radiator] Radiator DArT Counts error (Issue #146)

Dear Luke Sorry for the bug you're experiencing

I'll have a look today Thierry

— Reply to this email directly, view it on GitHubhttps://github.com/thierrygosselin/radiator/issues/146#issuecomment-999166293, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACL2HY4AMAETCB5JXS6OYALUSEFSVANCNFSM5KRGPMOA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.Message ID: @.***>

thierrygosselin commented 2 years ago

Hi Luke, Could you run the command below and tell me if you have the same results that I get ...

test1 <- radiator::read_dart(data = "SEQ_SNPs_counts_0_Target-two_part_id.csv", strata = "croc_strata_counts.tsv")
Reading DArT file...                                                
Number of blacklisted samples: 1                                                                         
DArT SNP format: alleles coverage in 2 Rows counts
Generating genotypes and calibrating REF/ALT alleles...
Number of markers recalibrated based on counts of allele read depth: 1393
Generating GDS...
File written: radiator_20211222@0948.gds.rad                        

Number of chrom: 1
Number of locus: 6125
Number of SNPs: 6760
Number of strata: 10
Number of individuals: 1297

Number of ind/strata:
5b__Coastal-Plains-CA = 288
3__North-East-Cape-York = 19
2__North-West-Cape-York = 459
4__Princess-Charlotte-Bay = 259
1b__Gulf-Plains-ALD = 8
5a__Coastal-Plains-CMC = 2
6b__Fitzroy = 24
1c__Gulf-Plains-NFD = 115
5c__Coastal-Plains-APrR = 110
1d__Gulf-Plains-MGD = 13

Number of duplicate id: 0

Computation time, overall: 27 sec
thierrygosselin commented 2 years ago

With the latest push, it should work, re-install radiator for this.

thierrygosselin commented 2 years ago

Suggestions for your dataset and ask Pierre for further advices, he knows how to deal with those problems

  1. The way populations / stratifications and individuals are named = it's asking for problems (read this).
  2. You have bias sample size between your strata. This requires extra checks. e.g. only 2 samples in the strata 5a__Coastal-Plains-CMC and 459 samples in 2__North-West-Cape-York

I suggest running the filter with this, because using common markers between strata is by default: on

test2 <- radiator::filter_rad(
  data = "SEQ_SNPs_counts_0_Target-two_part_id.csv",
  strata = "croc_strata_counts.tsv",
  filter.common.markers = FALSE
  )
  1. Not sure if the data was filtered before with another tool, but you should get around 3000 markers and you'lll have to get rid of the duplicate samples you have.
  2. For the 1st filter (reproducibility), I would use this:
Do you still want to blacklist markers? (y/n):
  y
2 options to blacklist markers based on reproducibility:
  1. use the outlier statistic
2. enter your own threshold
Enter the option (1 or 2): 
  1
  1. For filter_individuals based on different metrics, I would use these:
    
    Step 2. Filtering markers based individual missingness/genotyping

Do you want to blacklist samples based on missingness ? (y/n): y 2 options to blacklist samples:

  1. based on the outlier statistics
  2. enter your own threshold 1

And would turn off the rest and wait downstream **filter_rad** to view the figures before choosing the value:

Step 3. Filtering markers based on individual heterozygosity

Do you want to blacklist samples based on heterozygosity ? (y/n): n

Step 4. Filtering markers based on individual's coverage

Do you want to blacklist samples based on TOTAL coverage ? (y/n): n Do you want to blacklist samples based on MEDIAN coverage ? (y/n): n Do you want to blacklist samples based on Interquartile Range (IQR) coverage ? (y/n): n



**6. Coverage**: be careful with uneven coverage between samples, I suspect you have varying quality of DNA and/or wet lab problem with some (all Pierre).
**7. Duplicate samples:** you have technical replicates or duplicates and close-kin in the data (but close kin analysis should be done after removing all outliers based on heterozygosity, otherwise it's artefacts not biological).
**8. You have mixed samples**, very very different DNA quality or biological interesting phenomenon (bottleneck, inbreeding, outbreeding, 2 species, etc). this is observed with the individual heterozygosity (look for the range and bubble size that correlates with missing data). When I see a figure like this I usually bet that 95% is due to wet-lab problems with DNA. But I could be wrong. Pierre white shark dataset was similar.

If you email me I'll send a PDF of a talk I gave in Hobart about the different filtering steps... it's the missing manual.

Cheers
Thierry