thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

Dart data and filter_rad() - 'coverage.info' not found #150

Closed neutronstar21 closed 1 year ago

neutronstar21 commented 2 years ago

Describe the bug

Hi Thierry, thanks for RADIATOR. I want to filter a Dart SNP mapping file using filter_rad() but I encounter this error and cant proceed:

function filter_rad() fails with error: "Error in extract_coverage(gds, markers = FALSE) : object 'coverage.info' not found"

I saw that you recently addressed a similar error "* bug fix using DArT data" and I believe I'm using the latest version of radiator (1.2.2) so not sure if its a related problem.

Thanks for your help.

To Reproduce

str_File_Name_METADATA = 'popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.tsv' str_File_Name_DATA = 'Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.csv'

filter_rad( data = str_File_Name_DATA, strata = str_File_Name_METADATA, interactive.filter = TRUE, output = NULL, filename = NULL, verbose = TRUE, parallel.core = parallel::detectCores() - 1 )

str_File_Name_DATA [1] "Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.csv" There were 50 or more warnings (use warnings() to see the first 50) str_File_Name_METADATA [1] "popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.tsv" filter_rad(

  • data = str_File_Name_DATA,
  • strata = str_File_Name_METADATA,
  • interactive.filter = TRUE,
  • output = NULL,
  • filename = NULL,
  • verbose = TRUE,
  • parallel.core = parallel::detectCores() - 1
  • ) ################################################################################ ############################# radiator::filter_rad ############################# ################################################################################ Execution date@time: 20220117@0928 Folder created: filter_rad_20220117@0928 Function call and arguments stored in: radiator_filter_rad_args_20220117@0928.tsv File written: random.seed (14058)
    Filters parameters file generated: filters_parameters_20220117@0928.tsv Reading DArT file...
    Number of blacklisted samples: 0
    DArT SNP format: genotypes in 1 Row Generating genotypes and calibrating REF/ALT alleles... Number of markers recalibrated based on counts of allele: 2972 Generating GDS... File written: radiator_20220117@0928.gds.rad

Number of chrom: 1 Number of locus: 18882 Number of SNPs: 20675 Number of strata: 1 Number of individuals: 519

Number of ind/strata: NSW = 519

Number of duplicate id: 0

Computation time, overall: 29 sec ################################################################################ #################### radiator::filter_dart_reproducibility ##################### ################################################################################ Execution date@time: 20220117@0928 Function call and arguments stored in: radiator_filter_dart_reproducibility_args_20220117@0928.tsv

Interactive mode: on 2 steps to visualize and filter the data based on reproducibility: Step 1. Visualization Step 2. Choose the filtering threshold

File written: dart_reproducibility_stats.tsv
File written: dart_reproducibility_boxplot_20220117@0928.pdf Generating helper table... Files written: helper tables and plots

Step 2. Filtering markers based on markers reproducibility

Do you still want to blacklist markers? (y/n): n

Computation time, overall: 11 sec #################### completed filter_dart_reproducibility ##################### ################################################################################ ######################### radiator::filter_monomorphic ######################### ################################################################################ Execution date@time: 20220117@0928 Function call and arguments stored in: radiator_filter_monomorphic_args_20220117@0928.tsv File written: whitelist.polymorphic.markers_20220117@0928.tsv
################################### RESULTS ####################################

Filter monomorphic markers Number of individuals / strata / chrom / locus / SNP: Before: 519 / 1 / 1 / 18882 / 20675 Blacklisted: 0 / 0 / 0 / 0 / 0 After: 519 / 1 / 1 / 18882 / 20675

Computation time, overall: 1 sec ######################### completed filter_monomorphic ######################### ################################################################################ ####################### radiator::filter_common_markers ######################## ################################################################################ Execution date@time: 20220117@0928 Function call and arguments stored in: radiator_filter_common_markers_args_20220117@0928.tsv Scanning for common markers... Filter common markers: only 1 strata, returning data

Computation time, overall: 0 sec ####################### completed filter_common_markers ######################## ################################################################################ ######################### radiator::filter_individuals ######################### ################################################################################ Execution date@time: 20220117@0928 Function call and arguments stored in: radiator_filter_individuals_args_20220117@0928.tsv Interactive mode: on

Step 1. Visualization Step 2. Missingness Step 3. Heterozygosity Step 4. Coverage (if available)

Step 1. Visualization of samples QC

Error in extract_coverage(gds, markers = FALSE) : object 'coverage.info' not found In addition: There were 50 or more warnings (use warnings() to see the first 50)

Computation time, overall: 0 sec ######################### completed filter_individuals #########################

Computation time, overall: 43 sec ############################# completed filter_rad #############################

[1] M:/ENVI/installed/R/R-4.1.1/library

  • complete data required to reproduce the problem, a subset of it. The data remains confidential.

popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.zip

Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.zip

thierrygosselin commented 1 year ago

Dear Dean, Sorry for the late reply on this.

If this is still relevant, try installing the latest radiator.

Couple of steps before using filter_rad

  1. try the separate function for DArT data:
library(radiator)
shark <- radiator::read_dart(
  data = "Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.csv",
  strata = "popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.tsv",
  verbose = TRUE
  )

It's a bit long but worth waiting for the speed GDS gives after...

################################### SUMMARY ####################################

Number of chrom: 1
Number of locus: 18882
Number of SNPs: 20675
Number of strata: 1
Number of individuals: 519

Number of ind/strata:
NSW = 519

Number of duplicate id: 0

Computation time, overall: 110 sec

Next, what I like to do when I receive a data set is to check a couple of QC steps

  1. radiator::detect_duplicate_genomes will check for technical duplicates (DArT use those) but with most project, I see a lot of wet lab problems or sampling errors...
test1 <- radiator::detect_duplicate_genomes(data = shark)

In the folder it will generate, go check this figure: manhattan.plot.distance.png

Answer n to the question on the console, don't blacklist samples just yet...

  1. radiator::detect_mixed_genomes look at individual heterozygosity. It's a good way for me to see if the problem of data quality comes from. Usually, it's wet lab, sometimes it's tissue quality, I've seen this with white shark data coming from DNA tissue taken from death shark from beach protective net. You just don't how long the dead shark was in the water...
test2 <- radiator::detect_mixed_genomes(data = shark)

The interesting figure here is individual.heterozygosity.manhattan.plot.png. This shows the outlier samples in the data...

Too many het locus in an individual makes it look closer to everybody (he share an allele with everyone), to few is usually because of bad DNA, lots of missing data (big bubble, below the IQR, if you like box plot).

When you have tiny bubbles higher in the figure that stands out it's a number of things:

I usually filter out the duplicates and the mixed individuals inside filter_rad but here you could already take those out just to see the outcome...

To the question on the console:

Inspect plots and tables in folder created...
    Do you want to exclude individuals based on heterozygosity ? (y/n): 

answer: y

The threshold I use here are based on the figure (the subtitle: overall data outlier thresholds (low/high): 0.166314/0.199396...

    Enter the min value for ind.heterozygosity.threshold argument (0 turns off): 

use: 0.166314

    Enter the max value for ind.heterozygosity.threshold argument (1 turns off): 

use: 0.199396

Filter individual's heterozygosity: 35 individual(s) blacklisted
################################### RESULTS ####################################
Detect mixed genomes: 0.166314 0.199396
Number of individuals / strata / chrom / locus / SNP:
    Before: 519 / 1 / 1 / 18882 / 20675
    Blacklisted: 35 / 0 / 0 / 0 / 0
    After: 484 / 1 / 1 / 18882 / 20675

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 0 / 1313 / 1610

Computation time, overall: 27 sec
######################## completed detect_mixed_genomes ########################
  1. Re-run radiator::detect_duplicate_genomes
test3 <- radiator::detect_duplicate_genomes(data = test2)

You will see how much the figure as changed, for the better. Test1 you did showed samples that are clearly duplicates (the bubble close to 0. The ones above, 0.25 ... not normal. It should be below 0.5 unless you sampled families of shark all linked or related ...

The test3 is what I would expect from a normal DArT dataset.

Inspect tables and figures to decide if some individual(s) need to be blacklisted
    Do you need to blacklist individual(s) (y/n): 

type: y

2 options to remove duplicates:
    1. threshold: using the figure you choose a threshold. It's more powerful to fully remove duplicates
    2. manually: the function generate a blacklist that you have to complete
    Note: not sure ? Use option 1, it's more powerful to fully remove duplicates
    Enter the option to remove duplicates (1/2): 

use : 1

Enter the threshold to remove duplicates: (between 0 and 1)

use : 0.25

2 options to remove duplicates involved in pairs from different strata/group:
    (the black points on the figure, above your threshold)
    1: blacklist both samples in the pair
    2: blacklist only 1 sample, based on missingness
    Enter 1/2: 

I would use: 2 but they are times I use: 1

With threshold selected, 15 individual(s) blacklisted                                                    
Written in the directory: blacklist.id.similar.tsv                  
Blacklisted individuals: 15 ind.
    Filtering with blacklist of individuals
################################### RESULTS ####################################
Detect duplicate genomes: 0.25
Number of individuals / strata / chrom / locus / SNP:
    Before: 484 / 1 / 1 / 17569 / 19065
    Blacklisted: 15 / 0 / 0 / 0 / 0
    After: 469 / 1 / 1 / 17569 / 19065

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 0 / 64 / 74

Computation time, overall: 171 sec
###################### completed detect_duplicate_genomes ######################

**the files in the folders with .tsv can be open in e.g. excel and you can easily check the samples that are considered duplicates...

The samples above 0.75 around the 0.50 line are probably close kin ... but I would clean my dataset first with filter_rad...

email me if you have questions, open the issue if you still have problems with radiator...