thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

filter_rad issues #44

Closed kevinmneal closed 5 years ago

kevinmneal commented 5 years ago

Hi Thierry,

A few small bugs I've noticed with filter_rad() in interactive mode:

-Any filter that removes samples/individuals causes failure in the next filtering step:

Error: Column `MISSING_PROP` must be length 203 (the number of rows) or one, not 204

-detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:

gds <- detect_mixed_genomes(data = gds, interactive.filter = interactive.filter, 
    detect.mixed.genomes = detect.mixed.genomes, ind.heterozygosity.threshold = NULL, 
    parameters = filters.parameters, verbose = verbose, parallel.core = parallel.core,
    path.folder = wf, internal = FALSE) 

-May need to remove strata=NULL from filter_hwe in filter_rad():

gds <- filter_hwe(data = gds, interactive.filter = interactive.filter, 
    filter.hwe = filter.hwe, strata=NULL, hw.pop.threshold = hw.pop.threshold, 
    midp.threshold = midp.threshold, parallel.core = parallel.core, 
    parameters = filters.parameters, path.folder = wf, verbose = verbose, 
    internal = FALSE)

-when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.

-in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.

-Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.

-purely aesthetic, but when running on Windows, the font choice in the plots (Helvetica?) causes warnings:

In grid.Call(C_textBounds, as.graphicsAnnot(x$label),  ... :
  font family not found in Windows font database

Not issues, but questions/suggestions/requests: -Are there better explanations for how outliers/q75/iqr are calculated and applied? Is outliers just outside 95% CI? -filter_coverage step returns a plot of max mean coverage; a plot for min mean would be useful -In filter_genotyping, is the threshold applied per-strata or only on the total? -Long LD filtering appears to work but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering? -outputting the full function call with args entered during the interactive session would help with reproducibility -asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time

thierrygosselin commented 5 years ago

Hi Kevin, thanks a lot for the bug report. I don't have a PC nearby so it's very useful, I'm giving a workshop in 2 weeks and 19 out of the 25 participants have PCs...

I'll break the reply on topics...

1. Font Helvetica: just realized that my favorite font is not pre-installed on PCs. Snif. Will find a replacement found in all 3 OS...

thierrygosselin commented 5 years ago
thierrygosselin commented 5 years ago

a plot for min mean would be useful It's on my todo list to have the helper plot, it's there, just not outputting it. But it's less important or rational, because it all falls down to your comfort with low coverage... Personally I don't go below 7, because of null allele, heterozygosity, etc.

3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.

I'll check that at the same time. But it wouldn't be the first time a pipeline says it's filtering x and it doesn't... do you know if it's <= or <, because for your dataset that would be a big deal (see also my observation/comment of Minot Allele Count below...

You could try this with your dataset and the latest release:

data <- radiator::tidy_vcf(data = "OCall208_75pctmsng.recode.vcf",
                           strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
                           filter.common.markers = FALSE,
                           verbose = TRUE)

cov.markers <- data %>%
  dplyr::group_by(MARKERS) %>%
  dplyr::summarise(
    MEAN = mean(READ_DEPTH, na.rm = TRUE),
    MEDIAN = median(READ_DEPTH, na.rm = TRUE)
    )

dplyr::filter(cov.markers, MEAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEAN < 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN < 6) %>% nrow(.)

You should get these numbers for the unfiltered dataset:

2829
2826
3990
3680

Now both helper plot (low and high are printed) Using these thresholds:

Filter mean coverage thresholds: 7 / 100
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 5600 / 5600 / 6965
    Blacklisted: 0 / 0 / 1773 / 1773 / 2248
    After: 208 / 26 / 3827 / 3827 / 4717

There's definitely things to test with that filter and downstream analysis. I know some people who wouldn't do closekin with markers below 20 of coverage ... but that's up for debate...

thierrygosselin commented 5 years ago

The reason I show the missing per strata, the mean value and overall is that it helps to see if the populations are "equal". You can already predict if a pattern of missingness is present from this filter.

Remember, their is nothing missing unless you compare it with something...

The course of action when some populations drags down the polymorphism discovery is to remove them, do the downstream analysis. Re-introduce the population and see if it changes something.

The impact depends on the analysis:

Update: so that filter no longer crash RStudio on my end when the number of samples for a pop is below the number of CPU used...

thierrygosselin commented 5 years ago

You should see it if verbose = TRUE for main function. After, for the modules inside it's printed in the file with _args_ in it.

Additionally, the values for threshold, along CHROM/LOCUS/SNP kept and blacklisted for each filtering steps are written in the filters_parameters file in the 01_radiator folder

thierrygosselin commented 5 years ago

6. asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time

Most are fast enough: I usually test with 2M markers

As for HWE it's the only filter I'm not confortable 100% in removing markers. You really have to know what you're doing. Consequently, you should get the message Do you want to continue with the filtering ? after the plots are produced...

More on that filter below...

The only other now I saw useful to do it is the pruning of markers based on long ld... it also ask if you want to proceed or not. Because it's usually fast with a reference genome but can be long for de novo data.

thierrygosselin commented 5 years ago

It should work. No need to collapse CHROM. Likely a bug. with your VCF. I'm testing it with and without a genome, so I'll run it again to see if I broke something. WITH MISSING is currently only available WITH a reference genome. It's a computational issue for now.

The problem you're seeing is it with the VCF you sent me yesterday ? If so I'll do the test with it.

update so do you really have a reference genome or an approach using a catalog of individuals as pseudo-reference ? It looks like the info in CHROM are more like locus than scaffolds. You have 6000 of them almost as much as the markers.

Ok so I checked ipyrad doc and the options are described here

options:

  1. Change the CHROM info: I'll make available a series of codes in the function: filter_ld examples that will show how to easily change the CHROM column in your GDS object and/or tidy data. So you could skip the long LD step in filter_rad... and do it separately in filter_ld.
  2. Tell me if there is a better way to code the CHROM right from the start...

Code to change the chromosome information:

#Change the CHROM column-------------------------------------------------------
gds <- read_rad("full path to gds or .rad.gds file")
radiator::summary_gds(gds)

markers.meta <- radiator::extract_markers_metadata(gds)

# Keep a back up
markers.meta.bk <- markers.meta

# change the CHROM column
markers.meta %<>%
  dplyr::mutate(CHROM = "chrom_1")

# Update the GDS markers metadata (this is the radiator part)
update_radiator_gds(
  gds = gds,
  node.name = "markers.meta",
  value = markers.meta,
  replace = TRUE
)

# Update the chromosome info in GDS (GDS, SeqArray part)
update_radiator_gds(
  gds = gds,
  radiator.gds = FALSE,
  node.name = "chromosome",
  value = markers.meta$CHROM,
  replace = TRUE
)
#Ready for LD without the code running by scaffolds...
thierrygosselin commented 5 years ago

8. Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.

Yes, a pain. I have a fix for this that tracks if it's been done. It's always a problem with low coverage data + the removal of samples. It generates more monomorphic markers and requires recalibrating REF/ALT...

my plan in the next couple of weeks is to have a C++ code do it faster.

thierrygosselin commented 5 years ago

9.in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.

Sadly, no easy way. I have a solution in the vignette that shows how to start with the GDS file, removes some filters, etc.

The other solution might be for you to just use the filter modules you want and use magrittr %>% in-between.

Todo list: use shiny and let use pick filters on the left side and click and enter thresholds as the modular pipeline goes...

thierrygosselin commented 5 years ago

10. when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.

Bug I'll check the combination mentioned above. It's not for n < 10 (that's tested) not tested is when you enter to the same number of pops.

Usually with asterisk, if entering e.g. ***** returns no filtering, the interactive version will prompt to enter a new value... that might trigger an ever ending loop. I'll check.

thierrygosselin commented 5 years ago

Good catch.

thierrygosselin commented 5 years ago
gds <- filter_hwe(data = gds, interactive.filter = interactive.filter, 
    filter.hwe = filter.hwe, strata=NULL, hw.pop.threshold = hw.pop.threshold, 
    midp.threshold = midp.threshold, parallel.core = parallel.core, 
    parameters = filters.parameters, path.folder = wf, verbose = verbose, 
    internal = FALSE)

That one is actually necessary with other type of datasets. I don't think it's the reason you experience problem with this function.

kevinmneal commented 5 years ago

1- Arial is probably the safest bet, a near-copy of Helvetica 3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now. 6- I think it's going through the HWE calculations before asking if you want to filter 7- Yes, with that VCF. The issue might be that I'm using filter.common.markers=FALSE (at least for some analyses, I'd like to keep as many SNPs per population as possible, even if not common, mostly for doing effective population size calculations). I think when I did have it =TRUE that it was filtering properly. 10- Yeah, it became a never-ending loop regardless of what I entered. Sorry I don't have the parameters that triggered it, but if it comes up again I'll report back 13 (below)- Maybe that was being caused by my removing strata=NULL from filter_hwe, hmm. EDIT: Although it got an error after the detect_mixed_genomes step too. EDIT2: Yup, tried after putting strata=NULL back, still get the error.

Thanks!

thierrygosselin commented 5 years ago

Bug. I don't see this with my test dataset. Will check with your dataset

Awesome

kevinmneal commented 5 years ago

Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?

thierrygosselin commented 5 years ago

Will have a new commit with fixes by the end of the day When you run this

test1 <- radiator::read_vcf(
     data = "OCall208_75pctmsng.recode.vcf",
     strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
     verbose = TRUE)

On my Mac I'm getting this at the end:

Computation time, overall: 11 sec
########################### completed read_vcf ###########################
Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs,  :
  1 function calls resulted in an error

It's likely related to some SeqArray function I'm using and some don't like when the number of CPU is lower than the number of pop, so I have a fix for it, but are you getting a similar message (it won't be mclapply, but something else)

thierrygosselin commented 5 years ago

Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?

genome fasta was used for alignment right ? I'm not sure I understand how this is related to your point 7 for the common markers ?

kevinmneal commented 5 years ago

test1 <- radiator::read_vcf( data = "OCall208_75pctmsng.recode.vcf", strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt", verbose = TRUE)

For me on Windows, the same input returns the error

Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx,  : 
  One of the nodes produced an error: Can not open file 'C:\Users\kevin\Documents\Rtest\read_vcf_20190320@1141\01_import_gds\radiator_20190320@1141.gds'. The process cannot access the file because it is being used by another process.

I did use a genome fasta for alignment. I was wondering if incorporating that fasta into radiator operations added more information than is found in the VCF, but probably not, so ignore that.

Also, I did run more tests on long LD filtering with filter.common.markers either TRUE or FALSE, using long.ld.missing TRUE or FALSE. No combination seems to be filtering, so I'm really not sure what to do now. I think I might have gotten it to successfully filter once, but I don't remember the parameters there again (I really should be better at logging, but I've been running this dozens of times.)

kevinmneal commented 5 years ago

More info, if it helps with the long LD filtering:

When I run using long.ld.missing=TRUE, no SNPs are filtered, and this plot is the output: image

When I run using long.ld.missing=FALSE, no SNPs are filtered (regardless of threshold), and this plot is the output: image

kevinmneal commented 5 years ago

One more thing, an error I get after filtering when filter_rad passes results to genomic_converter; looks like it's an issue with the pcadapt output. Outputs up to that point are still being written, but none of the ones listed after in my filter_rad input.

Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
File written: blacklist.id.tsv

Generating statistics after filtering
calculating individual stats...
[==================================================] 100%, completed in 0s
Extracting DP information...
File written: individuals qc info and stats summary
File written: individuals qc plot
calculating markers stats...
Extracting DP information...
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s

Transferring data to genomic converter...
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 15
Synchronizing data and strata...
    Number of strata: 4
    Number of individuals: 208

Writing tidy data set:
OCall208.radiator.75pctmsng.oneSNPmac.rad
    * Number of sample pop, np = 4
    * Number of markers, nl = 6155
    * The highest number used to label an allele, nu = 4
    * The alleles are encoded with one digit number
Generating BayeScan file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1557

::radiatorfilter_common_markers function call arguments:
    data = tbl_df
    filter.common.markers = TRUE
    fig = FALSE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_common_markers:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_common_markers:
    parameters = NULL
    path.folder = NULL

Scanning for common markers...

Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1557

::radiatorfilter_monomorphic function call arguments:
    data = tbl_df
    filter.monomorphic = TRUE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_monomorphic:
    parameters = NULL
    path.folder = NULL

Scanning for monomorphic markers...

Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
Calibrating REF/ALT alleles...
    generating REF/ALT dictionary
    integrating genotypes codings...
writing BayeScan file with:
          Number of populations: 4
    Number of individuals: 208
    Number of biallelic markers: 6012
Writting populations dictionary
Writting markers dictionary
Generating pcadapt file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1604

::radiatorfilter_common_markers function call arguments:
    data = tbl_df
    filter.common.markers = TRUE
    fig = FALSE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_common_markers:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_common_markers:
    parameters = NULL
    path.folder = NULL

Scanning for common markers...

Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1604

::radiatorfilter_monomorphic function call arguments:
    data = tbl_df
    filter.monomorphic = TRUE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_monomorphic:
    parameters = NULL
    path.folder = NULL

Scanning for monomorphic markers...

Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
################################################################################
############################## radiator::filter_ld #############################
################################################################################
Execution date@time: 20190320@1604

filter_ld function call arguments:
    interactive.filter = TRUE
    data = tbl_df
    filter.short.ld = NULL
    filter.long.ld = NULL
    parallel.core = 3
    filename = NULL
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in filter_ld:
    ld.method = r2
    long.ld.missing = FALSE

Default "..." arguments assigned in filter_ld:
    internal = FALSE
    ld.figures = TRUE
    parameters = NULL
    path.folder = NULL
    subsample.markers.stats = NULL

File written: radiator_filter_ld_args_20190320@1604.tsv

Interactive mode: on

Step 1. Short distance LD threshold selection
Step 2. Filtering markers based on short distance LD
Step 3. Long distance LD pruning selection
Step 4. Threshold selection
Step 5. Filtering markers based on long distance LD

Filters parameters file generated: filters_parameters_20190320@1604.tsv
Filters parameters file: initiated
Minimizing short distance LD...

There is no variation in the number of SNP/locus across the data

Error: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
Call `rlang::last_error()` to see a backtrace
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Computation time, overall: 1 sec
############################# completed filter_ld ##############################

Computation time, overall: 1519 sec
############################# completed filter_rad #############################
> rlang::last_error()
<error>
message: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
class:   `rlang_error`
backtrace:
 1. radiator::filter_rad(...)
 2. radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
 3. radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
 4. radiator::filter_ld(...)
 5. radiator::radiator_parameters(...)
 6. tibble::tibble(...)
 7. tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
 8. tibble:::check_valid_cols(x)
Call `rlang::last_trace()` to see the full backtrace
> rlang::last_trace()
    x
 1. \-radiator::filter_rad(...)
 2.   \-radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
 3.     \-radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
 4.       \-radiator::filter_ld(...)
 5.         \-radiator::radiator_parameters(...)
 6.           \-tibble::tibble(...)
 7.             \-tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
 8.               \-tibble:::check_valid_cols(x)

Input was

OCall208.filtered <- radiator::filter_rad(data = OCall208.tidy, 
                                          strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata_K3_ARTIF.txt",
                                          verbose = TRUE,
                                          filename="OCall208.radiator.75pctmsng.oneSNPmac",
                                          parallel.core = 1,
                                          fig.upsetr=TRUE,
                                          filter.common.markers=FALSE,
                                          output=c("vcf", "genepop", "tidy", "plink", "genind", "genlight", "structure", "hierfstat", "bayescan", "betadiv", "pcadapt", "hzar", "related", "maverick"),
                                          path.folder="G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphaOCclust85/OCall208.radiator.filtered")
thierrygosselin commented 5 years ago

pcadapt : the function will only keep common markers and discard monomorphic markers automatically. Will check that one first. maverick: Will no longer work. I have to check what changed with the rmaverick.

thierrygosselin commented 5 years ago

fig.upsetr: forget about it. I now detect automatically if the number of samples will break UpSetR

thierrygosselin commented 5 years ago

Will have a solution strategy soon for your uneven sample size dataset... But you're not really going to do Ne on the populations with less than 15 samples ?

What's the specie? Also, just for my brain and correlations: Sequencing technology used and type of RAD wet lab technique ?

thierrygosselin commented 5 years ago
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 4675 / 4675 / 6327
    After: 208 / 26 / 1985 / 1985 / 2392

You really don't want that one...

kevinmneal commented 5 years ago

I've been getting Ne values out of NeEstimator with pretty narrow parametric confidence intervals on ~10 samples (jackknife CI is bigger though...), but I have been worried about bias which was a big reason for why I wanted to use radiator :) I have a few sites where I have 20-30 individuals and will compare results between using all individuals in a pond vs. using 10, but since they're tadpoles usually sampled in one trip, I worry about very high relatedness (which is why most ponds have ~10 now, after filtering for relatedness). I should try getting Ne using Colony but that program is not the most user-friendly... I'm also calculating Ne for genetic clusters in addition to pond-level.

Species is the western spadefoot toad, Spea hammondii. Sequenced with Illumina HiSeq 4000, wet lab technique was 3RAD (ddRAD but with a third enzyme to cut primer dimers) https://www.biorxiv.org/content/10.1101/205799v3

I've been trying out using common markers on a different strata, on K=3 clusters + a category for artificial ponds (so 4 strata) that I got from FastStructure. When I do that, the filtering is more reasonable:

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 4 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 148 / 148 / 193
    After: 208 / 4 / 6512 / 6512 / 8526
thierrygosselin commented 5 years ago

Have you tried ldne ? Eric Archer package strataG as a nice ldNe interface... ??strataG::ldNe my conversion function for gtypes won't work, he updated the internal structure and I'm currently updating it. But you could easily do it from a genind object with his conversion function.

thierrygosselin commented 5 years ago

Colony: I understand, and it's not the fastest. My output used to work because I ran a lot of dataset on colony to get closekin

However, you can get a pretty good estimate of how close your samples are with the detect_duplicate_genomes (it's running automatically in filter_rad). So far, clouds of closekin are almost always between 0.25 and 0.75

thierrygosselin commented 5 years ago

filter for common markers I wouldn't use it for now with low numbers like that, you'll likely throw out good markers. Skip that one.

Toad: very cool

kevinmneal commented 5 years ago

Do and Waples developed NeEstimator (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12157), which includes LDNe, molecular coancestry, and heterozygosity excess methods. Will give ldne in strataG a try - I'd been looking for an Ne implementation in R, thanks for pointing it out, and it looks like they implement some bias corrections for using thousands of SNPs that I don't think NeEstimator has even implemented yet: https://www.nature.com/articles/hdy201660

So far, clouds of closekin are almost always between 0.25 and 0.75

image

😅 I've since removed that top outlier, at least... my original relatedness filtering was based off of KING (implemented as --relatedness2 in vcftools) (https://academic.oup.com/bioinformatics/article/26/22/2867/228512) I think most of that cloud from 0.25-0.5 is from some artificial pools that are derived from a very small sample, though, so that's probably demonstrative of a founder effect.

Of course, whether it's appropriate to purge close relatives is still up for debate: https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14022

thierrygosselin commented 5 years ago

I love that function... it's basically an individual tree, but Manhattan style, which is way easier to highlight outlier... and so far you would believe how much are out there in published dataset...

With some of my own tuna datasets, removing duplicates or very close kin revealed another K (group) in DAPC and stockR... it was the filter with the biggest impact

duplicates

thierrygosselin commented 5 years ago

FYI Eric Archer and Robin have work closely on some projects, so I would trust the implementation

KING: be careful with that one, I compared results with colony and KING and some of KING results made no sense.

It's also implemented in SNPRelate (that uses GDS that radiator supports).

thierrygosselin commented 5 years ago

As for purging those close relatives, I think it really depends on the number of close kin, the dataset, the number of markers and differentiation... it's been a while I read that paper and remember having talked about it to both authors...

You probably read this one: Sampling related individuals within ponds biases estimates of population structure in a pond‐breeding amphibian by O'Connell et al.

thierrygosselin commented 5 years ago

lots of close relative in there... the relationship will be interesting to highlight...With colony you could also have a good idea on mating success and polygynandry closekin toad .

thierrygosselin commented 5 years ago

About minor allele count/frequency... I don't know what you used in ipyrad or in radiator but the very minimum...is probably 3...

Filter mac threshold: 3
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 1060 / 1060 / 1754
    After: 208 / 26 / 5600 / 5600 / 6965

And if everything before that was unfiltered, like in this example above, you really drop a lot of markers...

A MAC = 3 is like saying: Overall samples, you need: 3 heterozygote individuals or 1 heterozygote individual + 1 homozygote for the ALT allele to keep a marker The minimum number of samples is 2

A MAC = 2 can potentially mean just 1 sample ... so you need good coverage and confidence. If you use 2, check the coverage for those alternate alleles

kevinmneal commented 5 years ago

Yeah, the manhattan cloud is awesome. That stockR figure is really interesting. In my case, removing close relatives collapsed clusters--it was forcing a bunch of artificial pools derived from a single site (geographically and phylogenetically within another cluster) into their own cluster at K=2. Removing close relatives brought the artificial pools back in with the expected cluster. My guess is due to founder effect making the pools look far more differentiated than reality since they aren't very distant phylogenetically; could have been an issue of uneven sampling though. That cloud from 0.25-0.5 is almost all from these same artificial pools.

Re: MAC: I've been using MAC=2 (the vcf I sent you used mac=2 in vcftools), though for the same reason you explain, I have been considering MAC=3. My clustering results don't really change whether using 2 or 3, though. Vcftools also has an option of filtering both singletons and private doubletons, which I also toyed with, but decided a single homozygous individual with the ALT allele still represents real genetic diversity in the population (assuming good coverage, yes) - I'll check coverage on those, but might just do mac=3 like you suggest for simplicity

thierrygosselin commented 5 years ago

I have a function that highlight coverage for lower MAC... will find it. It was quick. If there's good coverage keep them. But testing with or without and looking at downstream impact is the best and fastest options.

detect_mixed_genomes is probably my favorite. I'm working with Eric Anderson on heterozygote miscall rate... and this one is good to highlight wet-lab problems. But also find cryptic species and all sorts of coverage related problems

thierrygosselin commented 5 years ago

Having lots of homozygotes generates something similar to closekin... it generates false structure. Having lots of heterozygotes is the opposite... because having 2 copies of each alleles makes you close to everybody.. you can see it easily in output of the function mentioned above

kevinmneal commented 5 years ago

There's a recent paper that tests maf/mac thresholds and their effect on structure and comes to the same conclusion, that rare alleles in general mess up structure: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12995 so I'll do mac=3...

That function would be great to use if you find it, thanks!

thierrygosselin commented 5 years ago

I did the move of switching my MAF filter to MAC filter just a bit before reading that paper when it was underpress or biorxiv, don't remember, but I'm glad I did. It's really a nice read.

thierrygosselin commented 5 years ago

Question for coverage for low Mac... with ipyrad: do you have coverage for REF/ALT allele or just the read depth of the genotypes ?

thierrygosselin commented 5 years ago

Because I have a function, not tested with your dataset, and it's been a while, work with tidy data, will update to work with GDS as well, but I think it requires there coverage for both alleles (which makes sense)...

??radiator::detect_allele_problems

thierrygosselin commented 5 years ago

Individual's heterozygosity

individual.heterozygosity.manhattan.plot.pdf

kevinmneal commented 5 years ago

The ipyrad vcf contains read depth for each base pair in the CATG format (this is custom to ipyrad, I believe). I wanted to code up a way to calculate REF/ALT depths from those based on the REF/ALT base pair columns, but figured it would take me too long/got distracted/not sure I'd actually know how to. A quick perusal of the vcf does show some genotypes where there's e.g. 50 reads for A and only 1 for C; not exactly sure how ipyrad/downstream software are calling that as a homozygote vs heterozygote.

I'll try out that function, thanks Thierry.

That het manhattan plot is a beauty and really does summarize a lot. All those IMXX sites are the artificial ones, very low heterozygosity. SHOE is artificial too, but very close to some natural ponds and likely some gene flow with those. Most of the other sites with low heterozygosity (CCSP, LAGUN, MORO, BBEND) are themselves likely a result of a relatively recent natural founder event, geographically isolated from the others. The sites with high heterozygosity are in more suitable, connected habitat.

thierrygosselin commented 5 years ago

What you don't want to see is bubble size correlated with the patterns of heterozygosity Here it's all good, I'll send by email a couple of bad ones, you'll understand

thierrygosselin commented 5 years ago

CATG oh ok that's easy then... give me 30 min I'll draft something

Ok so I'll now parse the CATG column into 4. What's nice is that you can re-calibrate the alleles REF/ALT based on the actual count of alleles, not just individuals. Which is way better and less biased for low coverage data.

kevinmneal commented 5 years ago

Thanks 👍

I feel bad making all these requests--I hope they've been useful and not tedious--but one other request if it's easy to implement: faststructure format output. Not sure why it's necessary, but for faststructure they ignore the first 6 columns, so would need to add 5 blank columns before the first marker column. Otherwise it's just a standard structure file with two rows per individual. (Alternatively, faststructure accepts a .bed file)

from https://rajanil.github.io/fastStructure/

While the original Structure program allowed for a more flexible input format, fastStructure expects a more specific Structure-like input format. Specifically, rows in the data file correspond to samples, with two rows per sample (note that only diploids are handled by this software), and columns correspond to SNPs. The first 6 columns of the file will be ignored; these typically would include IDs, metadata, etc. This software only handles bi-allelic loci. The two alleles at each locus can be encoded as desired; however, missing data should be encoded as `-9'.

An example of a faststructure file I made by converting a vcf to faststructure in PGDSpider (faststructure just ignores the numbers there in the extra columns, I believe):

OCcoastsub48_85clust_75pctmsng_oneSNPmac2_20190205.recode.str.txt

Immensely grateful for all the fixes and help you've given me so far, thanks Thierry

thierrygosselin commented 5 years ago

Look at this link for faststructure ;) Search my name on the page... was using it a bit a while ago... I should have an output for it somewhere or it was importing one already outputted by radiator. Will check

thierrygosselin commented 5 years ago

Screenshot of the doc for the metadata Capture d’écran, le 2019-03-21 Ă  17 29 57

Screenshot during function execution to know what to expect

Capture d’écran, le 2019-03-21 Ă  17 40 44

Screenshot of the doc for VCF import and parsing of markers metadata Capture d’écran, le 2019-03-21 Ă  17 29 23

kevinmneal commented 5 years ago

Accidentally clicked close, whoops. Reopened.

Fantastic, looking forward to the update!

thierrygosselin commented 5 years ago

write_faststructure: available as standalone. Let me know if it works as expected, will include it as an option in genomic_converter after