Hi Kevin, thanks a lot for the bug report. I don't have a PC nearby so it's very useful, I'm giving a workshop in 2 weeks and 19 out of the 25 participants have PCs...
I'll break the reply into topics...
1. Font Helvetica: I just realized that my favorite font is not pre-installed on PCs. Sniff. I'll find a replacement available on all 3 OSes...
2. "a plot for min mean would be useful"
The helper plot is on my todo list; it's computed, just not being output yet.
But it's less important, because it all comes down to your comfort with low coverage...
Personally I don't go below 7, because of null alleles, heterozygosity, etc.
3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.
I'll check that at the same time. But it wouldn't be the first time a pipeline says it's filtering x and it doesn't... Do you know if it's <= or <? Because for your dataset that would be a big deal (see also my observation/comment on Minor Allele Count below)...
You could try this with your dataset and the latest release:
library(magrittr) # for %>%

# Import the VCF as tidy data, keeping all markers (no common-markers filter)
data <- radiator::tidy_vcf(
  data = "OCall208_75pctmsng.recode.vcf",
  strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
  filter.common.markers = FALSE,
  verbose = TRUE
)

# Mean and median read depth per marker
cov.markers <- data %>%
  dplyr::group_by(MARKERS) %>%
  dplyr::summarise(
    MEAN = mean(READ_DEPTH, na.rm = TRUE),
    MEDIAN = median(READ_DEPTH, na.rm = TRUE)
  )

# Compare <= and < at a threshold of 6
dplyr::filter(cov.markers, MEAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEAN < 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN < 6) %>% nrow(.)
You should get these numbers for the unfiltered dataset:
MEAN <= 6: 2829
MEAN < 6: 2826
MEDIAN <= 6: 3990
MEDIAN < 6: 3680
Now both helper plots (low and high) are printed. Using these thresholds:
Filter mean coverage thresholds: 7 / 100
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 5600 / 5600 / 6965
Blacklisted: 0 / 0 / 1773 / 1773 / 2248
After: 208 / 26 / 3827 / 3827 / 4717
There are definitely things to test with that filter and downstream analyses. I know some people who wouldn't do close-kin analysis with markers below 20x coverage... but that's up for debate...
The reason I show missingness per strata, the mean value, and the overall value is that it helps to see if the populations are "equal". You can already predict from this filter whether a pattern of missingness is present.
Remember, there is nothing missing unless you compare it with something...
The course of action when some populations drag down polymorphism discovery is to remove them, run the downstream analysis, then re-introduce the population and see if anything changes.
The impact depends on the analysis.
Update: that filter no longer crashes RStudio on my end when the number of samples in a pop is below the number of CPUs used...
You should see it if verbose = TRUE for the main function. For the modules inside, it's printed in the file with _args_ in its name.
Additionally, the threshold values, along with the CHROM/LOCUS/SNP kept and blacklisted at each filtering step, are written to the filters_parameters file in the 01_radiator folder.
6. asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time
Most are fast enough: I usually test with 2M markers
As for HWE, it's the only filter where I'm not 100% comfortable removing markers. You really have to know what you're doing. Consequently, you should get the message "Do you want to continue with the filtering?" after the plots are produced...
More on that filter below...
The only other one where I saw this as useful is the pruning of markers based on long-distance LD... it also asks if you want to proceed, because it's usually fast with a reference genome but can be slow for de novo data.
It should work; no need to collapse CHROM. Likely a bug with your VCF. I'm testing it with and without a genome, so I'll run it again to see if I broke something. WITH MISSING is currently only available WITH a reference genome; it's a computational issue for now.
Is the problem you're seeing with the VCF you sent me yesterday? If so, I'll run the test with it.
Update: do you really have a reference genome, or an approach using a catalog of individuals as a pseudo-reference? It looks like the info in CHROM is more like loci than scaffolds: you have ~6000 of them, almost as many as the markers.
Ok, so I checked the ipyrad doc and the options are described here. Below are filter_ld examples that will show how to easily change the CHROM column in your GDS object and/or tidy data. So you could skip the long LD step in filter_rad... and do it separately in filter_ld, changing CHROM right from the start... Code to change the chromosome information:
# Change the CHROM column ------------------------------------------------------
library(magrittr) # for %<>%
gds <- read_rad("full path to gds or .rad.gds file")
radiator::summary_gds(gds)
markers.meta <- radiator::extract_markers_metadata(gds)

# Keep a backup
markers.meta.bk <- markers.meta

# Collapse everything into a single chromosome
markers.meta %<>% dplyr::mutate(CHROM = "chrom_1")

# Update the GDS markers metadata (this is the radiator part)
update_radiator_gds(
  gds = gds,
  node.name = "markers.meta",
  value = markers.meta,
  replace = TRUE
)

# Update the chromosome info in the GDS (SeqArray part)
update_radiator_gds(
  gds = gds,
  radiator.gds = FALSE,
  node.name = "chromosome",
  value = markers.meta$CHROM,
  replace = TRUE
)
# Ready for LD without the code running by scaffolds...
8. Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.
Yes, a pain. I have a fix for this that tracks whether it's been done. It's always a problem with low-coverage data + the removal of samples: it generates more monomorphic markers and requires recalibrating REF/ALT... My plan in the next couple of weeks is to have C++ code do it faster.
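The logic of that calibration is essentially a majority check. A toy sketch of the idea (not radiator's actual code; the column names here are hypothetical):

library(dplyr)
markers <- tibble::tibble(
  MARKERS = c("m1", "m2"),
  REF = c("A", "C"), ALT = c("G", "T"),
  REF_COUNT = c(10L, 3L), ALT_COUNT = c(2L, 9L)
)
markers %>%
  dplyr::mutate(
    SWITCH  = ALT_COUNT > REF_COUNT,            # m2 needs a swap
    NEW_REF = dplyr::if_else(SWITCH, ALT, REF),
    NEW_ALT = dplyr::if_else(SWITCH, REF, ALT)
  )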
9. In general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.
Sadly, no easy way. I have a solution in the vignette that shows how to start from the GDS file, remove some filters, etc.
The other solution might be to just use the filter modules you want and chain them with magrittr's %>% in between (see the sketch below).
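For example, an untested sketch using the files from this thread (argument names as they appear in the logs below):

library(magrittr)
gds <- radiator::read_vcf(
  data = "OCall208_75pctmsng.recode.vcf",
  strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
  verbose = TRUE
)
# Run only the modules you want, in the order you want
gds %>%
  radiator::filter_monomorphic(data = .) %>%
  radiator::filter_ld(data = ., interactive.filter = TRUE)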
Todo list: use shiny and let users pick filters on the left side, then click and enter thresholds as the modular pipeline goes...
10. when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.
Bug. I'll check the combination mentioned above. It's not n < 10 (that's tested); what's not tested is when you enter the same number as the number of pops.
Usually with asterisks, if entering e.g. ***** returns no filtering, the interactive version will prompt you to enter a new value... that might trigger a never-ending loop. I'll check.
Good catch.
gds <- filter_hwe(
  data = gds, interactive.filter = interactive.filter,
  filter.hwe = filter.hwe, strata = NULL, hw.pop.threshold = hw.pop.threshold,
  midp.threshold = midp.threshold, parallel.core = parallel.core,
  parameters = filters.parameters, path.folder = wf, verbose = verbose,
  internal = FALSE
)
That one is actually necessary with other types of datasets. I don't think it's the reason you're experiencing problems with this function.
1- Arial is probably the safest bet, a near-copy of Helvetica
3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.
6- I think it's going through the HWE calculations before asking if you want to filter
7- Yes, with that VCF. The issue might be that I'm using filter.common.markers = FALSE (at least for some analyses, I'd like to keep as many SNPs per population as possible, even if not common, mostly for effective population size calculations). I think when I had it set to TRUE it was filtering properly.
10- Yeah, it became a never-ending loop regardless of what I entered. Sorry I don't have the parameters that triggered it, but if it comes up again I'll report back.
13 (below)- Maybe that was caused by my removing strata=NULL from filter_hwe, hmm. EDIT: although I got an error after the detect_mixed_genomes step too. EDIT2: yup, tried after putting strata=NULL back, still get the error.
Thanks!
Bug. I don't see this with my test dataset. Will check with your dataset.
Awesome
Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?
Will have a new commit with fixes by the end of the day. When you run this:
test1 <- radiator::read_vcf(
  data = "OCall208_75pctmsng.recode.vcf",
  strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
  verbose = TRUE
)
On my Mac I'm getting this at the end:
Computation time, overall: 11 sec
########################### completed read_vcf ###########################
Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs, :
1 function calls resulted in an error
It's likely related to some SeqArray function I'm using; some don't like it when the number of CPUs is lower than the number of pops, so I have a fix for it. But are you getting a similar message (it won't be mclapply, but something else)?
Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?
The genome fasta was used for alignment, right? I'm not sure I understand how this relates to your point 7 about common markers?
test1 <- radiator::read_vcf( data = "OCall208_75pctmsng.recode.vcf", strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt", verbose = TRUE)
For me on Windows, the same input returns the error
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, :
One of the nodes produced an error: Can not open file 'C:\Users\kevin\Documents\Rtest\read_vcf_20190320@1141\01_import_gds\radiator_20190320@1141.gds'. The process cannot access the file because it is being used by another process.
I did use a genome fasta for alignment. I was wondering if incorporating that fasta into radiator operations added more information than is found in the VCF, but probably not, so ignore that.
Also, I did run more tests on long LD filtering with filter.common.markers either TRUE or FALSE, using long.ld.missing TRUE or FALSE. No combination seems to be filtering, so I'm really not sure what to do now. I think I might have gotten it to successfully filter once, but again I don't remember the parameters (I really should be better at logging, but I've been running this dozens of times).
More info, if it helps with the long LD filtering:
When I run using long.ld.missing=TRUE, no SNPs are filtered, and this plot is the output:
When I run using long.ld.missing=FALSE, no SNPs are filtered (regardless of threshold), and this plot is the output:
One more thing: an error I get after filtering, when filter_rad passes results to genomic_converter; it looks like an issue with the pcadapt output. Outputs up to that point are still written, but none of the ones listed after it in my filter_rad call.
Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
File written: blacklist.id.tsv
Generating statistics after filtering
calculating individual stats...
[==================================================] 100%, completed in 0s
Extracting DP information...
File written: individuals qc info and stats summary
File written: individuals qc plot
calculating markers stats...
Extracting DP information...
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
Transferring data to genomic converter...
Calibrating REF/ALT alleles...
number of REF/ALT switch = 15
Synchronizing data and strata...
Number of strata: 4
Number of individuals: 208
Writing tidy data set:
OCall208.radiator.75pctmsng.oneSNPmac.rad
* Number of sample pop, np = 4
* Number of markers, nl = 6155
* The highest number used to label an allele, nu = 4
* The alleles are encoded with one digit number
Generating BayeScan file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1557
radiator::filter_common_markers function call arguments:
data = tbl_df
filter.common.markers = TRUE
fig = FALSE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in radiator::filter_common_markers:
internal = TRUE
Default "..." arguments assigned in radiator::filter_common_markers:
parameters = NULL
path.folder = NULL
Scanning for common markers...
Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1557
radiator::filter_monomorphic function call arguments:
data = tbl_df
filter.monomorphic = TRUE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in radiator::filter_monomorphic:
internal = TRUE
Default "..." arguments assigned in radiator::filter_monomorphic:
parameters = NULL
path.folder = NULL
Scanning for monomorphic markers...
Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
Calibrating REF/ALT alleles...
generating REF/ALT dictionary
integrating genotypes codings...
writing BayeScan file with:
Number of populations: 4
Number of individuals: 208
Number of biallelic markers: 6012
Writing populations dictionary
Writing markers dictionary
Generating pcadapt file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1604
radiator::filter_common_markers function call arguments:
data = tbl_df
filter.common.markers = TRUE
fig = FALSE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in radiator::filter_common_markers:
internal = TRUE
Default "..." arguments assigned in radiator::filter_common_markers:
parameters = NULL
path.folder = NULL
Scanning for common markers...
Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1604
radiator::filter_monomorphic function call arguments:
data = tbl_df
filter.monomorphic = TRUE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in radiator::filter_monomorphic:
internal = TRUE
Default "..." arguments assigned in radiator::filter_monomorphic:
parameters = NULL
path.folder = NULL
Scanning for monomorphic markers...
Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
################################################################################
############################## radiator::filter_ld #############################
################################################################################
Execution date@time: 20190320@1604
filter_ld function call arguments:
interactive.filter = TRUE
data = tbl_df
filter.short.ld = NULL
filter.long.ld = NULL
parallel.core = 3
filename = NULL
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in filter_ld:
ld.method = r2
long.ld.missing = FALSE
Default "..." arguments assigned in filter_ld:
internal = FALSE
ld.figures = TRUE
parameters = NULL
path.folder = NULL
subsample.markers.stats = NULL
File written: radiator_filter_ld_args_20190320@1604.tsv
Interactive mode: on
Step 1. Short distance LD threshold selection
Step 2. Filtering markers based on short distance LD
Step 3. Long distance LD pruning selection
Step 4. Threshold selection
Step 5. Filtering markers based on long distance LD
Filters parameters file generated: filters_parameters_20190320@1604.tsv
Filters parameters file: initiated
Minimizing short distance LD...
There is no variation in the number of SNP/locus across the data
Error: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
Call `rlang::last_error()` to see a backtrace
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Computation time, overall: 1 sec
############################# completed filter_ld ##############################
Computation time, overall: 1519 sec
############################# completed filter_rad #############################
> rlang::last_error()
<error>
message: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
class: `rlang_error`
backtrace:
1. radiator::filter_rad(...)
2. radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
3. radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
4. radiator::filter_ld(...)
5. radiator::radiator_parameters(...)
6. tibble::tibble(...)
7. tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
8. tibble:::check_valid_cols(x)
Call `rlang::last_trace()` to see the full backtrace
> rlang::last_trace()
x
1. \-radiator::filter_rad(...)
2. \-radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
3. \-radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
4. \-radiator::filter_ld(...)
5. \-radiator::radiator_parameters(...)
6. \-tibble::tibble(...)
7. \-tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
8. \-tibble:::check_valid_cols(x)
Input was:

OCall208.filtered <- radiator::filter_rad(
  data = OCall208.tidy,
  strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata_K3_ARTIF.txt",
  verbose = TRUE,
  filename = "OCall208.radiator.75pctmsng.oneSNPmac",
  parallel.core = 1,
  fig.upsetr = TRUE,
  filter.common.markers = FALSE,
  output = c("vcf", "genepop", "tidy", "plink", "genind", "genlight", "structure",
             "hierfstat", "bayescan", "betadiv", "pcadapt", "hzar", "related", "maverick"),
  path.folder = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphaOCclust85/OCall208.radiator.filtered"
)
pcadapt: the function will only keep common markers and discard monomorphic markers automatically. Will check that one first.
maverick: will no longer work. I have to check what changed with rmaverick.
fig.upsetr: forget about it. I now detect automatically if the number of samples will break UpSetR.
Will have a strategy soon for your uneven sample size dataset... But you're not really going to estimate Ne on populations with fewer than 15 samples?
What's the species? Also, just for my brain and correlations: sequencing technology used and type of RAD wet-lab technique?
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 4675 / 4675 / 6327
After: 208 / 26 / 1985 / 1985 / 2392
You really don't want that one...
I've been getting Ne values out of NeEstimator with pretty narrow parametric confidence intervals on ~10 samples (the jackknife CI is bigger, though...), but I have been worried about bias, which was a big reason why I wanted to use radiator :) I have a few sites where I have 20-30 individuals and will compare results between using all individuals in a pond vs. using 10, but since they're tadpoles usually sampled in one trip, I worry about very high relatedness (which is why most ponds have ~10 now, after filtering for relatedness). I should try getting Ne using Colony, but that program is not the most user-friendly... I'm also calculating Ne for genetic clusters in addition to pond-level.
Species is the western spadefoot toad, Spea hammondii. Sequenced with Illumina HiSeq 4000, wet lab technique was 3RAD (ddRAD but with a third enzyme to cut primer dimers) https://www.biorxiv.org/content/10.1101/205799v3
I've been trying out using common markers on a different strata, on K=3 clusters + a category for artificial ponds (so 4 strata) that I got from FastStructure. When I do that, the filtering is more reasonable:
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 4 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 148 / 148 / 193
After: 208 / 4 / 6512 / 6512 / 8526
Have you tried ldNe?
Eric Archer's package strataG has a nice ldNe interface... ??strataG::ldNe
My conversion function for gtypes won't work; he updated the internal structure and I'm currently updating it. But you could easily do it from a genind object with his conversion function.
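Something like this untested sketch (assuming my.genind is an existing adegenet genind object with populations set; check ??strataG::ldNe for the available options):

library(strataG)
g <- genind2gtypes(my.genind) # Eric Archer's conversion function
ne <- ldNe(g)                 # LD-method Ne estimates
ne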
Colony: I understand, and it's not the fastest. My output used to work; I ran a lot of datasets through Colony to get close kin.
However, you can get a pretty good estimate of how close your samples are with detect_duplicate_genomes (it runs automatically in filter_rad). So far, clouds of close kin are almost always between 0.25 and 0.75.
Filter common markers: I wouldn't use it for now with low numbers like that; you'll likely throw out good markers. Skip that one.
Toad: very cool
Do and Waples developed NeEstimator (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12157), which includes LDNe, molecular coancestry, and heterozygosity excess methods. Will give ldNe in strataG a try - I'd been looking for an Ne implementation in R, thanks for pointing it out, and it looks like they implement some bias corrections for using thousands of SNPs that I don't think NeEstimator has implemented yet: https://www.nature.com/articles/hdy201660
So far, clouds of closekin are almost always between 0.25 and 0.75
I've since removed that top outlier, at least... My original relatedness filtering was based on KING (implemented as --relatedness2 in vcftools) (https://academic.oup.com/bioinformatics/article/26/22/2867/228512). I think most of that cloud from 0.25-0.5 is from some artificial pools that are derived from a very small sample, though, so that's probably demonstrative of a founder effect.
Of course, whether it's appropriate to purge close relatives is still up for debate: https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14022
I love that function... it's basically an individual tree, but Manhattan style, which makes it way easier to highlight outliers... and you wouldn't believe how many are out there in published datasets...
With some of my own tuna datasets, removing duplicates or very close kin revealed another K (group) in DAPC and stockR... it was the filter with the biggest impact.
FYI Eric Archer and Robin have worked closely on some projects, so I would trust the implementation.
KING: be careful with that one; I compared results from Colony and KING, and some of KING's results made no sense.
It's also implemented in SNPRelate (which uses GDS, which radiator supports).
As for purging those close relatives, I think it really depends on the number of close kin, the dataset, the number of markers, and differentiation... it's been a while since I read that paper, and I remember having talked about it with both authors...
You probably read this one: Sampling related individuals within ponds biases estimates of population structure in a pond-breeding amphibian by O'Connell et al.
Lots of close relatives in there... the relationships will be interesting to highlight... With Colony you could also get a good idea of mating success and polygynandry.
About minor allele count/frequency... I don't know what you used in ipyrad or in radiator, but the very minimum... is probably 3...
Filter mac threshold: 3
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 1060 / 1060 / 1754
After: 208 / 26 / 5600 / 5600 / 6965
And if everything before that was unfiltered, like in this example above, you really drop a lot of markers...
A MAC = 3 is like saying: across all samples, you need either
3 heterozygous individuals, or
1 heterozygous individual + 1 homozygote for the ALT allele
to keep a marker. The minimum number of samples carrying the allele is 2.
A MAC = 2 can potentially mean just 1 sample... so you need good coverage and confidence.
If you use 2, check the coverage of those alternate alleles.
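To make the counting concrete, a toy example (genotypes coded as ALT-allele copies: 0 = hom REF, 1 = het, 2 = hom ALT):

geno <- c(0, 0, 0, 1, 2)  # 1 het + 1 hom ALT sample
alt.copies <- sum(geno)   # 3 ALT copies
mac <- min(alt.copies, 2 * length(geno) - alt.copies)
mac # 3: kept under MAC = 3, with only 2 samples carrying the ALT allele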
Yeah, the manhattan cloud is awesome. That stockR figure is really interesting. In my case, removing close relatives collapsed clusters--it was forcing a bunch of artificial pools derived from a single site (geographically and phylogenetically within another cluster) into their own cluster at K=2. Removing close relatives brought the artificial pools back in with the expected cluster. My guess is that a founder effect was making the pools look far more differentiated than they really are, since they aren't very distant phylogenetically; it could have been an issue of uneven sampling, though. That cloud from 0.25-0.5 is almost all from these same artificial pools.
Re: MAC: I've been using MAC=2 (the vcf I sent you used mac=2 in vcftools), though for the same reason you explain, I have been considering MAC=3. My clustering results don't really change whether using 2 or 3, though. Vcftools also has an option of filtering both singletons and private doubletons, which I also toyed with, but decided a single homozygous individual with the ALT allele still represents real genetic diversity in the population (assuming good coverage, yes) - I'll check coverage on those, but might just do mac=3 like you suggest for simplicity
I have a function that highlights coverage for low MAC... will find it. It was quick. If there's good coverage, keep them. But testing with and without, and looking at the downstream impact, is the best and fastest option.
detect_mixed_genomes is probably my favorite. I'm working with Eric Anderson on heterozygote miscall rate... and this one is good for highlighting wet-lab problems, but it also finds cryptic species and all sorts of coverage-related problems.
Having lots of homozygotes generates something similar to close kin... it generates false structure. Having lots of heterozygotes is the opposite... because having 2 copies of each allele makes you close to everybody. You can see it easily in the output of the function mentioned above.
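For intuition, the statistic driving those outlier plots is just per-individual observed heterozygosity; a minimal sketch (not radiator's implementation):

# geno012: matrix of genotypes coded 0/1/2, rows = individuals, columns = markers
het_obs <- function(geno012) {
  rowMeans(geno012 == 1, na.rm = TRUE) # proportion of heterozygous genotypes
}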
There's a recent paper that tests maf/mac thresholds and their effect on structure and comes to the same conclusion, that rare alleles in general mess up structure: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12995 so I'll do mac=3...
That function would be great to use if you find it, thanks!
I made the move of switching my MAF filter to a MAC filter just before reading that paper, when it was in press or on bioRxiv, I don't remember, but I'm glad I did. It's really a nice read.
Question about coverage for low MAC, with ipyrad: do you have coverage for the REF/ALT alleles or just the read depth of the genotypes?
Because I have a function (not tested with your dataset, and it's been a while) that works with tidy data; I'll update it to work with GDS as well, but I think it requires coverage for both alleles (which makes sense)...
??radiator::detect_allele_problems
[plot: individual heterozygosity]
The ipyrad vcf contains read depth for each base pair in the CATG format (this is custom to ipyrad, I believe). I wanted to code up a way to calculate REF/ALT depths from those, based on the REF/ALT base pair columns, but figured it would take me too long / got distracted / wasn't sure I'd actually know how to. A quick perusal of the vcf does show some genotypes where there are e.g. 50 reads for A and only 1 for C; I'm not exactly sure how ipyrad/downstream software calls that a homozygote vs. a heterozygote.
I'll try out that function, thanks Thierry.
That het manhattan plot is a beauty and really does summarize a lot. All those IMXX sites are the artificial ones, very low heterozygosity. SHOE is artificial too, but very close to some natural ponds and likely some gene flow with those. Most of the other sites with low heterozygosity (CCSP, LAGUN, MORO, BBEND) are themselves likely a result of a relatively recent natural founder event, geographically isolated from the others. The sites with high heterozygosity are in more suitable, connected habitat.
What you don't want to see is bubble size correlated with the patterns of heterozygosity. Here it's all good. I'll send a couple of bad ones by email, you'll understand.
CATG: oh ok, that's easy then... give me 30 min, I'll draft something.
Ok, so I'll now parse the CATG column into 4 columns. What's nice is that you can recalibrate the REF/ALT alleles based on the actual allele counts, not just individuals, which is way better and less biased for low-coverage data.
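Roughly the idea, as a sketch (not the actual radiator code; I'm assuming comma-separated counts in C,A,T,G order, adjust the separator if your VCF differs):

library(dplyr)
library(tidyr)

# Two markers; row 1 is the 50-reads-A vs 1-read-C example above
geno <- tibble::tibble(
  REF  = c("A", "G"),
  ALT  = c("C", "T"),
  CATG = c("1,50,0,0", "0,0,2,7")
)

# Split CATG into one depth column per base
geno <- tidyr::separate(geno, CATG, into = c("C", "A", "T", "G"),
                        sep = ",", convert = TRUE)

# Pick the depth matching each marker's REF and ALT base
depth <- as.matrix(geno[, c("C", "A", "T", "G")])
geno$ALLELE_REF_DEPTH <- depth[cbind(seq_len(nrow(geno)), match(geno$REF, colnames(depth)))]
geno$ALLELE_ALT_DEPTH <- depth[cbind(seq_len(nrow(geno)), match(geno$ALT, colnames(depth)))]
geno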
Thanks!
I feel bad making all these requests--I hope they've been useful and not tedious--but one other request, if it's easy to implement: fastStructure format output. Not sure why it's necessary, but fastStructure ignores the first 6 columns, so you'd need to add 5 blank columns before the first marker column. Otherwise it's just a standard Structure file with two rows per individual (see the sketch after the example file below). (Alternatively, fastStructure accepts a .bed file.)
from https://rajanil.github.io/fastStructure/
While the original Structure program allowed for a more flexible input format, fastStructure expects a more specific Structure-like input format. Specifically, rows in the data file correspond to samples, with two rows per sample (note that only diploids are handled by this software), and columns correspond to SNPs. The first 6 columns of the file will be ignored; these typically would include IDs, metadata, etc. This software only handles bi-allelic loci. The two alleles at each locus can be encoded as desired; however, missing data should be encoded as `-9'.
An example of a faststructure file I made by converting a vcf to faststructure in PGDSpider (faststructure just ignores the numbers there in the extra columns, I believe):
OCcoastsub48_85clust_75pctmsng_oneSNPmac2_20190205.recode.str.txt
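A hedged sketch of that padding (not radiator's write_faststructure; geno and ids are hypothetical, with geno a two-rows-per-individual matrix of allele codes and -9 for missing data):

write_faststructure_sketch <- function(geno, ids, path) {
  stopifnot(nrow(geno) == length(ids))           # two rows per sample
  pad <- matrix(0L, nrow = nrow(geno), ncol = 5) # the 5 ignored columns
  out <- cbind(ids, pad, geno)                   # id + 5 dummies + SNPs
  write.table(out, file = path, quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = FALSE)
}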
Immensely grateful for all the fixes and help you've given me so far, thanks Thierry
Look at this link for fastStructure ;) Search my name on the page... I was using it a bit a while ago... I should have an output for it somewhere, or it was importing one already output by radiator. Will check.
[x] Will update radiator very soon with these changes:
Will now import and parse the VCF metadata CATG
[screenshot: doc for the CATG metadata]
[screenshot: function execution, to know what to expect]
The CATG columns are automatically matched with the REF and ALT columns to generate ALLELE_REF_DEPTH and ALLELE_ALT_DEPTH columns... this is what you need for radiator::detect_allele_problems.
Documentation for the markers metadata parsing is also updated...
[screenshot: doc for VCF import and parsing of markers metadata]
Accidentally clicked close, whoops. Reopened.
Fantastic, looking forward to the update!
write_faststructure: available as a standalone. Let me know if it works as expected; I'll include it as an option in genomic_converter after.
Hi Thierry,
A few small bugs I've noticed with filter_rad() in interactive mode:
-Any filter that removes samples/individuals causes failure in the next filtering step:
-detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:
-May need to remove strata=NULL from filter_hwe in filter_rad():
-when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.
-in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.
-Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.
-purely aesthetic, but when running on Windows, the font choice in the plots (Helvetica?) causes warnings:
Not issues, but questions/suggestions/requests:
-Are there better explanations for how outliers/q75/iqr are calculated and applied? Is outliers just outside the 95% CI?
-filter_coverage step returns a plot of max mean coverage; a plot for min mean would be useful
-In filter_genotyping, is the threshold applied per-strata or only on the total?
-Long LD filtering appears to work, but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually, I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering?
-outputting the full function call with args entered during the interactive session would help with reproducibility
-asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time