Closed jsalt closed 2 years ago
Hi Jessie,
So with the bed files, pixy should (!) interpret everything as inclusive intervals (with the indexing in whatever system your genomic coordinates are in), so:
VONY02000013.1 16930000 16930001
VONY02000013.1 16930001 16930002
would share the 16930001 site. We should probably lay this out more explicitly in the manual, as this is perhaps non-standard (it's technically consistent with the BED spec, at least we thought it was?). Anyway, does that explain the behavior you're seeing, or is there something else going wrong? If not, it might help to have a look at your VCF for debugging purposes.
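To make the overlap concrete, here is a minimal sketch of the two interpretations. It is not pixy's actual code; the helper names `sites_inclusive` and `sites_half_open` are made up for illustration, and the half-open version is only included for comparison with the usual 0-based, end-exclusive reading of BED.

```python
# Illustrative only: these helpers are hypothetical and not part of pixy.

def sites_inclusive(start, end):
    """Sites covered when both start and end are treated as included."""
    return set(range(start, end + 1))


def sites_half_open(start, end):
    """Sites covered when the end coordinate is excluded (half-open)."""
    return set(range(start, end))


# The two adjacent intervals from the example above.
a = ("VONY02000013.1", 16930000, 16930001)
b = ("VONY02000013.1", 16930001, 16930002)

# Inclusive reading: each interval spans two sites, and they share one.
shared_inclusive = sites_inclusive(a[1], a[2]) & sites_inclusive(b[1], b[2])

# Half-open reading: each interval spans one site, with no overlap.
shared_half_open = sites_half_open(a[1], a[2]) & sites_half_open(b[1], b[2])

print(shared_inclusive)
print(shared_half_open)
```

Under the inclusive reading, a SNP at 16930001 falls inside both intervals, which would be one way to end up with the same site reported in two adjacent rows.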
I think the easiest way to do what you want to do is to use the `--sites_file` option and have `--window_size 1`. That should give you single site estimates of FST limited to specific sites.
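If you already have the 1bp BED intervals, something like the following could convert them into a list of sites. This is a rough sketch under two assumptions that you should check against your data and the pixy docs: that your BED intervals are 0-indexed and end-exclusive (so the 1-based VCF position of a 1bp interval equals its end coordinate), and that the sites file is a headerless, tab-separated file of chromosome and position. The function names `bed_to_sites` and `format_sites` are hypothetical.

```python
# Sketch only: assumes 0-indexed, end-exclusive BED input and a
# headerless CHROM<TAB>POS sites file with 1-based positions.

def bed_to_sites(bed_lines):
    """Return sorted, de-duplicated (chrom, pos) pairs with 1-based positions."""
    sites = set()
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end = line.split()[:3]
        # 0-based half-open [start, end) covers 1-based positions start+1..end.
        for pos in range(int(start) + 1, int(end) + 1):
            sites.add((chrom, pos))
    return sorted(sites)


def format_sites(sites):
    """Render (chrom, pos) pairs as tab-separated lines."""
    return "\n".join(f"{chrom}\t{pos}" for chrom, pos in sites)


example_bed = [
    "VONY02000013.1\t16930000\t16930001",
    "VONY02000013.1\t16930001\t16930002",
]
example_sites = bed_to_sites(example_bed)
print(format_sites(example_sites))
```

The printed lines could be saved to a file and passed to `--sites_file` together with `--window_size 1`. Because the sites are collected in a set, each position appears only once even if the input intervals overlap.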
Hi Jessie, did this resolve your issue? Let me know if something seems wrong and we can put together a fix.
Closing this for now, let me know if you run into any other problems.
Hi there,
I have a question about how pixy interprets bed intervals / concerns about potential duplication of sites in the output. Some context:
After using pixy to estimate Fst in 10kbp windows across the whole genome, I identified windows with significant outliers and would now like to calculate Fst for each site within those windows. To do this, I used bash/R to generate bed files with one bp intervals (0-indexed) for all sites within each window of interest. The bed files look like this (except properly tab-delimited):
I'm then giving this bed file to pixy with the following command:
Originally I tried providing a bed file with a single interval for each 10kbp window and setting `window_size` to 1, but got an error that you cannot use both `window_size` and `bed_file`. The 1bp interval bed file strategy seemed more efficient to me than providing each interval separately because I have a bunch of intervals and they're not all adjacent (trying to avoid calculating for windows that I know are unimportant, basically).
I'm a little confused by the output, because it seems like most sites are listed twice with adjacent 1bp intervals, while others are listed once but contain 2 SNPs (see screenshot). I didn't notice this until I filtered the output to include only those sites with significant Fst values and used those coordinates to output a vcf file of those sites. The vcf file always has about half as many sites as pixy identified (e.g. if I have 350 significant intervals in my pixy output, I get a vcf file with 181 sites). At first I thought it was an indexing issue, but after outputting a bunch of test cases I have confirmed that pixy says I have (identical Fst) outliers at, say, both sites 10-11 and 11-12, but when I filter my vcf it's just a single site at 10-11 (and 11-12 might be an invariant site, or might not be present in my dataset at all). Unless I'm missing something, relying on the pixy output alone would lead me to over-estimate (by ~100%) the number of fixed SNPs in my dataset, for example.
I don't know if this is expected behavior and I'm just confused about how bed intervals work, or if this is some kind of bug, but any clarification you can offer would be much appreciated! Is there a better way to take the output of windowed Fst calculations and get site-specific estimates for those intervals?
Thanks so much in advance, I've really been enjoying using pixy and appreciate the great plotting tutorials!!
Jessie