Duplicated sites in pixy output using bed intervals?

jsalt commented 3 years ago

Hi there,

I have a question about how pixy interprets bed intervals / concerns about potential duplication of sites in the output. Some context:

After using pixy to estimate Fst in 10kbp windows across the whole genome, I identified windows with significant outliers and would now like to calculate Fst for each site within those windows. To do this, I used bash/R to generate bed files with one bp intervals (0-indexed) for all sites within each window of interest. The bed files look like this (except properly tab-delimited):

VONY02000013.1 16930000 16930001
VONY02000013.1 16930001 16930002
VONY02000013.1 16930002 16930003

I'm then giving this bed file to pixy with the following command:

pixy --stats pi fst dxy \
--vcf 13-tissues-allsites-5dp-q30-75comp.recode.vcf.gz \
--populations tiss-phen.txt \
--bed_file sc13-sites-0.bed \
--output_folder tissues-pixy \
--output_prefix sc13-outliers-pixy \
--n_cores 20

Originally I tried providing a bed file with a single interval for each 10kbp window and setting the window_size to 1, but got an error that you cannot use both window_size and bed_file. The 1bp interval bed file strategy seemed more efficient to me than providing each interval separately because I have a bunch of intervals and they're not all adjacent (trying to avoid calculating for windows that I know are unimportant, basically).

I'm a little confused by the output, because it seems like most sites are listed twice with adjacent 1bp intervals, while others are listed once, but contain 2 SNPs (see screenshot). I didn't notice this until I filtered the output to include only those sites with significant Fst values and used those coordinates to output a vcf file of those sites. The vcf files always has ~half the number of sites than the number pixy identified (e.g. if I have 350 significant intervals in my pixy output, I get a vcf file with 181 sites). At first I thought it was an indexing issue, but after outputting a bunch of test cases I have confirmed that pixy says I have (identical Fst) outliers at, say, both sites 10-11 and 11-12, but when I filter my vcf it's just a single site at 10-11 (and 11-12 might be an invariant site, or might not be present in my dataset at all). Unless I'm missing something, it seems like based on pixy output alone I will over-estimate (by ~100%) the number of fixed SNPs in my dataset, for example.

I don't know if this is expected behavior and I'm just confused about how bed intervals work, or if this is some kind of bug, but any clarification you can offer would be much appreciated! Is there a better way to take the output of windowed Fst calculations and get site-specific estimates for those intervals?

Thanks so much in advance, I've really been enjoying using pixy and appreciate the great plotting tutorials!!

Jessie

ksamuk commented 3 years ago

Hi Jessie,

So with the bed files, pixy should (!) interpret everything as inclusive intervals (with the indexing in whatever system your genomic coordinates are in), so:

VONY02000013.1 16930000 16930001 VONY02000013.1 16930001 16930002

would share the 16930001 site. We should probably lay this out more explicitly in the manual, as this is perhaps non-standard (it's technically consistent with the BED spec, at least we thought it was?). Anyway, does that explain the behavior you're seeing, or is there something else going wrong? If not, it might help to have a look at your VCF for debugging purposes.

I think the easiest way to do what you want to do is to use the --sites_file option and have --window_size 1. That should give you single site estimates of FST limited to specific sites.

ksamuk commented 2 years ago

Hi Jessie, did this resolve your issue? Let me know if something seems wrong and we can put together a fix.

ksamuk commented 2 years ago

Closing this for now, let me know if you run into any other problems.

ksamuk / pixy

Duplicated sites in pixy output using bed intervals? #46