Overlapping regions in create_finemapper_jobs

omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)

MIT License

89 stars 22 forks source link

Overlapping regions in create_finemapper_jobs #78

Closed Ojami closed 2 years ago

Ojami commented 2 years ago

Hi,

Is there any specific reason why create_finemapper_jobs.py creates overlapping regions in --jobs-file ? For instance, head -2 polyfun_all_jobs.txt

python3 finemapper.py --chr 22 --start 42000001 --end 45000001 --out tmp/polyfun_all.chr22.42000001_45000001.gz --ld UKBB_LD/chr22_42000001_45000001 --method susie --sumstats sumstat/sumstat.22.snpvar_ridge_constrained.gz --n 374851 --memory 1 --max-num-causal 5
python3 finemapper.py --chr 22 --start 43000001 --end 46000001 --out tmp/polyfun_all.chr22.43000001_46000001.gz --ld UKBB_LD/chr22_43000001_46000001 --method susie --sumstats sumstat/sumstat.22.snpvar_ridge_constrained.gz --n 374851 --memory 1 --max-num-causal 5

give the same credible set. I assume this happens because some index signals (p < --pvalue-cutoff) within the region happen to be within [42000001, 45000001], while others fall within [43000001, 46000001]. Generally speaking, shouldn't the regions be mutually exclusive?

Thanks/Oveis

omerwe commented 2 years ago

The approach we took in our paper is to fine-map each and every region that includes even one SNP with p-value < cutoff (regardless of where the SNP is found in the region), including overlapping regions. The reason is that for highly polygenic traits the concept of "lead SNP" is not well-defined. In other words, there could be many many causal SNPs across the genome, and choosing just one region to fine-map is too arbitrary to identify all of them. The results of fine-mapping are not accurate for SNPs at the edges of a fine-mapped region (because we ignore their LD-neighboring SNPs).

In practice we construct all 3Mb regions (starting 1Mb apart). However, the script drops regions that don't include even a single SNP with p-value < cutoff. More details in the paper :)

Ojami commented 2 years ago

Makes sense! cheers!

Ojami commented 2 years ago

Sorry for reopening this again, but I have kind of similar question for GW fine mapping, specially with aggregate_finemapper_results.py when aggregating distant loci on the same chr. To explain this, I'm using biochemistry_Cholesterol.txt.gz from ployfun paper. If I sort it by chr and PIP:

CHR	BP	SNP	A1	A2	pip	causal_set
1	55505647	rs11591147	G	T	1	1
1	55519174	rs499883	G	A	1	3
1	220970095	rs17850677	T	A	0.99919	2
1	55698134	rs570635280	C	T	0.99425	2
1	220970028	rs2642438	A	G	0.9824	1
1	55520864	rs557435	A	G	0.88311	9
1	145827097	rs71618972	T	G	0.83767	-999
1	55521313	rs472495	G	T	0.66015	5

Based on this table, I would (incorrectly) observe that rs17850677 and rs570635280 both belong to the same credible set, while they're not even LD friend. So, they belong to two different credible sets, and happened to have the same credible set index. This happens because aggregate_finemapper_results.py aggregates all loci within the same chr into a single file, then it becomes difficult to group varinats based on their credible sets (as in this example). Of course this doesn't happen when user runs separate fine mapping per each loci. Am I missing something or is this a bug with aggregate_finemapper_results.py?

Thanks!

omerwe commented 2 years ago

Hi @Ojami ,

Indeed, thanks for the bug report. Since we merge different fine-mapping outputs from different (possibly overlapping) regions, the ids of the credible sets are region-specific. To address this, I now updated the aggregate_finemapper_jobs script to explicitly list the region name in the credible set. Can you please git pull and try the latest version? I'll also update the wiki to explain this subtlety.

Ojami commented 2 years ago

Perfect, thanks!