Closed Ojami closed 1 year ago
@Ojami set_snpid_index doesn't flip A1 or A2, it just creates a unique id for each SNP. This is just our way to address the fact that rsids aren't necessarily unique 🤦 When the software needs to look at Z scores, it does take care to look at which variant is in A1 and which is in A2 (regardless of the unique SNP index).
I hope it's clear, please let me know if not
Thanks for your answer Omer!
Yes that cleared it up! I was thinking more from a matching perspective. As an example, if I flip A1/A2 columns in my input sumstat file (and change BETA and A1FREQ columns), the number of overlapping variants will differ compared to the original file (--allow-missing
). So, ideally (given the same population as UKBB in my example), one would opt for the maximum overlap between sumstat and ldscore files (?).
I guess smart_merge
can do that if finds such non-matching variants between sumstat and ldscore:
if x.index.name == 'snpid' and y.index.name == 'snpid':
out = pd.merge(x, y.drop(columns=['SNP']), how='inner', left_index=True, right_index=True)
df_nonmatching = y[~y.index.isin(x.index)]
x = x[~x.index.isin(out.index)]
df_nonmatching.index = df_nonmatching.index.str.split('.').str[:2].str.join(
'.') + '.' + df_nonmatching.index.str.split('.').str[2:].str[::-1].str.join('.')
df_matching = pd.merge(x, df_nonmatching.drop(columns=['SNP']),
how='inner', left_index=True,
right_index=True)
# flip A1, A2 and sing of Z
if df_matching.shape[0] > 0:
df_matching['Z'] = -df_matching['Z']
df_matching['A1'], df_matching['A2'] = df_matching['A2'], df_matching['A1']
out = pd.concat([out, df_matching], axis=0)
out = out.loc[np.sort(out.index)]
@Ojami flipping A1 and A2 for SNPs shouldn't change the number of matching variants. However, please note that for small indels, the identities of A1 and A2 actually make a real difference, because they allow it to distinguish between deletion (AA->A) and insertion (A->AA). I suspect that the different numbers you've seen are because of small indels.
If you can come up with a simple example that demonstrates these changing numbers (and isn't due to indels), I'm happy to look at it.
@omerwe Yes they are all indels in fact (apologies I forgot to mention that). my snippet above can handle such indels, but it's not computationally friendly. I guess in the end, user must take care of such indels beforehand and flip them if necessary.
On a similar and related note, compute_ld_bgen
would throw error when trying to assert df_snp.shape[0]==1
because of non-unique rsids, which can be fixed by:
df_z['ID'] = df_z[['CHR', 'BP', 'A1', 'A2']].apply(lambda row: '.'.join(row.values.astype(str)), axis=1)
rsids = bfile.fetch(self.chr, locus_start, locus_end)
for snp_i, rsid in enumerate(rsids):
if rsid.rsid not in df_z['SNP'].values: continue
snp_alleles = rsid.alleles
snp_chrom = rsid.chrom
snp_pos = rsid.pos
assert len(snp_alleles) == 2, 'cannot handle SNPs with more than two alleles'
df_snp = df_z.query('SNP == "%s"' % (rsid.rsid))
try:
assert df_snp.shape[0] == 1
except:
# handle non-unique rsids in sumstat
rsid_index = [f"{rsid.chrom}.{rsid.pos}.{alleles}" for alleles in [".".join(rsid.alleles)]]
df_snp = df_snp.loc[df_snp.ID.isin(rsid_index)]
Best/Oveis
Yes they are all indels in fact (apologies I forgot to mention that). my snippet above can handle such indels, but it's not computationally friendly. I guess in the end, user must take care of such indels beforehand and flip them if necessary.
@Ojami I think you might be interested in the flag --allow-swapped-indel-alleles
, which I implemented in #67. This changes the default behavior to be less stringent when matching indels (and thus you wouldn't have to manually flip them beforehand).
@jdblischak Thanks for this! Yes I checked this out as well, but as far as I see, this flag has not ben implemented in polyfun
and is only available in finemapper
. Am I mistaken?
this flag has not ben implemented in polyfun and is only available in finemapper
That's correct. I needed this flag when I was performing non-functional fine-mapping. It makes sense to add it to polyfun.py
as well, since it is calling the same underlying function polyfun_utils.set_snpid_index()
I see! I'll try to add it to polyfun
as well. This will fix the problem with the indels matching.
Thanks a bunch!
I'll try to add it to polyfun as well.
@Ojami Just checking in for a status update. Do you still want to implement the flag and send a Pull Request?
@jdblischak not yet, didn't have much time to check it. But did a quick try and looks polyfun_utils.set_snpid_index()
is not enough and parse.set_snpid_index
should be edited accordingly, since this is the function that poylfun
calls first.
@jdblischak there may be an issue with allow-swapped-indel-alleles
flag when both A1/A2 and A2/A1 (in and del) are present in the sumstats file:
CHR BP SNP N MAF A2 A1 Z
___ _______ _____________ _____ ________ _______ _______ _________
1 5810611 "rs541092568" 35146 0.16259 "T" "TTGTC" -0.056526
1 5810611 "rs541092568" 35146 0.011225 "TTGTC" "T" 0.13351
Since allow_duplicates
is False by default (which makes sense), set_snpid_index
would throw an error in such cases. Maybe user must dedup such indels beforehand (maybe keep the one with lowest P). What do you think?
Best/Oveis
there may be an issue with
allow-swapped-indel-alleles
flag when both A1/A2 and A2/A1 (in and del) are present in the sumstats file:
@Ojami Now I'm a bit confused. Your data clearly distinguish between the insertion and deletion. Each indel has its own MAF and Z score. If these were measuring the same allele, then the MAF should be the same, and the Z score would only have a different sign.
I created the flag --allow-swapped-indel-alleles
(PR #67) for the use case where both my GWAS data and my reference LD matrix only had a single indel variant at the position, and the indel was getting unnecessarily removed simply because the A1/A2 designations were swapped.
Based on skimming the above thread, I think you should be using the default behavior that @omerwe described above:
please note that for small indels, the identities of A1 and A2 actually make a real difference, because they allow it to distinguish between deletion (AA->A) and insertion (A->AA). I suspect that the different numbers you've seen are because of small indels.
But in general, I am only comfortable advising you on the technical aspects of running PolyFun. In other words, I can tell you which flags to use to get a certain behavior. But I don't have the time to determine which behavior is correct for your particular data set. Either you or a colleague will need to determine how to handle these indel variants for your fine-mapping application. For example, if you are most interested in fine-mapping a few particular regions, and those regions don't have these indel variants, then you could save yourself the trouble by simply removing them
Hi,
I may be quite wrong, but I was wondering why
set_snpid_index
inparse
function behaves in a way that flips A1/A2 in the resulting dataframe index for some variants? I have a summary stat file, with this particular SNP (output ofset_snp_index
):Next, when the function parses
baselineLF2.2.UKB.21.l2.ldscore.parquet
, it flips A1/A2 (annotation columns have been removed):So, from PolyFun's perspective, this is a match, however A1/A2 is different between the sumstat and ld.score file. Shouldn't the function flip the sign of Z column for such SNPs as well? Am I missing something?
Thanks Best/Oveis