privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

Error in snp_match() #177

Closed pmk2020 closed 3 years ago

pmk2020 commented 3 years ago

Hi Florian, Firstly, this is a great package and I am excited to use it. I came across the following error while using snp_match()

10,707,059 variants to be matched. 20,946 ambiguous SNPs have been removed. Some duplicates were removed. 122,195 variants have been matched; 0 were flipped and 103,812 were reversed. Error: Not enough variants have been matched.

Not sure, why I am receiving this error message. Any help is appreciated :) Thank you.

privefl commented 3 years ago

Please look at parameter match.min.prop in the documentation. It is probably that the positions are not in the same genome build. You can either convert them to another build using snp_modifyBuild() or use rsids by setting join_by_pos = FALSE instead if they are available in the two datasets (prefer the first option).

How could the documentation be improved to help explaining this?

pmk2020 commented 3 years ago

Thank you, Florian. I have a follow up question.

How much minimum proportion match is good to run LDpred2 on? And would it be ok to have different match.min.prop for each chromosome?

privefl commented 3 years ago

With good sumstats and HM3 variants, you should get at least 70% I guess. The default is 50%.

pmk2020 commented 3 years ago

and it would depend on the chromosome too, right? because I am able to read at 10% matching for chromosome 22

privefl commented 3 years ago

No, I guess it should not.

privefl commented 3 years ago

Have you tried with rsIDs instead? Do you know in which builds are the positions?

pmk2020 commented 3 years ago

Ahh so then maybe its not well-tailored with HM3 variants because 10% matching is quite low, I feel.

They are in build37. I dont have rsids for all SNPs in summary stats so I have not tried it.

pmk2020 commented 3 years ago

Okhay this maybe a basic question, but the genotype file to use for LDPred2, can it be my own genetic data file or does it have to be the genetic file provided by you?

privefl commented 3 years ago

No, you can compute the correlation matrix with whatever data you want, as long as it is not too small, and the ancestry matches the ancestry of the GWAS data from which you got the sumstats.

pmk2020 commented 3 years ago

So do you think 10% minimum matching proportion should be ok to run LDpred2?

privefl commented 3 years ago

I don't think 10% is okay at all. Do you get this with the data you have or with the HM3 data I provide?

pmk2020 commented 3 years ago

With data I have, but it is UK biobank chromosome 22 QC'ed data

pmk2020 commented 3 years ago

This is the code:

info_snp <- snp_match(sumstats, map, match.min.prop = 0.1) 10,707,059 variants to be matched. 19,331 ambiguous SNPs have been removed. Some duplicates were removed. 117,501 variants have been matched; 0 were flipped and 102,410 were reversed.

privefl commented 3 years ago

Okay, I didn't get it, you have only chromosome 22 here... Yeah, 100K variants matched is even too large. You would get at least 400K for the largest chromosomes, and you cannot really run LDpred2 with that many variants. I would recommend restricting to HM3.

pmk2020 commented 3 years ago

Yes, I only have chromosome 22 right now here.

pmk2020 commented 3 years ago

Just to confirm with HM3 you mean the hm3_variants.rds (in the data-raw) folder, right?

privefl commented 3 years ago

No, I mean the info object in the LDpred2 tutorial at https://privefl.github.io/bigsnpr/articles/LDpred2.html.

pmk2020 commented 3 years ago

Thank you, Florian! I will use map.rds for my analysis. Hopefully the matching proportion is not that low then.

pmk2020 commented 3 years ago

Hi Florian, I had a quick question while using the info object. I am following the code: example-with-provided-ldref.R code to run the analyses.

In the underneath code, where does the map_test come from? # Here, you also want to restrict to the variants present in your test data as well. For this, you can use something like in_test <- vctrs::vec_in(df_beta[, c("chr", "pos")], map_test[, c("chr", "pos")]) df_beta <- df_beta[in_test, ]

Also if I am running "auto" , you mentioned that it does not require any validation set in the manuscript, so I am wondering why should there be a test set in above code? I maybe misunderstanding the underlying aspect, therefore any help is much appreciated.

Thank you.

privefl commented 3 years ago

The test set is the set of individuals for which you want to derive the polygenic scores. You need to have the same variants in this set as the ones you used as input for LDpred. Otherwise, you'll get weights for variants you do not have!

pmk2020 commented 3 years ago

Sorry to bother you again, Florian! If I am using map.rds (ld-reference dataset), do I still need to provide an individual level dataset in order to run pred_inf/grid/auto? Since these functions require 'G' which is not present in the map.rds?

Thank you!

privefl commented 3 years ago

Yes, you need a test set, right? What are you using the PGS for?

pmk2020 commented 3 years ago

My test set would be UKBB, which during snp_match() may create the same problem (as above), right?

I am using it for outcome-related phenotypes (e.g. survival)

privefl commented 3 years ago

If it's UKBB, you're fine. All the variants will be available as the LD reference was made from UKBB variants.

pmk2020 commented 3 years ago

Thank you, Florian!