JNajar commented 5 months ago

I am calculating a PRS for sczhizophrenia, using the script found here: https://choishingwan.github.io/PRS-Tutorial/ldpred/ and using the Genome wide bed file option.

This exact script has been used successfully with a different set of genotype data. The genotype data that i am using has been QC and imputed using Sanger imputation service.

calculate LD

for (chr in 1:22) {

Extract SNPs that are included in the chromosome

ind.chr <- which(info_snp$chr == chr)

ind.chr2 <- info_snp$_NUM_ID_[ind.chr]

Calculate the LD

corr0 <- snp_cor(

genotype,

ind.col = ind.chr2,

ncores = NCORES,

infos.pos = POS2[ind.chr2],

size = 3 / 1000

)

if (chr == 1) {

ld <- Matrix::colSums(corr0^2)

corr <- as_SFBM(corr0, tmp)

} else {

ld <- c(ld, Matrix::colSums(corr0^2))

corr$add_columns(corr0, nrow(corr))

}

} There were 14 warnings (use warnings() to see them)

warnings() Warning messages: 1: NA or NaN values in the resulting correlation matrix. 2: NA or NaN values in the resulting correlation matrix. 3: NA or NaN values in the resulting correlation matrix. 4: NA or NaN values in the resulting correlation matrix. 5: NA or NaN values in the resulting correlation matrix. 6: NA or NaN values in the resulting correlation matrix. 7: NA or NaN values in the resulting correlation matrix. 8: NA or NaN values in the resulting correlation matrix. 9: NA or NaN values in the resulting correlation matrix. 10: NA or NaN values in the resulting correlation matrix. 11: NA or NaN values in the resulting correlation matrix. 12: NA or NaN values in the resulting correlation matrix. 13: NA or NaN values in the resulting correlation matrix. 14: NA or NaN values in the resulting correlation matrix.

We assume the fam order is the same across different chromosomes

fam.order <- as.data.table(obj.bigSNP$fam)

Rename fam order

setnames(fam.order,

c("family.ID", "sample.ID"),

c("FID", "IID"))

6. Perform LD score regression

df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUMID")]

ldsc <- snp_ldsc( ld,

length(ld),

chi2 = (df_beta$beta / df_beta$beta_se)^2,

sample_size = df_beta$n_eff,

blocks = NULL) Error in if (max(abs(pred - pred0)) < 1e-06) break : missing value where TRUE/FALSE needed

summary(ld) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 1.175 10.654 16.905 21.566 26.312 474.208 9980

When i try to remove NAs from ld, then i cannot calculate ldsc:

6. Perform LD score regression

df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUMID")] ldsc <- snp_ldsc( ld2,

length(ld2),

chi2 = (df_beta$beta / df_beta$beta_se)^2,

sample_size = df_beta$n_eff,

blocks = NULL) Error: Incompatibility between dimensions. 'chi2' and 'ld_score' should have the same length.

I have inluced the info on the IID dataset and the obj.bigSNP

str(obj.bigSNP) List of 3 $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields ..$ extptr : ..$ extptr_rw : ..$ nrow : int 3981 ..$ ncol : int 1054151 ..$ type : Named int 1 .. ..- attr(*, "names")= chr "unsigned char" ..$ backingfile : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| truncated ..$ is_read_only: logi FALSE ..$ address : ..$ address_rw : ..$ bk : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| truncated ..$ rds : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| truncated ..$ is_saved : logi TRUE ..$ type_chr : chr "unsigned char" ..$ type_size : int 1 ..$ file_size : num 4.2e+09 ..$ code256 : num [1:256] 0 1 2 NA NA NA NA NA NA NA ... ..and 26 methods, of which 12 are possibly relevant: .. add_columns, as.FBM, bm, bm.desc, check_dimensions, .. check_write_permissions, copy#envRefClass, initialize, initialize#FBM, .. save, show#envRefClass, show#FBM $ fam :'data.frame': 3981 obs. of 6 variables: ..$ family.ID : int [1:3981] 0 0 0 0 0 0 0 0 0 0 ... ..$ sample.ID : chr [1:3981] "1234_1234" "1284_1284" "1300_1300" "1345_1345" ... ..$ paternal.ID: int [1:3981] 0 0 0 0 0 0 0 0 0 0 ... ..$ maternal.ID: int [1:3981] 0 0 0 0 0 0 0 0 0 0 ... ..$ sex : int [1:3981] 0 0 0 0 0 0 0 0 0 0 ... ..$ affection : int [1:3981] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ... $ map :'data.frame': 1054151 obs. of 6 variables: ..$ chromosome : int [1:1054151] 1 1 1 1 1 1 1 1 1 1 ... ..$ marker.ID : chr [1:1054151] "rs3131972" "rs3131969" "rs1048488" "rs12562034" ... ..$ genetic.dist: int [1:1054151] 0 0 0 0 0 0 0 0 0 0 ... ..$ physical.pos: int [1:1054151] 752721 754182 760912 768448 779322 838555 846808 853954 854250 864938 ... ..$ allele1 : chr [1:1054151] "G" "G" "T" "A" ... ..$ allele2 : chr [1:1054151] "A" "A" "C" "G" ...

attr(*, "class")= chr "bigSNP" str(pheno) 'data.frame': 3981 obs. of 3 variables: $ FirstColumn: int 0 0 0 0 0 0 0 0 0 0 ... $ IID : chr "ADCL1_ADCL1" "ADCL2_ADCL2" "ADCL3_ADCL3" "ADCL4_ADCL4" ... $ FID : int 0 0 0 0 0 0 0 0 0 0 ...

Do you possibly know why i am getting these warnings in the LD calculation? What can i do troubleshoot this?

Kind regards,

privefl commented 5 months ago

Could you reformat the code blocks please?

JNajar commented 5 months ago

## calculate LD
for (chr in 1:22) {
     # Extract SNPs that are included in the chromosome
     ind.chr <- which(info_snp$chr == chr)
     ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
     # Calculate the LD
     corr0 <- snp_cor(
             genotype,
             ind.col = ind.chr2,
             ncores = NCORES,
             infos.pos = POS2[ind.chr2],
             size = 3 / 1000
         )
     if (chr == 1) {
        ld <- Matrix::colSums(corr0^2)
         corr <- as_SFBM(corr0, tmp)
     } else {
         ld <- c(ld, Matrix::colSums(corr0^2))
         corr$add_columns(corr0, nrow(corr))
     }
 }
There were 14 warnings (use warnings() to see them)

 warnings()
Warning messages:
1: NA or NaN values in the resulting correlation matrix.
2: NA or NaN values in the resulting correlation matrix.
3: NA or NaN values in the resulting correlation matrix.
4: NA or NaN values in the resulting correlation matrix.
5: NA or NaN values in the resulting correlation matrix.
6: NA or NaN values in the resulting correlation matrix.
7: NA or NaN values in the resulting correlation matrix.
8: NA or NaN values in the resulting correlation matrix.
9: NA or NaN values in the resulting correlation matrix.
10: NA or NaN values in the resulting correlation matrix.
11: NA or NaN values in the resulting correlation matrix.
12: NA or NaN values in the resulting correlation matrix.
13: NA or NaN values in the resulting correlation matrix.
14: NA or NaN values in the resulting correlation matrix.

##I still tried to go through with the script: 
## We assume the fam order is the same across different chromosomes
 fam.order <- as.data.table(obj.bigSNP$fam)

 ## Rename fam order
 setnames(fam.order,
         c("family.ID", "sample.ID"),
         c("FID", "IID"))

 # 6. Perform LD score regression
 df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUM_ID_")]

 ldsc <- snp_ldsc(   ld,
                     length(ld),
                     chi2 = (df_beta$beta / df_beta$beta_se)^2,
                     sample_size = df_beta$n_eff,
                     blocks = NULL)
Error in if (max(abs(pred - pred0)) < 1e-06) break :
  missing value where TRUE/FALSE needed ##got this error instead

length(ld)
[1] 1049911
summary(ld)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.175  10.654  16.905  21.566  26.312 474.208    9980

## Remove all NAs from the ld
ld2 <- na.omit(ld)
length(ld2)
[1] 1039931
 length(ld)
[1] 1049911

## We assume the fam order is the same across different chromosomes
 fam.order <- as.data.table(obj.bigSNP$fam)

## Rename fam order
setnames(fam.order,
        c("family.ID", "sample.ID"),
         c("FID", "IID"))

# 6. Perform LD score regression
df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUM_ID_")]
 ldsc <- snp_ldsc(   ld2,
                     length(ld2),
                   chi2 = (df_beta$beta / df_beta$beta_se)^2,
                     sample_size = df_beta$n_eff,
                     blocks = NULL)
Error: Incompatibility between dimensions.
'chi2' and 'ld_score' should have the same length. ## but this did of course not work

str(obj.bigSNP)
List of 3
 $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
  ..$ extptr      :<externalptr>
  ..$ extptr_rw   :<externalptr>
  ..$ nrow        : int 3981
  ..$ ncol        : int 1054151
  ..$ type        : Named int 1
  .. ..- attr(*, "names")= chr "unsigned char"
  ..$ backingfile : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| __truncated__
  ..$ is_read_only: logi FALSE
  ..$ address     :<externalptr>
  ..$ address_rw  :<externalptr>
  ..$ bk          : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| __truncated__
  ..$ rds         : chr "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/Xinxin_Caroline/data/genotypes/LDPre"| __truncated__
  ..$ is_saved    : logi TRUE
  ..$ type_chr    : chr "unsigned char"
  ..$ type_size   : int 1
  ..$ file_size   : num 4.2e+09
  ..$ code256     : num [1:256] 0 1 2 NA NA NA NA NA NA NA ...
  ..and 26 methods, of which 12 are  possibly relevant:
  ..  add_columns, as.FBM, bm, bm.desc, check_dimensions,
  ..  check_write_permissions, copy#envRefClass, initialize, initialize#FBM,
  ..  save, show#envRefClass, show#FBM
 $ fam      :'data.frame':      3981 obs. of  6 variables:
  ..$ family.ID  : int [1:3981] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ sample.ID  : chr [1:3981] "1234_1234" "1284_1284" "1300_1300" "1345_1345" ...
  ..$ paternal.ID: int [1:3981] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ maternal.ID: int [1:3981] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ sex        : int [1:3981] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ affection  : int [1:3981] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
 $ map      :'data.frame':      1054151 obs. of  6 variables:
  ..$ chromosome  : int [1:1054151] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ marker.ID   : chr [1:1054151] "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
  ..$ genetic.dist: int [1:1054151] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ physical.pos: int [1:1054151] 752721 754182 760912 768448 779322 838555 846808 853954 854250 864938 ...
  ..$ allele1     : chr [1:1054151] "G" "G" "T" "A" ...
  ..$ allele2     : chr [1:1054151] "A" "A" "C" "G" ...
 - attr(*, "class")= chr "bigSNP"
> str(pheno)
'data.frame':   3981 obs. of  3 variables:
 $ FirstColumn: int  0 0 0 0 0 0 0 0 0 0 ...
 $ IID        : chr  "ADCL1_ADCL1" "ADCL2_ADCL2" "ADCL3_ADCL3" "ADCL4_ADCL4" ...
 $ FID        : int  0 0 0 0 0 0 0 0 0 0 ...

privefl commented 5 months ago

Just compute snp_MAF(genotype, ind.col = info_snp$`_NUM_ID_`, ncores = NCORES) and filter out small MAFs from info_snp.

I'll probably need to add this step in the tuto as well.

JNajar commented 5 months ago

Should this step be after perfoming SNP matching?

Like this: sumstats$chr <- as.integer(sumstats$chr) #make chr an integer info_snp <- snp_match(sumstats, map)

snp_MAF(genotype, ind.col = info_snp$_NUM_ID_, ncores = NCORES) # filter out small MAFs from info_snp.

Or after ld calculations?

privefl commented 5 months ago

Before the loop.

JNajar commented 5 months ago

I still get the same errors...

perform SNP matching

sumstats$chr <- as.integer(sumstats$chr) #make chr an integer info_snp <- snp_match(sumstats, map)

snp_MAF(genotype, ind.col = info_snp$_NUM_ID_, ncores = NCORES) # filter out small MAFs from info_snp.

dim(info_snp)

1049911 12

write.table(info_snp, "/project/holstegelab/Share/gwas_array/TOPMED/analyses/JennaNajar/H70-study/results/SCZPRS_LDPred2_sumstats.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names=T)

Assign the genotype to a variable for easier downstream analysis

genotype <- obj.bigSNP$genotypes

Rename the data structures

CHR <- map$chr POS <- map$pos

get the CM information from 1000 Genome

will download the 1000G file to the current directory (".")

POS2 <- snp_asGeneticPos(CHR, POS, dir = ".")

calculate LD

for (chr in 1:22) {

Extract SNPs that are included in the chromosome

ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
# Calculate the LD
corr0 <- snp_cor(
        genotype,
        ind.col = ind.chr2,
        ncores = NCORES,
        infos.pos = POS2[ind.chr2],
        size = 3 / 1000
    )
if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
} else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
}

} There were 14 warnings (use warnings() to see them)

We assume the fam order is the same across different chromosomes

fam.order <- as.data.table(obj.bigSNP$fam)

Rename fam order

setnames(fam.order, c("family.ID", "sample.ID"), c("FID", "IID"))

6. Perform LD score regression

df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUMID")]

ldsc <- snp_ldsc( ld, length(ld), chi2 = (df_beta$beta / df_beta$beta_se)^2, sample_size = df_beta$n_eff, blocks = NULL) Error in if (max(abs(pred - pred0)) < 1e-06) break : missing value where TRUE/FALSE needed

privefl commented 5 months ago

You need to use the result from snp_MAF() to filter info_snp. Look at the tutorial; I've just updated it.

Also, please use the code formatting from GitHub, the '<>' button after selecting your code.

JNajar commented 5 months ago

Thank you for the tutorial

I used this code before the loop:

ind.row <- rows_along(genotype) maf <- snp_MAF(genotype, ind.row = ind.row, ind.col = info_snp$_NUM_ID_, ncores = NCORES) maf_thr <- 1 / sqrt(length(ind.row)) # threshold privefl likes to use info_snp <- info_snp[maf > maf_thr, ]

However, i still get NA or NAN values, but now instead of 14 warnings I have 9.

privefl commented 5 months ago

What do you get for

ind <- which(is.na(ld))
length(ind)
summary(snp_MAF   (genotype, ind.col = info_snp$`_NUM_ID_`[ind]))
summary(big_counts(genotype, ind.col = info_snp$`_NUM_ID_`[ind])[4, ])

JNajar commented 5 months ago

ind <- which(is.na(ld))

length(ind) [1] 6212

summary(snp_MAF (genotype, ind.col = info_snp$_NUM_ID_[ind]))
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01595 0.12572 0.24221 0.25118 0.38015 0.50000

summary(big_counts(genotype, ind.col = info_snp$_NUM_ID_[ind])[4, ]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 0 0 0 0

privefl commented 5 months ago

MAFs are fine, and you have absolutely no missing value, so everything should be fine. I don't know what's going on.

If you want to move on, just do corr0@x[is.na(corr0@x)] <- 0 after computing corr0.

JNajar commented 5 months ago

So strange.

The calculation of corr0 is within the loop, so after the loop you mean? If so, when i do that, im still getting

ldsc <- snp_ldsc( ld, length(ld), chi2 = (df_beta$beta / df_beta$beta_se)^2, sample_size = df_beta$n_eff, blocks = NULL) Error in if (max(abs(pred - pred0)) < 1e-06) break : missing value where TRUE/FALSE needed

privefl commented 5 months ago

Within the loop, just after computing corr0.

JNajar commented 5 months ago

It worked! Thank you so much for your help.

privefl commented 5 months ago

It's just a workaround. It would be good to understand why there are still NaNs values.

JNajar commented 5 months ago

Do you have an idea of where i could start tracking it down?

privefl commented 5 months ago

Take one NaN value in corr0, get the two corresponding column indices in genotype and try to compute cor() between these two variants.

privefl / bigsnpr

LD calculation: NA or NAN values in the resulting correlation matrix #472

calculate LD

Extract SNPs that are included in the chromosome

Calculate the LD

We assume the fam order is the same across different chromosomes

Rename fam order

6. Perform LD score regression

6. Perform LD score regression

perform SNP matching

1049911 12

Assign the genotype to a variable for easier downstream analysis

Rename the data structures

get the CM information from 1000 Genome

will download the 1000G file to the current directory (".")

calculate LD

Extract SNPs that are included in the chromosome

We assume the fam order is the same across different chromosomes

Rename fam order

6. Perform LD score regression