suchestoncampbelllab / gwasurvivr

GWAS Survival Package in R
11 stars 12 forks source link

Error in genotypes[!blankSNPs, cox.params$ids] : subscript out of bounds #26

Closed lucavd closed 1 year ago

lucavd commented 1 year ago

I see this is the same issue of #24 but I give you some more context:

plinkCoxSurv(bed.file = bed.file,
             covariate.file = covariate.file_2,
             id.column = "IID",
             sample.ids = samples.id,
             time.to.event = "Time_event",
             event = "status_LoA",
             covariates = c("Frame"),
             inter.term = NULL,
             print.covs = "only",
             out.file = "COX_completo_pca_proned_only_frame_0_05",
             maf.filter = 0.05,
             flip.dosage = FALSE,
             verbose = TRUE)

I checked the first time this error was reported on issue #4 but

> all.equal(samples.id, covariate.file_2$IID) [1] TRUE

This code was working a year ago so I:

All gave the same error.

I also tried simulated bed-bim-fam files with the following code:

library(genio)
# write your genotype matrix stored in an R native matrix

# (here we create a small example with random data)
# create 10 random genotypes
X <- rbinom(10, 2, 0.5)
# replace 3 random genotypes with missing values
X[sample(10, 3)] <- NA
# turn into 5x2 matrix
X <- matrix(X, nrow = 5, ncol = 2)

# also create a simulated phenotype vector
pheno <- rnorm(2) # two individuals as above

# write simulated data to all BED/BIM/FAM files in one handy command
# missing BIM and FAM columns are automatically generated
# data dimensions are validated for provided data
write_plink('random', X, pheno = pheno)

Same error

michiganCoxSurv example works as intended

Provided example works as intended.

Full error output:

Covariates included in the models are: Frame
557 samples are included in the analysis
Start file conversion from PLINK BED to SNP GDS ...
    BED file: "C:/Users/LucaVedovelli/Documents/GitHub/2022.GWAS_duchenne/data raw/plink_DMD_clean_pca_proned.bed"
        SNP-major mode (Sample X SNP), 390.7M
    FAM file: "C:/Users/LucaVedovelli/Documents/GitHub/2022.GWAS_duchenne/data raw/plink_DMD_clean_pca_proned.fam"
    BIM file: "C:/Users/LucaVedovelli/Documents/GitHub/2022.GWAS_duchenne/data raw/plink_DMD_clean_pca_proned.bim"
Thu Nov  3 22:54:11 2022     (store sample id, snp id, position, and chromosome)
    start writing: 637 samples, 2560573 SNPs ...
[==================================================] 100%, completed, 14s 
Thu Nov  3 22:54:25 2022    Done.
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'C:\Users\LUCAVE~1\AppData\Local\Temp\RtmpchWOlj\4bc44a76303c.gds' (404.1M)
    # of fragments: 43
    save to 'C:\Users\LUCAVE~1\AppData\Local\Temp\RtmpchWOlj\4bc44a76303c.gds.tmp'
    rename 'C:\Users\LUCAVE~1\AppData\Local\Temp\RtmpchWOlj\4bc44a76303c.gds.tmp' (404.1M, reduced: 276B)
    # of fragments: 20
***** Compression time ******
User:54.91
System: 2.02
Elapsed: 57.5
*****************************
Analyzing part 1/257...
Error in genotypes[!blankSNPs, cox.params$ids] : subscript out of bounds

Pre processing code:

bed.file <- here::here('data raw/plink_DMD_clean_pca_proned.bed')

covariate.file_2 <- read.table(here::here('data raw/dataset_corretto_completo.txt'), sep="\t", header=TRUE) |> 

  mutate(IID = as.character(IID),
         Time_event = as.numeric(Time_event),
         status_LoA = as.integer(status_LoA)) |> 

  filter(!is.na(Time_event)) |> 

  filter(!is.na(status_LoA)) |> 

  mutate(status_LoA = ifelse(status_LoA == 1, 0L, 1L))

samples.id <- covariate.file_2$IID |> as.character()

Covariate File:

image

lucavd commented 1 year ago

I just update and close this issue.

The problem was in the .bed file where some IDs were missing compared to the pheno file.

I found out it using the library(genio) and read_plink function. I got the IDs from the .fam file and then compared them to the pheno file and I found the problem.

I hope this can help some others

Bests and thank you for your work

L