hakyimlab / MetaXcan

MetaXcan software and manuscript

Error on GWAS summary stats imputation #122

Open jcabanad opened 3 years ago

jcabanad commented 3 years ago

Hi

I'm trying to impute summary-stats data. I could run the harmonization pipeline properly, but the imputation script gives me an error when I run it on chr1. I'm using Python/3.8.2-GCCcore-9.3.0 and pyarrow-3.0.0.

```
$ python3 summary-gwas-imputation/src/gwas_summary_imputation.py \
    -by_region_file data/data/eur_ld.bed.gz \
    -gwas_file outpu/harmonized_gwas/Summary_results_iPSYCH_PGC_10_euro_excluding_span_filtered_harmo.txt.gz \
    -parquet_genotype data/data/reference_panel_1000G/chr1.variants.parquet \
    -parquet_genotype_metadata data/data/reference_panel_1000G/variant_metadata.parquet \
    -window 100000 \
    -parsimony 7 \
    -chromosome 22 \
    -regularization 0.1 \
    -frequency_filter 0.01 \
    -sub_batches 10 \
    -sub_batch 0 \
    --standardise_dosages \
    -output results_summary_imputation/ADHD_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz
INFO - Beginning process
INFO - Creating context by variant
INFO - Loading study
INFO - Loading variants' parquet file
Traceback (most recent call last):
  File "summary-gwas-imputation/src/gwas_summary_imputation.py", line 97, in <module>
    run(args)
  File "summary-gwas-imputation/src/gwas_summary_imputation.py", line 60, in run
    results = run_by_region(args)
  File "summary-gwas-imputation/src/gwas_summary_imputation.py", line 40, in run_by_region
    context = SummaryImputationUtilities.context_by_region_from_args(args)
  File "/home/juditc/ADHD/GWAS_TDAH/TDAH/GWAS_TDAH_b38/MetaXcan/MetaXcan_nou/software/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 229, in context_by_region_from_args
    study = load_study(args)
  File "/home/juditc/ADHD/GWAS_TDAH/TDAH/GWAS_TDAH_b38/MetaXcan/MetaXcan_nou/software/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 162, in load_study
    study = Parquet.study_from_parquet(args.parquet_genotype, args.parquet_genotype_metadata, chromosome=args.chromosome)
  File "/home/juditc/ADHD/GWAS_TDAH/TDAH/GWAS_TDAH_b38/MetaXcan/MetaXcan_nou/software/summary-gwas-imputation/src/genomic_tools_lib/file_formats/Parquet.py", line 218, in study_from_parquet
    _v = pq.ParquetFile(variants)
  File "/home/juditc/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 217, in __init__
    self.reader.open(source, use_memory_map=memory_map,
  File "pyarrow/_parquet.pyx", line 949, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
```

Since the problem seems to be due to the size of the file, I also tried to impute chr22, but it gives me a different error:

```
$ python3 summary-gwas-imputation/src/gwas_summary_imputation.py \
    -by_region_file data/data/eur_ld.bed.gz \
    -gwas_file outpu/harmonized_gwas/Summary_results_iPSYCH_PGC_10_euro_excluding_span_filtered_harmo.txt.gz \
    -parquet_genotype data/data/reference_panel_1000G/chr22.variants.parquet \
    -parquet_genotype_metadata data/data/reference_panel_1000G/variant_metadata.parquet \
    -window 100000 \
    -parsimony 7 \
    -chromosome 22 \
    -regularization 0.1 \
    -frequency_filter 0.01 \
    -sub_batches 10 \
    -sub_batch 0 \
    --standardise_dosages \
    -output results_summary_imputation/ADHD_chr22_sb0_reg0.1_ff0.01_by_region.txt.gz
INFO - Beginning process
INFO - Creating context by variant
INFO - Loading study
INFO - Loading variants' parquet file
INFO - Loading variants metadata
Level 9 - Loading row group 21
INFO - Loading regions
INFO - Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO - NumExpr defaulting to 8 threads.
Level 9 - Selecting target regions with specific chromosome
Level 9 - Selecting target regions from sub-batches
Level 9 - generating GWAS whitelist
INFO - Loading gwas
INFO - Acquiring filter tree for 35799 targets
INFO - Processing gwas source
Level 9 - Loaded 6667 GWAS variants
Level 9 - Parsing GWAS
Level 9 - Processing region 1/3 [15927607.0, 17193405.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,15927607.0,17193405.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
Level 9 - Processing region 2/3 [17193405.0, 17813322.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,17193405.0,17813322.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
Level 9 - Processing region 3/3 [17813322.0, 19924835.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,17813322.0,19924835.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
INFO - Finished in 26.57472068723291 seconds
```

Heroico commented 3 years ago

Hi there,

Unfortunately the code is stuck on an older version of pyarrow (0.11.0), and we currently lack the bandwidth to update it. I recommend you use a conda environment with the same versions of the libraries we use (here is the list of libraries and versions for most of MetaXcan).
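(A fail-fast guard along these lines, hypothetical and not part of MetaXcan, can catch the mismatch before a long pipeline run; the 0.11.0 pin comes from the comment above:)

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

def parse_version(v):
    """Turn '0.11.0' into (0, 11, 0) for a simple numeric comparison."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def pyarrow_matches(required="0.11.0"):
    """True only when the installed pyarrow is exactly the pinned version."""
    try:
        installed = version("pyarrow")
    except PackageNotFoundError:
        return False
    return parse_version(installed) == parse_version(required)
```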

jcabanad commented 3 years ago

Hi,

Thanks for your help, I have created a working environment using the commands in the tutorial:

```
conda env create -f /path/to/this/repo/software/conda_env.yaml
conda activate imlabtools
```

It runs the command and creates an output file, but the file is empty and I get the same error as before. For example:

```
Level 9 - Processing region 3/3 [17813322.0, 19924835.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,17813322.0,19924835.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
```

Thank you for your help.

Best,

Judit

jcabanad commented 3 years ago

I was finally able to install pyarrow 0.11.0 on Python/3.7.5, and imputation on chr22 now works.

Thanks!