omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)

Error when running polyfun with custom created annotations #106

Closed: mkoromina closed this issue 2 years ago

mkoromina commented 2 years ago

Hi Omer,

An issue related to running polyfun.py with custom-made annotations: the script exits with the following error:

```
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'PA')
```

After unzipping and re-gzipping the weight files and rerunning polyfun, the following error occurs instead:

```
File "pandas/_libs/parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 740, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```
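
For what it's worth, the two errors point in opposite directions: Parquet files begin with the magic bytes b'PAR1' (hence "Not a gzipped file (b'PA')"), while gzip files begin with bytes 1F 8B (hence the 0x8b byte that the UTF-8 reader choked on). A minimal sketch of a format check based on these magic bytes (the helper and file name are hypothetical, not part of PolyFun):

```python
# Minimal format sniffer based on leading magic bytes. The function name
# and example path are hypothetical; this is not part of PolyFun.
def sniff_format(path):
    with open(path, 'rb') as f:
        magic = f.read(4)
    if magic[:2] == b'\x1f\x8b':   # gzip magic bytes
        return 'gzip'
    if magic == b'PAR1':           # Parquet magic bytes
        return 'parquet'
    return 'unknown'

print(sniff_format('annotations.1.annot.gz'))
```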

Any ideas on what could be wrong here? Many thanks, Maria

omerwe commented 2 years ago

Hi @mkoromina,

Can you run the test script (test_polyfun.py) without a problem? If you can, there might be an illegal character in one of your files. It's tricky to debug this, but you could e.g. take only the first 100 SNPs in each file and see if it works then, and then gradually add more lines until you find the line that causes the problem (see the sketch below).
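
A minimal sketch of that bisection, assuming the annotations are gzipped tab-separated files with a header row (file names are hypothetical):

```python
import pandas as pd

# Keep only the first 100 SNPs of an annotation file and test PolyFun on
# the truncated copy; increase nrows gradually to home in on a bad line.
# File names are hypothetical.
df = pd.read_csv('my_annotations.1.annot.gz', sep='\t', nrows=100)
df.to_csv('my_annotations.head100.1.annot.gz', sep='\t', index=False,
          compression='gzip')
```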

Sorry I can't help more. If you can create a small reproducible example that you can send to me (oweissbrod@hsph.harvard.edu) I can take a look.

mkoromina commented 2 years ago

Hi @omerwe,

Thank you for your quick reply! The issue was resolved as soon as I converted all the .gz files to .parquet.
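
For reference, a minimal sketch of such a conversion, assuming the .gz files are gzipped tab-separated tables (file names are hypothetical; pandas' to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

# Convert a gzipped TSV annotation file to Parquet. File names are
# hypothetical; requires pyarrow or fastparquet for to_parquet().
df = pd.read_csv('my_annotations.1.annot.gz', sep='\t')
df.to_parquet('my_annotations.1.annot.parquet', index=False)
```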

May I take this opportunity to highlight another issue, though. As soon as I use the exact same files to run the functional enrichment analysis (step 5 on your wiki page), the analysis begins and seems to proceed through merging with regression SNP LD, but then fails with:

```
File "/path/to/polyfun/ldsc_polyfun/jackknife.py", line 110, in __init__
    raise ValueError('Must specify either n_blocks are separators.')
ValueError: Must specify either n_blocks are separators.
```

The script I am running is:

```
./ldsc.py \
    --out /path/to/output/ldsc-enrich-prefix \
    --h2 /path/to/sumstats \
    --ref-ld-chr /path/to/ld-scores-prefix \
    --w-ld-chr /path/to/weights-prefix \
    --overlap-annot \
    --not-M-5-50
```

To troubleshoot, I tried adding the flag --n-blocks 200, but it seemed to make no difference.

Happy to send a small reproducible example, but I was wondering whether this could be an issue with the jackknife code and recent pandas updates (which would affect both analyses, since the inputs are the same)!

Many thanks, Maria

omerwe commented 2 years ago

@mkoromina can you reproduce this problem using the example data provided with PolyFun? If not, do you think you could create a small reproducible example and send it to me (oweissbrod@hsph.harvard.edu)? This should help diagnose the problem.

mkoromina commented 2 years ago

@omerwe, thank you very much for this. Let me check whether I am allowed to share some of the data, and I will get back to you with a reproducible example within this week (asap!). In the meantime, may I note this: there are no issues when I run ldsc.py (step 5 in your wiki) with the example data, so the issue may lie in my data. Do you think the error could stem from the fact that the LD score and weight files contain only one annotation category?

Many thanks in advance! Maria

omerwe commented 2 years ago

@mkoromina can it be that (almost) all SNPs were filtered out during preprocessing? Can you please send the full output of the command that you ran? There may be some clues in there.

mkoromina commented 2 years ago

Hi @omerwe, sure, please find below the full log from running the above-mentioned command:

```
Beginning analysis at Tue Apr 26 10:50:42 2022
Reading summary statistics from /path/to/stats/stats_neff.munged.parquet ...
Read summary statistics for 7513568 SNPs.
Reading reference panel LD Score from /path/to/ldscores.[1-22] ...
Read reference panel LD Scores for 14596097 SNPs.
Reading regression weight LD Score from /path/to/weights.[1-22] ...
Read regression weight LD Scores for 30339720 SNPs.
After merging with reference panel LD, 7513511 SNPs remain.
After merging with regression SNP LD, 7513511 SNPs remain.
Using two-step estimator with cutoff at 30.
Traceback (most recent call last):
  File "ldsc.py", line 690, in <module>
    sumstats.estimate_h2(args, log)
  File "/path/to/polyfun/ldsc_polyfun/sumstats.py", line 337, in estimate_h2
    nnls_exact=args.nnls_exact
  File "/path/to/polyfun/ldsc_polyfun/regressions.py", line 414, in __init__
    nnls_exact=nnls_exact
  File "/path/to/polyfun/ldsc_polyfun/regressions.py", line 216, in __init__
    x1, yp1, update_func1, n_blocks, slow=slow, w=initial_w1)
  File "/path/to/polyfun/ldsc_polyfun/irwls.py", line 67, in __init__
    x, y, update_func, n_blocks, w, slow=slow, separators=separators)
  File "/path/polyfun/ldsc_polyfun/irwls.py", line 128, in irwls
    x, y, n_blocks, separators=separators)
  File "/path/to/polyfun/ldsc_polyfun/jackknife.py", line 329, in __init__
    Jackknife.__init__(self, x, y, n_blocks, separators)
  File "/path/to/polyfun/ldsc_polyfun/jackknife.py", line 110, in __init__
    raise ValueError('Must specify either n_blocks are separators.')
ValueError: Must specify either n_blocks are separators.

Analysis finished at Tue Apr 26 10:55:44 2022
Total time elapsed: 5.0m:1.9399999999999977
```

Many thanks in advance!

omerwe commented 2 years ago

@mkoromina thanks, this was helpful. I believe I found and fixed the error. Can you please git pull and retry?

I hope it works. If I understand correctly, you use only a single annotation; the code had a bug under this scenario. In general, though, we don't recommend running LDSC with only a single annotation. At the very least I would use two annotations: a "base" annotation (with a value of 1 for all SNPs) and the annotation of your interest (see the sketch below). In any case, please let me know if the problem is fixed.
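
A minimal sketch of adding such a base annotation, assuming Parquet annotation files (file and column names are hypothetical; the matching LD score files would presumably need a corresponding base column as well):

```python
import pandas as pd

# Add a constant "base" annotation (value 1 for every SNP) next to the
# existing annotation of interest. File/column names are hypothetical.
df = pd.read_parquet('my_annotations.1.annot.parquet')
df['base'] = 1
df.to_parquet('my_annotations.base.1.annot.parquet', index=False)
```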

mkoromina commented 2 years ago

Many thanks @omerwe! I can confirm that it is solved now; truly appreciate your time and effort on this! 👍