ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

sites_file argument not working #48

Closed by JesseGarcia562 2 years ago

JesseGarcia562 commented 2 years ago

Describe the bug

I'm trying to use pixy with the "--sites_file" argument, but I keep getting the error: "Exception: [pixy] ERROR: In the absence of a BED file, a --window_size must be specified." I think this comes from the error checking in lines 659-662 of pixy/core.py. I have no --bed_file in my command (as the tutorial for sites_file suggests), and when I try setting window_size to 1, I get "pandas.errors.EmptyDataError: No columns to parse from file". Without a bed_file, I think the code goes straight to checking "if args.window_size is None:", when I think it needs to allow for a sites_file argument.

A reproducible example of the bug

Please include the following so we can debug the issue:

(1) The full command you used to run pixy, including all arguments

    pixy --stats pi \
      --vcf 2018wgs3.ef.rmIndelRepeatsStar.chr4.vcf.gz \
      --populations populations.txt \
      --sites_file chr4_gene_locations.txt

I can email you a Google Drive link with my VCF, populations, and sites files if needed.

OS information

I'm using Mac OS X.

ksamuk commented 2 years ago

Hi Jesse,

The logical flow of all those checks is confusing, but I think this is working as intended: without a bed file (whether or not you have a sites file), you'll need to specify a window size. Otherwise, pixy has no way to know the intervals over which to calculate your summary stats. So, if you want a window size of 1, you should specify --window_size 1 (as you did!).
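To make the rule above concrete, here is a minimal Python sketch of that kind of check. It is not pixy's actual code: the function name `check_interval_args` and its parameters are hypothetical, and only the error message text is taken from this thread.

```python
def check_interval_args(bed_file=None, window_size=None, sites_file=None):
    """Hypothetical stand-in for the check discussed above (not pixy's real code).

    A sites file limits which sites are considered, but it does not define
    intervals, so it cannot replace a BED file or a window size.
    """
    if bed_file is None and window_size is None:
        # sites_file is deliberately not consulted here: it does not
        # say how sites should be grouped into intervals.
        raise ValueError(
            "In the absence of a BED file, a --window_size must be specified."
        )


# A sites file alone is rejected ...
try:
    check_interval_args(sites_file="chr4_gene_locations.txt")
except ValueError as err:
    print("rejected:", err)

# ... while a sites file plus a window size of 1 passes the check.
check_interval_args(sites_file="chr4_gene_locations.txt", window_size=1)
```

Under this reading, passing --window_size 1 together with --sites_file is the expected way to get per-site output without a BED file.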

On to the next problem, your pandas error: I can't reproduce it on my end. Can you post your chr4_gene_locations.txt? The error might also come from your populations.txt file. Have a look at those two files and make sure they are valid tab-separated files (or post them here). You could also try rerunning with the --debug flag to get a traceback of the pandas error.
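One way to sanity-check such input files before running pixy is a small standard-library script like the sketch below. The helper `check_tsv` is hypothetical and not part of pixy, and the expected number of columns per file (two in the usage note) is an assumption you should confirm against the pixy documentation. An "EmptyDataError: No columns to parse from file" from pandas usually means it was handed a file with no parsable content, so the sketch reports empty files as well.

```python
from pathlib import Path


def check_tsv(path, expected_fields):
    """Return (line_number, problem) tuples for a tab-separated text file.

    Hypothetical helper for illustration; expected_fields is whatever
    column count the file is supposed to have.
    """
    text = Path(path).read_text()
    if not text.strip():
        return [(0, "file is empty")]

    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        if line.endswith("\t"):
            problems.append((number, "trailing tab"))
        # Ignore trailing tabs when counting, so they are reported only once.
        fields = line.rstrip("\t").split("\t")
        if len(fields) != expected_fields:
            problems.append(
                (number, f"expected {expected_fields} fields, found {len(fields)}")
            )
    return problems
```

For example, `check_tsv("populations.txt", 2)` would return an empty list for a clean two-column file, and would flag lines that are separated by spaces instead of tabs or that end with a stray tab.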

Let me know how it goes!

JesseGarcia562 commented 2 years ago

Hi,

I checked the populations file and the sites file and verified that both are tab-delimited. I'm including them here at this link: https://drive.google.com/drive/folders/1ex8zMsylNyIfuIHuX3Uuq0ORUnxI8Oj7?usp=sharing . Thanks for your help!

ksamuk commented 2 years ago

Hi Jesse,

Thanks for sending me your data! Interestingly, I wasn't able to reproduce your error. The calculations were slow (single-site mode is still very slow), but they did complete (let me know if you'd like the output file). While I was at it, I added some new optimizations that will speed this type of analysis up in the future.

Re: your error, a few questions:

  1. What OS are you running pixy on?
  2. There is an extra tab at the end of the second line of your populations.txt file. It didn't seem to affect anything for me, but I wonder if this might be connected to the problem.
  3. Can you re-run your analysis with the --debug flag and paste the output here?

ksamuk commented 2 years ago

Just following up here: once your input file issue is resolved, it would probably be worth updating to the new version, 1.2.5.beta1, on conda. The single-site + sites file combination you are using is much faster in the new version.

JesseGarcia562 commented 2 years ago

Updating my pixy to the latest on conda seemed to fix everything! I can now use the sites_file argument.