ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

KeyError: nan #31

Closed djeffares closed 3 years ago

djeffares commented 3 years ago

Dear Kieran Samuk

Thanks for making pixy, it is a very valuable tool.

I am having an error with running pixy that I can't diagnose. I would very much appreciate some guidance. At the moment pixy starts to calculate stats, then fails with: KeyError: nan. The only outputs file (pixy_tmpfile_8ba3a11fae0041409dabd3d62ea4f743.tmp) is blank. All the details are below.

best wishes, Daniel Jeffares

I am using an all sites bgzipped, tabix-indexed vcf created with freebayes: freebayes -f hg38_pvivP01.fa bams/*bam \ --min-mapping-quality 30 \ --min-coverage 10 \ --strict-vcf \ --gvcf \ --report-monomorphic \ --vcf PvP01_01_v1.vcf &

I am running pixy on a mac. My command is: pixy --stats pi fst dxy \ --vcf PvP01_01_v1.vcf.gz \ --populations popfile2.txt \ --window_size 10000 \ --output_prefix pixytest

The error code I get is: [pixy] pixy 1.0.4.beta1 [pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters... [pixy] Checking write access...OK [pixy] Checking CPU configuration...OK [pixy] Checking for invariant sites...OK [pixy] Checking chromosome data...OK [pixy] Checking intervals/sites...OK [pixy] Checking sample data...OK [pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi, fst, dxy [pixy] Data set contains 3 population(s), 1 chromosome(s), and 90 sample(s) [pixy] Window size: 10000 bp

[pixy] Started calculations at 18:10:55 on 2021-05-07 [pixy] Using 1 out of 8 available CPU cores

[pixy] Processing chromosome/contig PvP01_01_v1... [pixy] Calculating statistics for region PvP01_01_v1:1-156579... KeyError: nan

ksamuk commented 3 years ago

Hi there! It is not immediately clear what caused that error. Would it be possible to send me a subset (~10000 lines) of your VCF, and I'll see if I can diagnose the issue on my end?

djeffares commented 3 years ago

Hi Kieran

Thanks so much for getting back to me. I tried with some vcfs made using bcftools this time as described in the docs:

bcftools mpileup -f hg38_pvivP01.fa -b bamlist --regions PvP01_01_v1:10000-20000 | \ bcftools call -m -Oz -f GQ -o PvP01_01_v1.10000-20000.bcftools.vcf.gz --output-type z &

And another with ploidy 1 (as my species is haploid). vcfs attached.

Here are the files. This time I get a more informative error:

[pixy] pixy 1.0.4.beta1 [pixy] See documentation at https://pixy.readthedocs.io/en/latest/ /Users/ucbtdje/opt/anaconda3/lib/python3.8/site-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">\n' warnings.warn('invalid INFO header: %r' % header)

[pixy] Validating VCF and input parameters... [pixy] Checking write access...OK [pixy] Checking CPU configuration...OK [pixy] Checking for invariant sites...OK [pixy] Checking chromosome data...OK [pixy] Checking intervals/sites...OK [pixy] Checking sample data...OK [pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi, fst, dxy [pixy] Data set contains 3 population(s), 1 chromosome(s), and 90 sample(s) [pixy] Window size: 1000 bp

[pixy] Started calculations at 11:44:51 on 2021-05-08 [pixy] Using 1 out of 8 available CPU cores

[pixy] Processing chromosome/contig PvP01_01_v1... [pixy] Calculating statistics for region PvP01_01_v1:1-20000... /Users/ucbtdje/opt/anaconda3/lib/python3.8/site-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">\n' warnings.warn('invalid INFO header: %r' % header) KeyError: nan

best wishes /Dan

On 7 May 2021, at 23:51, Kieran Samuk @.***> wrote:

Hi there! It is not immediately clear what caused that error. Would it be possible to send me a subset (~10000 lines) of your VCF, and I'll see if I can diagnose the issue on my end?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ksamuk/pixy/issues/31#issuecomment-834836447, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD2HHSM7ZJ3O2HVA5HJMDRLTMRVFXANCNFSM44KUDMQA.

ksamuk commented 3 years ago

Hi Dan, I am not seeing your attachments, can you provide a dropbox/drive link or email them to me at ksamuk@gmail.com? I am guessing the problem has something to do with how haploid genotypes are represented.

djeffares commented 3 years ago

Thanks - I shared the vcfs and indexes.

best wishes /Dan

On 8 May 2021, at 15:39, Kieran Samuk @.***> wrote:

Hi Dan, I am not seeing your attachments, can you provide a dropbox/drive link or email them to me at @. @.>? I am guessing the problem has something to do with how haploid genotypes are represented.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ksamuk/pixy/issues/31#issuecomment-835387560, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD2HHSPTO5FNRSF5CQTB3CDTMVEJ5ANCNFSM44KUDMQA.

ksamuk commented 3 years ago

Hi Dan,

Just ran the sample data you sent me, and I haven't been able to reproduce the error. Can you confirm that you get the error with the sample data? If not, can you send me some data that does produce the error?

Edit: Might need the populations file also, thanks!

djeffares commented 3 years ago

Hi Kieran

I just ran pixy with these two files (one haploid encoded, one diploid, same data). These are the ones I shared (links and commands below). Still the same error. Could it be my installation, or my mac? If so, I’m happy to try on my universities linux server (once they do the install).

What do you think? I don’t want to waste your time unnecessarily.

Files: PvP01_01_v1.10000-20000.bcftools.haploid.vcf.gz PvP01_01_v1.10000-20000.bcftools.vcf.gz

Links: https://drive.google.com/file/d/1Jc4u4OZIqvP7N1gfZVSMq8GVYRWl8GP3/view?usp=sharing https://drive.google.com/file/d/1LsyWlPo9lwfMuvzPRRQvUWbAuZo0S5Wf/view?usp=sharing https://drive.google.com/file/d/1ZTGWgSf0VQmmdy8qGCfom8OjQnLv8g5s/view?usp=sharing https://drive.google.com/file/d/1sjWMLpTwjJcT0OD3bao3IMhbkmBBXdKe/view?usp=sharing

pixy with diploid VCF

pixy --stats pi fst dxy \ --vcf PvP01_01_v1.10000-20000.bcftools.vcf.gz \ --populations popfile2.txt \ --window_size 1000 \ --output_prefix Pv01_01_v1.10000-20000

pixy with haploid VCF

pixy --stats pi fst dxy \ --vcf PvP01_01_v1.10000-20000.bcftools.haploid.vcf.gz \ --populations popfile2.txt \ --window_size 1000 \ --output_prefix Pv01_01_v1.10000-20000.haploid

best wishes /Dan

On 9 May 2021, at 17:45, Kieran Samuk @.***> wrote:

Hi Dan,

Just ran the sample data you sent me, and I haven't been able to reproduce the error. Can you confirm that you get the error with the sample data? If not, can you send me some data that does produce the error?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ksamuk commented 3 years ago

Hi Dan,

Very mysterious! I will see if I can get the error on my Mac today. Can you possibly share your populations file?

djeffares commented 3 years ago

Hi Kieran

Here is my populations file. Our IT support is about to install pixy on a server, it may be that it works there.

best wishes /Dan

On 10 May 2021, at 15:17, Kieran Samuk @.***> wrote:

Hi Dan,

Very mysterious! I will see if I can get the error on my Mac today. Can you possibly share your populations file?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ksamuk/pixy/issues/31#issuecomment-836756168, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD2HHSMUSOK2A47CDPZR2P3TM7THTANCNFSM44KUDMQA.

M_17_5636 1 M_17_5637 1 M_17_5638 1 M_17_5639 1 M_18_0041 1 M_18_0042 1 M_18_0043 1 M_18_0044 1 M_18_0045 1 M_18_0046 1 M_18_0047 1 M_18_0048 1 M_18_0049 1 M_18_0050 1 M_18_0051 1 M_18_0052 1 M_18_0053 1 M_18_0054 1 M_18_0055 1 M_18_0056 1 M_18_0057 1 SRR1562517 2 SRR1562518 2 SRR1562520 2 SRR1562526 2 SRR1562567 2 SRR1562605 2 SRR1562613 2 SRR1562614 2 SRR1562624 2 SRR1562844 2 SRR1562851 2 SRR1562939 2 SRR1562958 2 SRR1562973 2 SRR1562974 2 SRR1564529 2 SRR1564547 2 SRR1564577 2 SRR1564665 2 SRR1567977 2 SRR1568103 2 SRR1568105 2 SRR1568106 2 SRR1568114 2 SRR1568115 2 SRR1568126 2 SRR1568128 2 SRR1568149 2 SRR1568150 2 SRR1568162 2 SRR1568169 2 SRR1568173 2 SRR1568177 2 SRR1568178 2 SRR1568180 2 SRR1568192 2 SRR1568193 2 SRR1568201 2 SRR1568204 2 SRR1568216 2 SRR1568218 2 SRR1738015 2 SRR1738039 2 SRR1738040 2 SRR1740119 2 SRR1740379 2 SRR1740392 2 SRR1740488 2 SRR1759047 2 SRR1759307 2 SRR1759523 2 SRR1759592 2 SRR332408 2 SRR332409 2 SRR332410 2 SRR332559 2 SRR332560 2 SRR332561 2 SRR332563 2 SRR332570 2 SRR340133 2 SRR340134 2 SRR570031 2 SRR572650 2 SRR572651 2 SRR575087 2 SRR575089 2 SRR828416 2 Transfer.PvP01_00_99.final

ksamuk commented 3 years ago

Hi Dan, is the "Transfer.PvP01_00_99.final" line in your populations file? If so, it doesn't have a population field. Can you try running without that line (or assign it to a population)?

djeffares commented 3 years ago

Hi Kieran

That was it! Works now with both haploid and diploid vcfs, made with either bcftools or freebayes.

btw: the data is from this paper & reviews wanted dxy (rather than Fst) and were concerned about coverage, so pixy is perfect. thanks!

https://www.biorxiv.org/content/10.1101/841171v3 https://www.biorxiv.org/content/10.1101/841171v3

best wishes /Dan

On 10 May 2021, at 15:37, Kieran Samuk @.***> wrote:

Hi Dan, is the "Transfer.PvP01_00_99.final" line in your populations file? If so, it doesn't have a population field. Can you try running without that line (or assign it to a population)?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ksamuk/pixy/issues/31#issuecomment-836778950, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD2HHSJU5F762SLZUBR4LCTTM7VR7ANCNFSM44KUDMQA.

ksamuk commented 3 years ago

Awesome! I'll work on adding a more informative error message to catch that in the future. Thanks for your help tracking this down.