mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

ZeroDivisionError (0variants [00:00, ?variants/s]) #245

Closed GaloGS closed 9 months ago

GaloGS commented 9 months ago

Dear pyseer developers,

Thanks a lot for this awesome tool, I am thrilled to use it with our dataset, but I am getting an error that I tried to fix without success.

I have a dataset of around 1500 samples. I have created the merged/compressed/indexed VCF using BCFtools as explained in the tutorials. I have also filtered it, so only variants observed in at least 10 samples at >90% allele frequency are included. The VCF seems correct, and the sample names in it, as well as in the phenotype file, do match.

However, when I run the following command:

~/.local/bin/pyseer --vcf dataset.merged.filtered.vcf.gz --phenotypes pheno.txt --wg enet --save-vars ma_snps --save-model model.lasso --min-af 0.9 --alpha 1 > selected.txt

I get the following error:

0variants [00:00, ?variants/s]Traceback (most recent call last):
  File "/home/goig0000/.local/bin/pyseer", line 8, in <module>
    sys.exit(main())
  File "/home/goig0000/.local/lib/python3.7/site-packages/pyseer/__main__.py", line 620, in main
    options.max_missing, options.uncompressed)
  File "/home/goig0000/.local/lib/python3.7/site-packages/pyseer/enet.py", line 86, in load_all_vars
    sample_order)
  File "/home/goig0000/.local/lib/python3.7/site-packages/pyseer/input.py", line 434, in read_variant
    af = float(len(kstrains)) / len(all_strains)
ZeroDivisionError: float division by zero
0variants [00:00, ?variants/s]

I tried other combinations, and also the argument --max-missing 0.999 because I thought that maybe I had too many singletons, but still I have this problem.

Do you have any idea of why this may be happening?

Thank you very much, Galo

mgalardini commented 9 months ago

Thanks for reporting this crash; would you mind sending over your inputs so that I can try to see what is going on? You can also send the data privately to my email address.

GaloGS commented 9 months ago

Thank you very much for your prompt response and your help! I have sent you an e-mail with all the files.

mgalardini commented 9 months ago

Thanks for the files: with pyseer 1.3.11 I get the following output (variants omitted just in case):

$ pyseer --vcf file.vcf.gz --phenotypes pheno.txt --wg enet --save-vars ma_snps --save-model model.lasso --min-af 0.9 --alpha 1
Read 1540 phenotypes
Detected binary phenotype
[E::idx_find_and_load] Could not retrieve index file for 'file.vcf.gz'
Reading all variants
4309variants [00:10, 426.50variants/s]
Saved enet variants as ma_snps.pkl
Applying correlation filtering
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 1534.97variants/s]
Fitting elastic net to top 18 variants
Best penalty (lambda) from cross-validation: 7.60E-03
Best model deviance from cross-validation: 0.995 ± 2.50E-02
Best R^2 from cross-validation: -0.246
Finding and printing selected variants
[E::idx_find_and_load] Could not retrieve index file for 'file.vcf.gz'
variant af      filter-pvalue   lrt-pvalue      beta    notes
[...]
Saved enet model as model.lasso.pkl
4309 loaded variants
4291 pre-filtered variants
18 tested variants
6 printed variants

I had to change your phenotype file because it used whitespace as the delimiter; a tab character is needed.
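A minimal sketch of that conversion in Python, assuming that neither the sample names nor the phenotype values contain spaces (otherwise splitting on whitespace would break a field apart). The names `whitespace_to_tabs` and `convert_file` are hypothetical helpers for illustration, not part of pyseer.

```python
def whitespace_to_tabs(text):
    """Rewrite whitespace-delimited text so fields are separated by one tab.

    Assumption: no field contains a space. Blank lines are dropped.
    """
    out_lines = []
    for line in text.splitlines():
        fields = line.split()  # split on any run of spaces and/or tabs
        if fields:
            out_lines.append("\t".join(fields))
    return "".join(line + "\n" for line in out_lines)


def convert_file(src_path, dst_path):
    """Read a phenotype file and write a tab-delimited copy (hypothetical helper)."""
    with open(src_path) as src:
        converted = whitespace_to_tabs(src.read())
    with open(dst_path, "w") as dst:
        dst.write(converted)
```

For example, `convert_file("pheno.txt", "pheno.tsv")` would write a tab-delimited copy that can then be passed to `--phenotypes`.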

Can you try again with the latest version of pyseer and with an updated phenotype file?

Thanks

GaloGS commented 9 months ago

Dear Marco,

Sorry for the stupid mistake. Given the error message, I did not realize that the problem could be in the phenotype file. I have replaced the spaces with tabs and the program now runs without problems.
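Since the traceback only shows that `len(all_strains)` was zero and says nothing about the phenotype file, a quick pre-flight check can catch this kind of problem before running pyseer. The sketch below is an assumption about what is worth checking, not pyseer's own validation: it reports the non-empty lines of a phenotype file that contain no tab character. The function name `lines_without_tab` is hypothetical.

```python
from typing import List


def lines_without_tab(text: str) -> List[int]:
    """Return 1-based numbers of non-empty lines that contain no tab.

    A line listed here cannot be split into sample and phenotype columns
    by a parser that expects tab-delimited input.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip() and "\t" not in line
    ]
```

Running it on the contents of the phenotype file, for instance `lines_without_tab(open("pheno.txt").read())`, should give an empty list for a correctly tab-delimited file.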

Thank you very much for your help!

Galo