hardingnj / xpclr

Code to compute the XP-CLR statistic to infer natural selection
MIT License
85 stars 26 forks

A few typo fixes and adding GDist to VCF loading #80

Closed James-S-Santangelo closed 2 years ago

James-S-Santangelo commented 2 years ago

Hey Nick,

Here is a quick PR with a few minor tweaks:

  1. I fixed a typo when loading variants using the h5py module. It was previously trying to import hdf5, which does not exist. (mentioned in #49)
  2. I fixed a small typo when loading the genetic distance info for Zarr databases, where it was trying to load the genetic distance from an h5py database rather than a Zarr database. I suspect this was just a minor copy-pasting bug.
  3. The biggest change allows genetic distance information to be used when loading variants in VCF format (see #71). I added a couple of snippets that load the gdist info as a single-column, sorted index array using scikit-allel. At present, this might throw a warning if GDist is None, since scikit-allel can't find it in the VCF, but this doesn't seem to influence downstream processing.
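
For readers following along, here is a minimal sketch of how item 3 could look. It is not the code from this PR: the helper names (vcf_fields, split_callset, load_vcf_positions_and_gdist) are hypothetical, and the INFO field name "GDist" is only a default. The one real library call is allel.read_vcf with its fields and region arguments. Unlike the behaviour described above, this sketch leaves the field out of the request when no gdist key is given, which is one assumed way to avoid the warning.

```python
def vcf_fields(gdist_key=None):
    """Build the list of fields to request from scikit-allel.

    gdist_key is the name of the INFO field holding genetic distance
    (hypothetical default elsewhere: "GDist"); None means "not provided".
    """
    fields = ["variants/POS"]
    if gdist_key is not None:
        fields.append(f"variants/{gdist_key}")
    return fields


def split_callset(callset, gdist_key=None):
    """Pull positions and (optionally) genetic distance out of a callset dict.

    Assumes the callset is a dict and at least one variant matched.
    """
    positions = callset["variants/POS"]
    gdist = None
    if gdist_key is not None:
        gdist = callset.get(f"variants/{gdist_key}")
    return positions, gdist


def load_vcf_positions_and_gdist(vcf_path, chrom, gdist_key=None):
    """Hypothetical wrapper around allel.read_vcf for a single chromosome."""
    import allel  # scikit-allel, imported lazily so the helpers above have no dependencies

    callset = allel.read_vcf(vcf_path, region=chrom, fields=vcf_fields(gdist_key))
    return split_callset(callset, gdist_key)
```

With a key supplied, load_vcf_positions_and_gdist("calls.vcf.gz", "2L", gdist_key="GDist") would return both arrays; without one, the second value is simply None.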

Hope this helps!

James

James-S-Santangelo commented 2 years ago

I pushed a couple changes:

  1. Switching to f-strings, as suggested
  2. A potential solution to cleanly ignore gdist in population two. Basically, just reassign a single variable (gdist) rather than creating gdist1 and gdist2 and having to ignore one of them.

hardingnj commented 2 years ago

A few bits I have to tidy up, but better to merge now rather than lose momentum.

Thanks so much for the effort here @James-S-Santangelo .

flyingicedragon commented 1 year ago

3. At present, this might throw a warning if GDist is None since scikit-allel can't find it in the VCF, but this doesn't seem to influence downstream processing.

The latest scikit-allel on PyPI is 1.3.5. However, it returns None after 1.12.0.

Changed in version 1.12.0: Now returns None if no variants are found in the VCF file or matching the requested region.

So if we install the requirements from requirements.txt, genetic_dist will not be a list full of empty strings.

I think this is the cause of #88 and #83.
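
A minimal sketch of one way to guard against this, assuming the callset comes from allel.read_vcf. The helper name require_variants is hypothetical and not part of xpclr or scikit-allel; it only turns the None return value into an explicit error before any downstream code tries to index it.

```python
def require_variants(callset, vcf_path, region=None):
    """Fail early with a clear message when no variants were read.

    callset is whatever the VCF reader returned: a dict of arrays,
    or None when nothing matched the file or region.
    """
    if callset is None:
        where = f" in region {region!r}" if region else ""
        raise ValueError(f"No variants found in {vcf_path}{where}")
    return callset
```

A call site might then read: callset = require_variants(allel.read_vcf(path, region=chrom), path, chrom).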