malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Add plink converter function #515

Open tristanpwdennis opened 3 months ago

tristanpwdennis commented 3 months ago

sanjaynagi commented 3 months ago

Hey Tristan. Nice work!

I'll save comments for now, but FYI: when you add notebooks to malariagen_data, make sure you have cleared all outputs. Otherwise the notebooks can become quite hefty, and the repo balloons over time because every committed version is stored in git history.
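
For anyone who wants to script this, here is a minimal sketch of clearing outputs using only the standard library. It assumes the standard nbformat v4 JSON layout, where each code cell has `outputs` and `execution_count` keys. The function name `clear_notebook_outputs` is made up for illustration and is not part of malariagen_data.

```python
import json


def clear_notebook_outputs(notebook: dict) -> dict:
    """Strip outputs and execution counts from every code cell, in place.

    Hypothetical helper; assumes the nbformat v4 layout
    (a top-level "cells" list of cell dicts).
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results, plots, tracebacks
            cell["execution_count"] = None  # reset the In[n] counter
    return notebook


# A tiny in-memory notebook standing in for a real .ipynb file.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleared = clear_notebook_outputs(example)
# Serialise the way a .ipynb file would be written back to disk.
serialised = json.dumps(cleared, indent=1)
```

In practice the same result is available from the notebook UI (clear all outputs before saving) or from nbconvert's `--clear-output` option, so a custom script like this is only worth it if you want a pre-commit check.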

tristanpwdennis commented 3 months ago

I've found the source of the AssertionError (also see issue #516) - something to do with how dask.array.map_blocks computes variant_allele at line 1629 of snp_data.py.

I haven't managed to get to the bottom of it yet, but this PR includes a temporary fix that applies apply_allele_mapping to an in-memory NumPy array of variant_allele. to_plink.py now also uses biallelic_snp_calls instead of calling snp_calls and thinning them manually.
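
To illustrate the idea behind the temporary fix, here is a rough sketch of remapping alleles on an in-memory array rather than block by block. This is not the code in the PR: `remap_alleles` is a hypothetical stand-in for apply_allele_mapping, and the array shapes and the convention that `-1` in the mapping means "drop this allele" are assumptions made for the example.

```python
import numpy as np


def remap_alleles(variant_allele, mapping, max_allele):
    """Reorder alleles per site according to `mapping` (hypothetical helper).

    variant_allele: (n_sites, n_alleles) array of allele codes.
    mapping: (n_sites, n_alleles) integer array; mapping[i, j] is the new
        index of allele j at site i, or -1 if that allele is dropped.
    max_allele: highest allele index kept in the output.
    """
    # np.asarray materialises the input, so a lazy array would be
    # computed here once instead of being processed chunk by chunk.
    variant_allele = np.asarray(variant_allele)
    mapping = np.asarray(mapping)

    n_sites = variant_allele.shape[0]
    out = np.full((n_sites, max_allele + 1), b"", dtype=variant_allele.dtype)

    # Find every (site, old allele index) pair that is kept, then write
    # each allele into its new column.
    site_idx, old_idx = np.nonzero(mapping >= 0)
    out[site_idx, mapping[site_idx, old_idx]] = variant_allele[site_idx, old_idx]
    return out


# Two sites, four alleles each (reference plus three alternates).
alleles = np.array([[b"A", b"C", b"T", b"G"],
                    [b"G", b"T", b"A", b"C"]], dtype="S1")
# Keep the reference and one alternate per site, giving biallelic sites.
allele_mapping = np.array([[0, -1, 1, -1],
                           [0, 1, -1, -1]])

remapped = remap_alleles(alleles, allele_mapping, max_allele=1)
```

The trade-off is memory: working on the whole array avoids any per-chunk shape mismatch, but it gives up the lazy, chunked evaluation that map_blocks provides, which is why this is only a stopgap.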