kr-colab / popvae

genotype dimensionality reduction with a VAE

Generator implementation, refactor, documentation #3

Open Lswhiteh opened 4 years ago

Lswhiteh commented 4 years ago

Hey all, I've finished implementing the generators. I also went ahead and refactored the entire codebase so it was a little easier to work with. If you like the changes, that's great; if you just want the generators, they should be pretty simple to drop in without functionalizing everything.
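For readers unfamiliar with the pattern, here is a minimal sketch of a batch generator of the kind described above. It is not the code from this PR: the function name `genotype_batches`, its parameters, and the assumption that genotypes are held in a samples-by-sites NumPy array are all hypothetical.

```python
import numpy as np


def genotype_batches(genotypes, batch_size, shuffle=True, seed=None):
    """Yield (input, target) batches forever for autoencoder-style training.

    Hypothetical helper: `genotypes` is assumed to be a 2D array with one
    row per sample and one column per site.
    """
    rng = np.random.default_rng(seed)
    n_samples = genotypes.shape[0]
    while True:
        # Reshuffle the sample order at the start of every pass over the data.
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            batch = genotypes[idx]
            # An autoencoder reconstructs its input, so the target is the batch itself.
            yield batch, batch
```

Because only one batch is materialized at a time, a generator like this keeps memory use bounded by the batch size rather than by the full dataset. The last batch of each pass may be smaller than `batch_size`.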

Couple notes:

As an aside, do you have some testing data you can run this on to see if it fits your expectations? I ran it on some small test VCFs and some 1000 Genomes data I had, to make sure it was consistent with the master fork results on the same data. It seems to be the same, but you're going to know best on this one.

Sorry for the wall of text. Feel free to keep or drop whatever portions of the change you like, of course. If you have any questions, I'm happy to talk more about it!

cjbattey commented 4 years ago

Wow, this looks like great work -- thanks for doing the deep dive here! That grid search was, as I'm sure you noticed, an... inelegant implementation, so I think any refactor would be a step forward. I'm moving this weekend, so I'll need a couple of weeks to go through it in detail. I'll update when we're somewhat settled in SF.

Lswhiteh commented 4 years ago

No worries at all. I'm squashing some bugs I missed, such as making sure validation data is always supplied even if the validation split is smaller than the batch size. If I end up finding any more I'll shoot you another PR. No rush, obviously.
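To illustrate the kind of guard described above, here is a minimal sketch, not the actual fix from this PR. It assumes the failure mode is a validation set that rounds down to zero samples or zero batches; the function name `split_train_validation` and its parameters are hypothetical.

```python
import math

import numpy as np


def split_train_validation(n_samples, train_prop, batch_size, seed=None):
    """Split sample indices so that validation data is never empty.

    Hypothetical helper: returns (train_idx, val_idx, train_steps, val_steps).
    """
    if n_samples < 2:
        raise ValueError("need at least 2 samples to split")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)

    # Requested validation size; this can round to 0 for small inputs.
    n_val = int(round(n_samples * (1.0 - train_prop)))
    # Clamp: at least one validation sample, at least one training sample.
    n_val = min(max(n_val, 1), n_samples - 1)

    val_idx = order[:n_val]
    train_idx = order[n_val:]

    # Ceiling division counts a final partial batch, so a validation set
    # smaller than batch_size still gives one step instead of zero.
    train_steps = math.ceil(len(train_idx) / batch_size)
    val_steps = math.ceil(len(val_idx) / batch_size)
    return train_idx, val_idx, train_steps, val_steps
```

For example, with 10 samples, `train_prop=0.99`, and `batch_size=32`, the requested validation size rounds to zero, and floor division (`n_val // batch_size`) would give zero validation steps even for a nonempty set. The clamp and the ceiling division together keep one validation sample and one validation step.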

Good luck with the move!

cjbattey commented 4 years ago

Also, here's a link to the set of 100,000 chr1 HGDP SNPs we used in the preprint: https://www.dropbox.com/sh/amqujkodw4ccqjb/AACyOCVpxPEUQLQcPaq9KAgFa?dl=0

andrewkern commented 4 years ago

wow awesome @Lswhiteh!

@cjbattey it would be great if we could shove this SNP set into the repo as a test