Lswhiteh opened this issue 4 years ago
Wow, this looks like great work -- thanks for doing the deep dive here! That grid search was, as I'm sure you noticed, an... inelegant implementation, so I think any refactor would be a step forward. I'm moving this weekend, so I'll need a couple of weeks to go through it in detail. Will update when we're somewhat settled in SF.
No worries at all. I'm squashing some bugs I missed, stuff like making sure validation data is always supplied even when the split fraction yields fewer samples than a single batch (rough sketch below). If I end up finding any more I'll shoot you another PR, no rush obviously.
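The guard amounts to something like this (a minimal sketch; the function and variable names here are illustrative, not the actual code in the PR):

```python
import numpy as np

def n_validation_samples(n_samples, val_split, batch_size):
    # never let the validation set shrink below one full batch,
    # so the model always receives validation data during fitting
    return max(batch_size, int(np.floor(n_samples * val_split)))
```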
Good luck with the move!
Also, here's a link to the set of 100,000 chr1 HGDP SNPs we used in the preprint: https://www.dropbox.com/sh/amqujkodw4ccqjb/AACyOCVpxPEUQLQcPaq9KAgFa?dl=0
wow awesome @Lswhiteh!
@cjbattey it would be great if we could shove this SNP set into the repo as a test
Hey all, I've finished implementing the generators. I went ahead and refactored the entire codebase so I could work with it a little more easily; if you like the changes, great, and if you just want the generators, they should be pretty simple to drop in without functionalizing everything.
A couple of notes:
I'm currently running on about half a million SNPs, ramping up testing as I go with the default grid-search values. I've limited it to one GPU and allowed GPU memory growth, along with clearing the GPU environment (which was already implemented); together those should do the trick, but I'll keep testing (setup sketched below). If that doesn't solve the tensor problem, I'm not sure there's much more I can do about it, but this at least frees up more VRAM than before, since the entire dataset is no longer held in memory.
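For anyone following along, the GPU setup amounts to roughly this (a minimal sketch assuming TensorFlow 2.x; the actual PR code may differ):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # pin the process to a single GPU
    tf.config.set_visible_devices(gpus[0], "GPU")
    # let TensorFlow allocate VRAM on demand instead of grabbing it all up front
    tf.config.experimental.set_memory_growth(gpus[0], True)
```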
The current implementation of the generators relies on writing the train/test partitions to separate HDF5 files so I can keep the file stream open during the entire generation process. It's a little clunky, since the stream gets closed outside the scope where it's opened, but it works and I can't think of a better way off the top of my head. The shape of it is sketched below.
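Roughly, the generator looks like this (a sketch rather than the PR code verbatim; the class name and the "genotypes" dataset key are placeholders):

```python
import h5py
import numpy as np
import tensorflow as tf

class HDF5Generator(tf.keras.utils.Sequence):
    """Streams batches from an HDF5 file that stays open for the whole run."""

    def __init__(self, h5_path, batch_size, dataset="genotypes"):
        self.h5 = h5py.File(h5_path, "r")  # opened here, closed by the caller
        self.data = self.h5[dataset]
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(self.data.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch, batch  # autoencoder-style: input doubles as target

    def close(self):
        # has to be called outside the generator's own scope, hence the clunkiness
        self.h5.close()
```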
Along with the above, the major memory bottleneck will still be converting the VCF into arrays at the start. I've messed around with converting straight to HDF5 using allel's vcf_to_hdf5() function (example below), but its parameterization is pretty odd and I'm not used to working with VCF files. If this were implemented, you could theoretically run this on a very low-memory system, since the VCF conversion is now the main non-VRAM bottleneck.
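For the record, the call I was experimenting with looks something like this (paths and field names are illustrative; scikit-allel streams the file chunk by chunk, so the full genotype matrix never sits in RAM):

```python
import allel

# convert straight from VCF to HDF5 without building the arrays in memory;
# "input.vcf.gz" and the field list are placeholders for the real inputs
allel.vcf_to_hdf5(
    "input.vcf.gz",
    "genotypes.h5",
    fields=["calldata/GT", "variants/POS", "samples"],
    overwrite=True,
)
```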
I updated the grid search to be more extensible; it now uses a dict-style structure that I've had good luck with in the past (sketched below). It's a pretty simple change, but I noticed you wanted to search over more than two parameters, and it should now be trivial to add more by just adding them to the dict and updating the model-building function.
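The structure is basically this (the hyperparameter names and values are made up for illustration, as is the build_model hook):

```python
from itertools import product

# adding a new hyperparameter to the search is just a new dict entry
param_grid = {
    "latent_dim": [2, 8],
    "width": [64, 128, 256],
    "depth": [3, 6],
}

keys = list(param_grid)
for combo in product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, combo))
    # model = build_model(**params)  # hypothetical model-building hook
    print(params)
```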
I haven't tested the PCA or plotting yet, but I'm going to assume they're good to go, since I didn't change any of the output methodology, just the way of getting there.
As an aside, do you have some test data you can run this on to see whether it matches your expectations? I ran it on some small test VCFs and some 1000 Genomes data I had, to make sure the results were consistent with the master branch on the same data, and they seem to be, but you'll know best on this one.
Sorry for the wall of text; feel free to keep or drop whatever portions of the change you like or don't, of course. If you have any questions I'm happy to talk more about it!