aymanm opened this issue 3 months ago
Hi @aymanm! That is an interesting idea, and one I don't believe we've explored thoroughly (@audrey-bollas correct me if I'm wrong). I think you could technically pull this off, but you'd need to compute feature importance on the training data, take the top-n features (e.g. 100 variants) for each population, and filter your VCFs down to those variants. The model would likely still be performant, especially with WGS data. With WES you might lose some accuracy, since you'd be at the mercy of the WES kit having probes that cover those variants.
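To make the filtering step concrete, here is a minimal sketch of how you might rank variants by a model's feature-importance scores and emit a regions file for `bcftools view -R` to subset the VCF. The `top_n_regions` helper, the variant coordinates, and the importance values are all hypothetical; in practice the scores would come from your trained model (e.g. a tree ensemble's per-feature importances).

```python
# Sketch: keep only the top-n most important variants and write a
# "CHROM\tPOS" regions file that bcftools can use to subset a VCF,
# e.g. `bcftools view -R regions.txt input.vcf.gz`.
# All names and values below are illustrative, not from the tool itself.

def top_n_regions(variants, importances, n=100):
    """variants: list of (chrom, pos) tuples; importances: parallel list
    of scores. Returns the n highest-scoring variants, coordinate-sorted."""
    ranked = sorted(zip(importances, variants), reverse=True)[:n]
    return sorted(v for _, v in ranked)

# Hypothetical example: 5 variants, keep the top 3 by importance.
variants = [("chr1", 1000), ("chr1", 2000), ("chr2", 500),
            ("chr2", 900), ("chr3", 42)]
importances = [0.05, 0.40, 0.10, 0.30, 0.15]

regions = top_n_regions(variants, importances, n=3)

# One tab-separated "CHROM POS" line per kept variant, as -R expects.
with open("regions.txt", "w") as fh:
    for chrom, pos in regions:
        fh.write(f"{chrom}\t{pos}\n")

print(regions)  # → [('chr1', 2000), ('chr2', 900), ('chr3', 42)]
```

You would repeat this per population (or take the union of each population's top-n set) before filtering, so that no population's informative markers are dropped.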
Thanks for the clarification. I might give this a try; I'll submit a pull request if I am successful.
Hello @andreirajkovic! I had a similar question to @aymanm. I have a very large VCF (914 GB) and was wondering whether there is a suggested course of action for this. I have a system with 126 GB of memory and still was not able to get it to run.
Thanks in advance!
Thanks for your great work. I have been testing this tool for the last couple of days and am wondering whether there are optimal sites to select/subset for the input VCF. A large VCF requires a lot of memory, so a minimal input VCF containing only optimal markers would be great. I would appreciate your advice on this. I believe another user asked a similar question in the issues.
Thanks again!