nch-igm / snvstory

Rapid and accurate ancestry inference using SNVs.
BSD 3-Clause "New" or "Revised" License

sites to extract for input wgs vcf #17

Open aymanm opened 3 months ago

aymanm commented 3 months ago

Thanks for your great work. I have been testing this tool over the last couple of days and am wondering whether there is an optimal set of sites to select/subset for the input VCF. A large VCF requires a lot of memory, so a minimal input VCF that contains only the most informative markers would be great. I would appreciate your advice on this. I believe another user asked a similar question in the issues.

Thanks again.

andreirajkovic commented 3 months ago

Hi @aymanm! That is an interesting idea, and one I don't believe we've explored thoroughly (@audrey-bollas, correct me if I'm wrong). I think you could technically pull this off, but you'd need to compute feature importance on the training data, take the top n features (e.g. 100 variants) for each population, and filter your VCFs down to those variants. The model would likely still perform well, especially with WGS data. With WES you might lose some accuracy, as you'd be at the mercy of the WES kit having probes that cover those variants.
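
A rough sketch of what that filtering step could look like, assuming you already have per-population importance scores keyed by variant. The `importance` mapping, the `(chrom, pos, ref, alt)` key, and both function names are hypothetical and are not part of snvstory; chromosome naming (`1` vs `chr1`) would also need to match your VCF. The filter streams the file line by line, so it never holds the whole VCF in memory.

```python
import heapq
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt)
Variant = Tuple[str, int, str, str]


def top_n_variants(importance: Dict[str, Dict[Variant, float]], n: int) -> Set[Variant]:
    """Return the union of the n highest-scoring variants of each population."""
    keep: Set[Variant] = set()
    for scores in importance.values():
        keep.update(heapq.nlargest(n, scores, key=scores.get))
    return keep


def filter_vcf_lines(lines: Iterable[str], keep: Set[Variant]) -> Iterator[str]:
    """Yield header lines plus records whose site is in `keep`, one line at a time."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
        # A multi-allelic record is kept if any of its ALT alleles was selected.
        if any((chrom, int(pos), ref, a) in keep for a in alt.split(",")):
            yield line
```

To use it on a compressed file, you could pass `gzip.open(path, "rt")` as `lines` and write the yielded lines to a new file. Whether the model stays accurate on the reduced site list is untested here and would need to be checked against held-out samples.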

aymanm commented 3 months ago

Thanks for the clarification. I might give this a try, and I'll submit a pull request if I am successful.

RaviBot commented 3 weeks ago

Hello @andreirajkovic! I have a similar question to @aymanm's. I have a very large VCF (914 GB) and was wondering whether there is a suggested course of action for a file this size. My system has 126 GB of memory, and I was still not able to get it to run.

Thanks in advance!