VCF Input Support - Githubissues

pettyalex commented 1 year ago

It'd be easier to use iLash with other tooling if it could directly accept VCF input. At a glance, it looks like it might be possible to do this with only a new filereader implementation.

roohy commented 1 year ago

Yes, it would! I have been wanting to add support for VCF, chromosome X, etc, if I get the chance. I will also try to refactor the code as well. The real issue with adding the VCF is the genetic distance data, which is typically not present in that file format. Using a default genetic map file requires some research on my side. Further, doing so would require the refactoring that I am planning to do.

pettyalex commented 1 year ago

Other tools like hap-ibd tend to use a VCF file plus a genetic map, although you don't actually need to, given that modern phasing tools like shapeit4 output VCFs that include genetic map position both physically and in centimorgans, so for our usage case the VCFs would always have sufficient info to not need a map. Given that hap-ibd uses VCF+map, that seems like a good place to start.

What goals did you have to refactor? My thoughts for adding VCF support were:

Make reading file input single-threaded so that one thread can build out filereaders for every individual, because VCFs have individuals in columns and variants in rows.
Do the rest of LSH_Slave::run across the threadpool as it currently exists.

This is why I was looking at making the filereader constructor's parsing faster. While I'm at it, the rest of my wishlist includes:

Compressed I/O
Why is the writing being done across a threadpool? Writing should be I/O blocked and I bet one thread can write fast enough.
Overall thread utilization is pretty low. It looks like Corpus::integrate is a lot of the CPU time, but it has a lock on agg_poiter the whole time? I haven't dug into this too much, but is there only one agg_pointer for the whole application?

roohy commented 1 year ago

I fully agree with your thought for adding VCF support and compressed I/O. Some of the code was originally implemented based on different design decisions; I never got the chance to refactor them. If I remember correctly, there is only one agg_pointer instance. The idea is to use the I/O queue time we had on our shared cluster to calculate the next batch of things to write.

roohy / iLASH

VCF Input Support #5