Open pettyalex opened 1 year ago
Yes, it would! I have been wanting to add support for VCF, chromosome X, etc, if I get the chance. I will also try to refactor the code as well. The real issue with adding the VCF is the genetic distance data, which is typically not present in that file format. Using a default genetic map file requires some research on my side. Further, doing so would require the refactoring that I am planning to do.
Other tools like hap-ibd tend to use a VCF file plus a genetic map, although you don't actually need to, given that modern phasing tools like shapeit4 output VCFs that include genetic map position both physically and in centimorgans, so for our usage case the VCFs would always have sufficient info to not need a map. Given that hap-ibd uses VCF+map, that seems like a good place to start.
What goals did you have to refactor? My thoughts for adding VCF support were:
filereader
s for every individual, because VCFs have individuals in columns and variants in rows.LSH_Slave::run
across the threadpool as it currently exists.This is why I was looking at making the filereader constructor's parsing faster. While I'm at it, the rest of my wishlist includes:
Corpus::integrate
is a lot of the CPU time, but it has a lock on agg_poiter
the whole time? I haven't dug into this too much, but is there only one agg_pointer
for the whole application? I fully agree with your thought for adding VCF support and compressed I/O. Some of the code was originally implemented based on different design decisions; I never got the chance to refactor them. If I remember correctly, there is only one agg_pointer instance. The idea is to use the I/O queue time we had on our shared cluster to calculate the next batch of things to write.
It'd be easier to use iLash with other tooling if it could directly accept VCF input. At a glance, it looks like it might be possible to do this with only a new filereader implementation.