odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License

Reading in files and genetic maps is done sequentially, and could be parallelized across threads #79

Open pettyalex opened 2 years ago

pettyalex commented 2 years ago

I had a malformed genetic map file, and while diagnosing and fixing it I noticed that shapeit4 currently reads the input files first and only then reads the genetic map. Loading the input files took a substantial amount of time, so it was a long while before shapeit4 reached the genetic map and I could hit the error, notice it, and diagnose it. From a look at the code, these two operations appear fully independent and could run simultaneously on separate threads, which would make pre-phasing initialization much faster.

The only major complication that I see would be keeping the logging output ordered. The file reads themselves go into fully independent structures, so as far as I can tell there is no read or write contention to worry about.
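To make the idea concrete, here is a minimal sketch of what I have in mind. It is not taken from the shapeit4 source: `read_file`, `ReadResult`, `Inputs` and `load_inputs` are hypothetical names, and the "reader" just counts lines as a stand-in for the real parsers. Each reader writes its log message into its own buffer rather than printing directly, so the caller can emit the messages in a fixed order regardless of which read finishes first.

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical result type: what a reader produced plus its buffered log text.
struct ReadResult {
    std::size_t records = 0;
    std::string log;
};

// Hypothetical stand-in for a real parser: counts the lines in a file.
// Throws if the file cannot be opened.
ReadResult read_file(const std::string& label, const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    ReadResult r;
    std::string line;
    while (std::getline(in, line)) ++r.records;
    // Buffer the message instead of printing, so output order stays under
    // the caller's control.
    r.log = label + ": " + std::to_string(r.records) + " lines\n";
    return r;
}

struct Inputs {
    ReadResult genotypes;
    ReadResult gmap;
};

// Reads the genetic map on a second thread while the calling thread reads
// the genotype file. The two reads share no state.
Inputs load_inputs(const std::string& geno_path, const std::string& map_path) {
    auto map_task = std::async(std::launch::async, read_file,
                               "genetic map", map_path);
    ReadResult geno = read_file("genotypes", geno_path);
    // get() waits for the map task and rethrows any exception it raised.
    ReadResult gmap = map_task.get();
    return {geno, gmap};
}
```

The caller would then print `genotypes.log` followed by `gmap.log` to keep the existing output order. One caveat with this sketch: an error in the map task only surfaces at `get()`, after the genotype read has finished, so failing fast on a malformed map would additionally need an earlier check of the future (or simply reading the map first).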

odelaneau commented 2 years ago

Hi,

Thanks for the suggestion. However, multi-threaded I/O is usually not that efficient, especially when reading from the same disk...

Best,

O