lh3 / wgsim

Reads simulator

Accounting for genome region specific coverage biases #17

Open AyushSaxena opened 6 years ago

AyushSaxena commented 6 years ago

We have observed in our data (generated on several different Illumina machines and with different library prep methods) that local coverage density varies across the genome, and that it does so predictably across all genotypes. When we compute read coverage in fixed-size bins for any two genotypes, the per-bin coverage of one genotype is correlated with the per-bin coverage of the other. If reads were sampled uniformly at random across the genome, we would expect no such correlation. In the real data, the correlation coefficient also stays the same regardless of bin size.
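To make the comparison above concrete, here is a minimal sketch of a per-bin coverage correlation between two samples. It is an illustration under assumptions, not the exact pipeline behind the numbers in this issue: read start positions are assumed to be already extracted as 0-based integers on a single contig, coverage is approximated by counting read starts per bin, and the function names (`binned_coverage`, `pearson`, `coverage_correlation`) are hypothetical.

```python
import math
import random


def binned_coverage(read_starts, genome_length, bin_size):
    """Count reads whose start position falls in each fixed-size bin."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coverage_correlation(starts_a, starts_b, genome_length, bin_size):
    """Correlate the per-bin read counts of two samples."""
    cov_a = binned_coverage(starts_a, genome_length, bin_size)
    cov_b = binned_coverage(starts_b, genome_length, bin_size)
    return pearson(cov_a, cov_b)


if __name__ == "__main__":
    # Toy data: two samples drawn independently and uniformly, i.e. the
    # "ideal" case in which no systematic correlation is expected.
    rng = random.Random(0)
    genome_length = 1_000_000
    sample_a = [rng.randrange(genome_length) for _ in range(50_000)]
    sample_b = [rng.randrange(genome_length) for _ in range(50_000)]
    for bin_size in (1_000, 10_000, 100_000):
        r = coverage_correlation(sample_a, sample_b, genome_length, bin_size)
        print(bin_size, round(r, 3))
```

Running the same function over several bin sizes, as in the loop at the end, is one way to check how the coefficient changes with bin size for real versus simulated reads.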

Reads produced by wgsim also show this correlation, although the correlation coefficient is smaller and only approaches that of the real data at bin sizes above 100 kb. Is there a way to control this correlation coefficient ourselves?

Ayush