rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0

"Solve" the insert size distribution in order to extrapolate the estimate. #174

Open rhpvorderman opened 1 month ago

rhpvorderman commented 1 month ago

Currently only the insert sizes that were actually found are displayed. These follow a regular statistical distribution, and the peak of that distribution is usually visible. Since roughly one half of the distribution is therefore known, it should in theory be possible to solve for its parameters and use them to estimate the tail end of the distribution.
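A minimal sketch of what "solving" for the parameters could look like, assuming the insert sizes are normally distributed and that everything above a known cutoff is simply unobserved. Both assumptions, the function name `fit_truncated_normal`, and the `cutoff` parameter are hypothetical and are not part of sequali. The fit maximizes the likelihood of a right-truncated normal with SciPy:

```python
import numpy as np
from scipy import optimize, stats


def fit_truncated_normal(sizes, counts, cutoff):
    """Estimate (mu, sigma) of a normal distribution from data that is
    only observed up to `cutoff` (hypothetical helper, not sequali API).

    sizes  -- observed insert sizes
    counts -- number of observations for each entry in `sizes`
    cutoff -- largest insert size that could have been observed
    """
    sizes = np.asarray(sizes, dtype=float)
    weights = np.asarray(counts, dtype=float)
    weights = weights / weights.sum()

    def neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # optimizing log(sigma) keeps sigma positive
        log_pdf = stats.norm.logpdf(sizes, mu, sigma)
        # Renormalize by the probability mass below the cutoff, because
        # larger inserts are never seen.
        log_mass = stats.norm.logcdf(cutoff, mu, sigma)
        return -np.sum(weights * (log_pdf - log_mass))

    # Start from the (biased) moments of the visible data.
    mean0 = np.sum(weights * sizes)
    std0 = np.sqrt(np.sum(weights * (sizes - mean0) ** 2))
    result = optimize.minimize(
        neg_log_likelihood,
        x0=[mean0, np.log(std0)],
        method="Nelder-Mead",
    )
    mu, log_sigma = result.x
    return mu, np.exp(log_sigma)
```

With the fitted `mu` and `sigma`, `stats.norm(mu, sigma).sf(cutoff)` would give an estimate of the fraction of inserts hidden in the unobserved tail. Whether a normal distribution is an appropriate model for real insert sizes is an open question here.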

rhpvorderman commented 1 month ago

Ioannis suggested I try a few candidate distributions and compare how well each one fits using the Kullback-Leibler divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
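A rough sketch of how such a comparison might be done, assuming the observed insert sizes are available as a histogram and the candidates are already-parameterized SciPy distributions. The function name `rank_candidates` and the candidate labels are made up for illustration. `scipy.stats.entropy(pk, qk)` computes the Kullback-Leibler divergence and normalizes both inputs, so the candidate densities are implicitly rescaled to the observed size range:

```python
import numpy as np
from scipy import stats


def rank_candidates(sizes, counts, candidates):
    """Rank candidate distributions by KL divergence from the observed
    insert size histogram (hypothetical helper, not sequali API).

    sizes      -- observed insert sizes (histogram bin positions)
    counts     -- number of observations per size
    candidates -- dict mapping a label to a frozen scipy.stats distribution
    Returns a list of (label, divergence) pairs, best fit first.
    """
    sizes = np.asarray(sizes, dtype=float)
    observed = np.asarray(counts, dtype=float)
    scores = {}
    for label, dist in candidates.items():
        # Density of the candidate at the observed sizes only; entropy()
        # renormalizes it, which restricts it to the visible range.
        expected = dist.pdf(sizes)
        scores[label] = float(stats.entropy(observed, expected))
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Synthetic example data: only inserts up to 320 are "visible".
    rng = np.random.default_rng(1)
    samples = np.round(rng.normal(300, 50, 100_000))
    visible = samples[samples <= 320]
    sizes, counts = np.unique(visible, return_counts=True)
    candidates = {
        "normal": stats.norm(300, 50),
        "gamma": stats.gamma(a=9, scale=30),
    }
    for label, divergence in rank_candidates(sizes, counts, candidates):
        print(f"{label}: {divergence:.4f}")
```

A lower divergence means the candidate matches the visible part of the histogram more closely. It says nothing directly about how well the candidate describes the unobserved tail.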

rhpvorderman commented 1 month ago

Apparently the protocol consists of several steps: first the desired median is selected, then the smaller and larger inserts are removed in separate steps. This may affect how well the insert sizes map to a distribution.