jfear / ncbi_remap

This is the drosSRA project, where we are remapping all Drosophila melanogaster RNA-seq data to FlyBase release 6 and updating annotations.

Estimate Effective Genome Size #53

Closed jfear closed 7 years ago

jfear commented 7 years ago

Story

When building BigWigs, I would like to use the normalize to 1x feature in deepTools. This feature needs an estimate of the effective genome size for dm6. Their estimate for dm3 is 121,400,000. I expect that dm6 is slightly larger than this. deepTools offers a few suggestions for estimating effective genome size, including:

Use bamCoverage: If you have a sample where you expect the genome to be covered completely, e.g. from genome sequencing, a very trivial solution is to use bamCoverage with a bin size of 1 bp and the --outFileFormat option set to ‘bedgraph’. You can then count the number of non-Zero bins (bases) which will indicate the mappable genome size for this specific sample.

I have identified a set of samples that are WGS. By aligning and merging these samples I hope to get a reasonable estimate of effective genome size for dm6.
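The counting step in the deepTools suggestion can be sketched as below. This is a minimal illustration, not the code used in this project: the function name `count_covered_bases` and the toy intervals are made up for the example. It assumes a standard four-column bedGraph (chrom, start, end, value) and sums interval lengths rather than counting lines, since adjacent bins with the same value may be merged into one interval.

```python
import io


def count_covered_bases(bedgraph_lines):
    """Sum the lengths of bedGraph intervals whose coverage value is non-zero.

    Hypothetical helper for illustration; expects lines of the form
    chrom<TAB>start<TAB>end<TAB>value with 0-based, half-open intervals.
    """
    covered = 0
    for line in bedgraph_lines:
        line = line.strip()
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        _chrom, start, end, value = line.split()[:4]
        if float(value) != 0:
            covered += int(end) - int(start)
    return covered


# Toy bedGraph (made-up numbers): 15 covered bases on chr2L, 5 on chr3R.
example = io.StringIO(
    "chr2L\t0\t10\t0\n"
    "chr2L\t10\t25\t3\n"
    "chr2L\t25\t30\t0\n"
    "chr3R\t0\t5\t1\n"
)
print(count_covered_bases(example))  # prints 20
```

For a real file, the same function can be fed an open file handle, e.g. `with open(path) as fh: count_covered_bases(fh)`, where `path` points at the bedGraph written by bamCoverage.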

Questions and Tasks

Definition of done

Summary

I ended up randomly selecting 30 WGS samples for this estimate. I had started with all WGS samples, but this was taking too long and I felt it would not add precision. One thing to remember is that effective genome size depends on the aligner used and on the read length. The selected samples have read lengths ranging from 35 bp to 150 bp, and I used settings similar to those in the alignment workflow. For the effective genome size estimate I excluded scaffolds because I felt they just make the problem a little more complicated. For the normalization in the workflow I am excluding the sex chromosomes (chrX included) because they could be variable between sexes.
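The two selection steps in this summary (drawing 30 samples at random and dropping scaffolds) could look roughly like the sketch below. Everything named here is an assumption for illustration: the helper names, the fixed seed, the placeholder sample IDs, and the `keep` set of chromosome names, which must match whatever naming the reference actually uses.

```python
import random


def pick_samples(sample_ids, n=30, seed=42):
    """Draw n sample IDs reproducibly (seed value is arbitrary)."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on input order.
    return sorted(rng.sample(sorted(sample_ids), n))


def keep_chromosomes(bedgraph_lines, keep):
    """Yield only bedGraph lines whose chromosome name is in `keep`."""
    for line in bedgraph_lines:
        fields = line.split()
        if fields and fields[0] in keep:
            yield line


# Placeholder IDs; real ones would come from the list of WGS samples.
wgs_samples = [f"sample_{i:03d}" for i in range(200)]
selected = pick_samples(wgs_samples)
print(len(selected))  # prints 30

# Assumed set of main chromosome names; scaffolds are dropped by omission.
main_chroms = {"chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX", "chrY"}
lines = ["chr2L\t0\t10\t2", "some_scaffold\t0\t50\t7"]
print(list(keep_chromosomes(lines, main_chroms)))
```

Filtering by an explicit allow-list rather than a pattern for scaffold names avoids guessing how scaffolds are labeled in a given assembly.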