jfear / ncbi_remap

This is the drosSRA project, where we are remapping all Drosophila melanogaster RNA-seq data to FlyBase release 6 and updating annotations.

Read Length Downstream Analysis #48

Closed jfear closed 7 years ago

jfear commented 7 years ago

Story

Read length is an important attribute because longer reads can improve alignment specificity. Read length has increased over time: early studies mostly used reads of ~35 bp, but as sequencing became cheaper and less error-prone, read lengths increased to 50+ bp. The workflow removes reads that are shorter than 25 bp after trimming. Here I look at how read length differs across experiments and identify experiments with short read lengths. Read length can also help infer the sequencing technology: Illumina reads typically range from 36 bp to 150 bp, while other technologies can reach kilobases.

The vast majority of samples (n=24,677) have a read length between 25 and 160 bp, with most samples having a read length of 51 bp (single-end) or 2x95 bp (paired-end). There are 890 samples that are too short (< 25 bp) and 193 samples that are really long (≥ 300 bp).
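
As a rough illustration of the cutoffs in this paragraph, here is a minimal sketch that labels a sample by its read length. The function name `flag_read_length`, the label strings, and the toy values are hypothetical and are not taken from the ncbi_remap workflow; only the two thresholds (< 25 bp and ≥ 300 bp) come from the text.

```python
from collections import Counter

MIN_LEN = 25    # samples below this are called "too short" in the text
LONG_LEN = 300  # samples at or above this are called "really long" in the text


def flag_read_length(length):
    """Return a hypothetical label for a sample's read length in bp."""
    if length < MIN_LEN:
        return "too_short"
    if length >= LONG_LEN:
        return "long"
    return "ok"


# Toy read lengths for illustration only, not real project data.
toy_lengths = [21, 36, 51, 95, 151, 536, 4026]
print(Counter(flag_read_length(x) for x in toy_lengths))
```

Counting the labels this way would give the kind of per-category totals quoted above, assuming one representative read length per sample.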

Output

Questions and Tasks

bin            min     max     median   count
(20, 40]       21      40      36       3,845
(40, 60]       41      60      51       8,066
(60, 80]       61      80      76       5,488
(80, 100]      81      100     95       6,552
(100, 120]     101     120     101      1,382
(120, 140]     123     140     125      221
(140, 160]     141     160     151      239
(160, 300]     164     298     240      344
(300, 2000]    395     1,852   536      171
(2000, inf]    2,097   6,340   4,026    22

Machine                count
454 GS                 1
454 GS 20              4
454 GS FLX             29
454 GS FLX Titanium    39
454 GS FLX+            1
454 GS Junior          250
Illumina MiSeq         19
Ion Torrent PGM        3
MinION                 1
PacBio RS              1
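
A per-bin summary like the read-length table above (min, max, median, count for half-open `(left, right]` bins) can be sketched with the Python standard library. This is only an illustration and not necessarily how the table was generated; the name `summarize` and the toy input are hypothetical, and the bin edges are copied from the table.

```python
from bisect import bisect_left
from statistics import median

# Right edges are inclusive, matching the (left, right] bins in the table.
EDGES = [20, 40, 60, 80, 100, 120, 140, 160, 300, 2000, float("inf")]


def summarize(lengths, edges=EDGES):
    """Group read lengths into (left, right] bins; report min, max, median, count."""
    groups = {}
    for x in lengths:
        # Index of the first edge >= x, which is the right edge of x's bin.
        i = bisect_left(edges, x)
        if i == 0 or i == len(edges):
            continue  # at or below the first edge, or above the last: no bin
        groups.setdefault((edges[i - 1], edges[i]), []).append(x)
    return [
        {"bin": b, "min": min(v), "max": max(v), "median": median(v), "count": len(v)}
        for b, v in sorted(groups.items())
    ]


# Toy read lengths for illustration only, not real project data.
for row in summarize([21, 36, 40, 51, 76, 95, 101, 536]):
    print(row)
```

Because the right edge is inclusive, a read length of exactly 40 falls in `(20, 40]` and 41 falls in `(40, 60]`, which is the convention the table's min and max columns follow.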

Definition of done