Read length is an important attribute because longer reads can improve the specificity of alignment. Read length has increased over time: initially most studies used reads of ~35 bp, but as sequencing became cheaper and less error prone, read lengths increased to 50+ bp. The workflow removes reads that are shorter than 25 bp after trimming. Here I look at how read length differs across experiments and identify experiments with short reads. Read length can also help indicate the sequencing technology used. Illumina reads typically range from 36 bp to 150 bp, while other technologies can produce reads in the kilobase range.
The vast majority of samples (n=24,677) have a read length between 25 and 160 bp, with most samples having a read length of 51 bp (single end) or 2x95 bp (paired end). There are 890 samples that are too short (< 25 bp) and 193 samples that are really long (≥ 300 bp).
Output
Distribution plot of read lengths, separated into single-end and paired-end data. I focus on reads ≤ 160 bp in the plots.
Output table with flags for the following criteria: ../../output/read_length_downstream_analysis
flag_too_short: True if read length < 25 bp
flag_short: True if 25 bp ≤ read length < 45 bp
flag_good: True if 45 bp ≤ read length < 160 bp
flag_long: True if 160 bp ≤ read length < 300 bp
flag_really_long: True if 300 bp ≤ read length
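The flag criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the code that produced the output table: the function name `read_length_flags` and the example sample names and lengths are hypothetical, and it assumes one representative read length per sample.

```python
from typing import Dict


def read_length_flags(read_length: float) -> Dict[str, bool]:
    """Return the five mutually exclusive flags for one read length (in bp)."""
    return {
        "flag_too_short": read_length < 25,
        "flag_short": 25 <= read_length < 45,
        "flag_good": 45 <= read_length < 160,
        "flag_long": 160 <= read_length < 300,
        "flag_really_long": read_length >= 300,
    }


# Hypothetical per-sample read lengths, for illustration only.
example_lengths = {
    "sample_a": 21,
    "sample_b": 36,
    "sample_c": 51,
    "sample_d": 240,
    "sample_e": 536,
}

flags = {name: read_length_flags(length) for name, length in example_lengths.items()}

for name, sample_flags in flags.items():
    # Exactly one flag is set per sample because the ranges do not overlap.
    active = [flag for flag, is_set in sample_flags.items() if is_set]
    print(name, example_lengths[name], active)
```

Because the ranges are half-open, a read length of exactly 160 bp is flagged as long rather than good.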
Questions and Tasks
[x] What is the distribution of read length?
Of the 27,138 samples, the majority (n=24,677) have read lengths between 25 and 160 bp. The most frequent read length was 51 bp in single-end data and 95 bp in paired-end data.
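| bin | min | max | median | count |
| --- | --- | --- | --- | --- |
| (20, 40] | 21 | 40 | 36 | 3,845 |
| (40, 60] | 41 | 60 | 51 | 8,066 |
| (60, 80] | 61 | 80 | 76 | 5,488 |
| (80, 100] | 81 | 100 | 95 | 6,552 |
| (100, 120] | 101 | 120 | 101 | 1,382 |
| (120, 140] | 123 | 140 | 125 | 221 |
| (140, 160] | 141 | 160 | 151 | 239 |
| (160, 300] | 164 | 298 | 240 | 344 |
| (300, 2000] | 395 | 1,852 | 536 | 171 |
| (2000, inf] | 2,097 | 6,340 | 4,026 | 22 |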
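A summary like the one above can be produced by assigning each read length to a right-closed bin and then aggregating per bin. The sketch below uses only the Python standard library and assumes the bin edges implied by the table; the function name `summarize_by_bin` and the input values are hypothetical, and this is not necessarily how the original table was computed.

```python
import bisect
import math
from collections import defaultdict
from statistics import median

# Bin edges assumed from the table; each bin is right-closed: (left, right].
EDGES = [20, 40, 60, 80, 100, 120, 140, 160, 300, 2000, math.inf]


def summarize_by_bin(read_lengths):
    """Group read lengths into (left, right] bins and summarize each bin."""
    grouped = defaultdict(list)
    for length in read_lengths:
        # bisect_left returns i such that EDGES[i - 1] < length <= EDGES[i].
        i = bisect.bisect_left(EDGES, length)
        if i == 0 or i == len(EDGES):
            continue  # at or below the lowest edge, or not comparable
        grouped[i].append(length)

    rows = []
    for i in sorted(grouped):
        values = grouped[i]
        rows.append({
            "bin": f"({EDGES[i - 1]}, {EDGES[i]}]",
            "min": min(values),
            "max": max(values),
            "median": median(values),
            "count": len(values),
        })
    return rows


# Hypothetical read lengths, for illustration only.
rows = summarize_by_bin([21, 36, 40, 41, 51, 95, 101, 2097])
for row in rows:
    print(row)
```

Right-closed bins mean a read length of exactly 40 bp falls in (20, 40], not (40, 60]. Only bins that contain at least one value appear in the result.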
[x] Are there experiments with read lengths <25 bp?
Yes, there are 890 experiments with read lengths less than 25 bp.
[x] Are there experiments with read lengths >160bp?
Yes, and these samples come from a range of sequencing machines, mostly non-Illumina platforms:
| Machine | count |
| --- | --- |
| 454 GS | 1 |
| 454 GS 20 | 4 |
| 454 GS FLX | 29 |
| 454 GS FLX Titanium | 39 |
| 454 GS FLX+ | 1 |
| 454 GS Junior | 250 |
| Illumina MiSeq | 19 |
| Ion Torrent PGM | 3 |
| MinION | 1 |
| PacBio RS | 1 |
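The machine counts above amount to filtering samples by read length and tallying the instrument model. A minimal sketch, assuming each sample is a (machine, read length) pair; the function name `machines_above` and the sample records are hypothetical and are not taken from the actual metadata.

```python
from collections import Counter


def machines_above(samples, threshold=160):
    """Count sequencing machines among samples whose read length exceeds threshold (bp)."""
    return Counter(machine for machine, length in samples if length > threshold)


# Hypothetical (machine, read length) records, for illustration only.
samples = [
    ("Illumina HiSeq 2000", 51),
    ("454 GS FLX Titanium", 395),
    ("Illumina MiSeq", 250),
    ("454 GS FLX Titanium", 536),
    ("PacBio RS", 4026),
]

long_read_machines = machines_above(samples)
for machine, count in sorted(long_read_machines.items()):
    print(machine, count)
```

Machines that only appear at or below the threshold are absent from the result rather than listed with a zero count.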
Definition of done
[x] Plot showing the distribution of read lengths across experiments.
[x] Table of flags for short and long read lengths.