Read Length Downstream Analysis

Story

Read length is an important attribute because longer reads can improve the specificity of alignment. Read length has increased over time: initially most studies used a read length of ~35 bp, but as sequencing became cheaper and less error prone, read lengths increased to 50+ bp. The workflow removes reads that are shorter than 25 bp after trimming. Here I look at how read length differs across experiments and identify experiments with short read lengths. Read length can also help indicate which sequencing technology was used: Illumina reads typically range from 36 bp to 150 bp, while other technologies can reach kilobases.

The vast majority of samples (n=24,677) have a read length between 25 and 160 bp, with most samples having a read length of 51 bp (single end) or 2x95 bp (pair end). There are 890 samples that are too short (< 25 bp) and 193 samples that are really long (≥ 300 bp).

Output

Distribution plot of read lengths separated by Single End and Pair End data. I focus on reads ≤ 160 bp for plots.
Output table of flags, one per read length criterion: ../../output/read_length_downstream_analysis
- flag_too_short True if the read length < 25 bp
- flag_short True if 25 bp ≤ read length < 45 bp
- flag_good True if 45 bp ≤ read length < 160 bp
- flag_long True if 160 bp ≤ read length < 300 bp
- flag_really_long True if 300 bp ≤ read length
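
The flag criteria above can be sketched as a small function. This is a minimal illustration only, not the code used to build the output table: the function name `read_length_flags` is hypothetical, and it assumes one read length value (in bp) per sample.

```python
def read_length_flags(read_length):
    """Return the five read length flags for one sample.

    Hypothetical helper; thresholds are copied from the flag list above.
    The intervals are half-open, so exactly one flag is True per value.
    """
    return {
        "flag_too_short": read_length < 25,
        "flag_short": 25 <= read_length < 45,
        "flag_good": 45 <= read_length < 160,
        "flag_long": 160 <= read_length < 300,
        "flag_really_long": read_length >= 300,
    }


if __name__ == "__main__":
    # Example: a 51 bp sample should only set flag_good.
    for name, value in read_length_flags(51).items():
        print(name, value)
```

Because the lower bound of each interval is inclusive, a sample at exactly 160 bp is flagged as long rather than good.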

Questions and Tasks

[x] What is the distribution of read length?

Of the 27,138 samples, the majority (n=24,677) have read lengths between 25 and 160 bp. The most frequent read length in Single End data was 51 bp, while the most frequent read length in Pair End data was 95 bp.

bin	min	max	median	count
(20, 40]	21	40	36	3,845
(40, 60]	41	60	51	8,066
(60, 80]	61	80	76	5,488
(80, 100]	81	100	95	6,552
(100, 120]	101	120	101	1,382
(120, 140]	123	140	125	221
(140, 160]	141	160	151	239
(160, 300]	164	298	240	344
(300, 2000]	395	1,852	536	171
(2000, inf]	2,097	6,340	4,026	22
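
A summary like the table above could be produced roughly as follows. This is a sketch under assumptions, not the original analysis code: the function name `summarize_bins` is hypothetical, the bin edges are read off the table, and the input is assumed to be a plain list of per-sample read lengths.

```python
from bisect import bisect_left
from statistics import median

# Right-closed bin edges taken from the table: (20, 40], (40, 60], ...
EDGES = [20, 40, 60, 80, 100, 120, 140, 160, 300, 2000, float("inf")]


def summarize_bins(read_lengths, edges=EDGES):
    """Group read lengths into (lo, hi] bins and summarize each bin."""
    groups = {}
    for length in read_lengths:
        # bisect_left puts a value equal to an edge into the bin that
        # ends at that edge, which gives right-closed intervals.
        i = bisect_left(edges, length)
        if i == 0 or i == len(edges):
            continue  # value falls outside every bin (e.g. <= 20)
        groups.setdefault((edges[i - 1], edges[i]), []).append(length)

    rows = []
    for lo, hi in sorted(groups):
        values = groups[(lo, hi)]
        rows.append({
            "bin": f"({lo}, {hi}]",
            "min": min(values),
            "max": max(values),
            "median": median(values),
            "count": len(values),
        })
    return rows


if __name__ == "__main__":
    # Made-up read lengths, for illustration only.
    for row in summarize_bins([21, 36, 40, 41, 51, 60, 95]):
        print(row)
```

Empty bins are simply omitted from the result, and values at or below the first edge are skipped rather than reported.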

[x] Are there experiments with read lengths <25 bp?

Yes, there are 890 experiments with read lengths less than 25 bp.

[x] Are there experiments with read lengths >160bp?

Yes, and these samples use a range of sequencing technologies, mostly non-Illumina platforms:

Machine	count
454 GS	1
454 GS 20	4
454 GS FLX	29
454 GS FLX Titanium	39
454 GS FLX+	1
454 GS Junior	250
Illumina MiSeq	19
Ion Torrent PGM	3
MinION	1
PacBio RS	1
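
A per-machine tally of long-read samples, like the one above, can be sketched with `collections.Counter`. The function name `count_long_read_machines` and the `(machine, read_length)` input layout are assumptions for illustration, and the sample records below are made up.

```python
from collections import Counter


def count_long_read_machines(samples, threshold=160):
    """Count samples per machine whose read length exceeds threshold (bp).

    `samples` is assumed to be an iterable of (machine, read_length) pairs.
    The comparison is strict, matching the question "read lengths >160bp".
    """
    return Counter(
        machine for machine, read_length in samples if read_length > threshold
    )


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        ("454 GS Junior", 400),
        ("Illumina MiSeq", 250),
        ("Illumina HiSeq 2000", 51),
        ("454 GS Junior", 520),
    ]
    for machine, count in sorted(count_long_read_machines(example).items()):
        print(machine, count)
```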

Definition of done

[x] Plot showing the distribution of read lengths across experiments.
[x] Table of flags for short and long read lengths.
