Sequencing technology has changed a lot in the past 5 years. There has been a substantial increase in the number of reads produced by a single sequencing run, and an increase in the number of bases that can be sequenced. During the pre-pre-alignment workflow I estimate library size and calculate the average read length. I would like to remove overtly low-quality samples that have a small number of reads or reads that are very short.
Here I look at these distributions and I can see that most samples have more than 1 million reads. There is a small number of samples (~2k) that have fewer than 100,000 reads. I have decided to flag these samples as having a low library size. If I further summarize to SRX, there are ~400 experiments with fewer than 100k total reads.
In addition, I looked at the average read length and found that most samples have a read length of either 50 bp or 75-100 bp. There are ~700 samples that have an average read length of less than 30 bp. I have flagged these samples as having a low read length. Note that the alignment pipeline throws out individual reads that are ≤20 bp.
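The flagging described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the run IDs and values are invented, the function name `flag_samples` is hypothetical, and the strict `<` comparison follows the stated cutoffs (whether a value exactly at the cutoff is flagged is an assumption).

```python
import pandas as pd

LIBSIZE_CUTOFF = 100_000  # reads
READ_LEN_CUTOFF = 30      # bp


def flag_samples(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the two boolean flag columns added.

    Expects a `libsize` column (number of reads) and a `len` column
    (average read length in bp). Strict `<` is an assumption.
    """
    out = df.copy()
    out["flag_low_libsize"] = out["libsize"] < LIBSIZE_CUTOFF
    out["flag_short_read_len"] = out["len"] < READ_LEN_CUTOFF
    return out


# Toy data: IDs and values are invented for illustration only.
toy = pd.DataFrame(
    {"libsize": [5_000, 2_500_000, 1_200_000], "len": [50.0, 75.0, 12.5]},
    index=pd.Index(["SRR_a", "SRR_b", "SRR_c"], name="srr"),
)
flagged = flag_samples(toy)
print(flagged)
```

Keeping the flags as separate boolean columns, rather than dropping rows immediately, leaves the decision about which samples to exclude to the downstream analysis.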
Questions and Tasks
[x] Plot the distribution of library sizes by SRR
[x] Develop a library size cutoff.
[x] How many SRRs fall below our library size cutoff?
2,102 runs have fewer than 100,000 reads.
SRR1992316 had the smallest library size (73 reads).
[x] How many SRRs fall below our read length cutoff?
697 runs had an average read length of less than 30 bp.
ERR297221 had the smallest average read length (~10.02 bp).
[x] If we combine libraries within an SRX does that increase the number of samples that reach our cutoff?
There are 369 SRX that have fewer than 100,000 total reads.
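Combining libraries within an SRX amounts to summing the run-level library sizes per experiment before applying the cutoff. The sketch below assumes a run-level table with `srx`, `srr`, and `libsize` columns; the column layout, IDs, and values are all invented for illustration.

```python
import pandas as pd

LIBSIZE_CUTOFF = 100_000  # reads

# Toy run-level table: one row per SRR, with its parent SRX.
runs = pd.DataFrame(
    {
        "srx": ["SRX_1", "SRX_1", "SRX_2", "SRX_3"],
        "srr": ["SRR_a", "SRR_b", "SRR_c", "SRR_d"],
        "libsize": [60_000, 70_000, 40_000, 3_000_000],
    }
)

# Total reads per experiment (sum over all runs in the SRX).
srx_libsize = runs.groupby("srx")["libsize"].sum()

# Experiments that stay below the cutoff even after combining runs.
srx_low = srx_libsize[srx_libsize < LIBSIZE_CUTOFF]
print(srx_low)
```

In this toy example both runs of `SRX_1` fall below the cutoff on their own, but their combined total does not, so only `SRX_2` remains flagged at the experiment level.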
Definition of done
[x] Plot of the distribution of library sizes.
[x] Cutoff criteria for removing potentially problematic samples with low number of reads.
< 100,000 reads
< 30 bp
[x] Table with flags indicating which samples should or should not be removed due to library size criteria.
This notebook outputs a table with flags: ../../output/libsize_downstream_analysis.pkl
libsize: The total number of reads (for PE, R1 only).
len: Average read length (for PE, R1 only).
flag_low_libsize: Flag is True when library size ≤ 100,000 reads.
flag_short_read_len: Flag is True when read length ≤ 30 bp.