Summarize library size

Story

Sequencing technology has changed a lot in the past 5 years. There has been a substantial increase in the number of reads produced by a single sequencing run, and an increase in the number of bases that can be sequenced per read. During the pre-pre-alignment workflow I estimate library size and calculate the average read length. I would like to remove overtly low-quality samples, meaning those that have a small number of reads or reads that are very short.

Here I look at these distributions and I can see that most samples have more than 1 million reads. There is a small number of samples (~2k) that have fewer than 100,000 reads. I have decided to flag these samples as having a low library size. If I further summarize to SRX, there are ~400 experiments with fewer than 100k total reads.

In addition, I looked at the average read length and found that most samples have a read length of either 50 bp or 75-100 bp. There are ~700 samples that have an average read length less than 30 bp. I have flagged these samples as having a low read length. Note that the alignment pipeline throws out individual reads that are ≤20 bp.
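As a rough text-only companion to the histogram, the library sizes can be tallied by order of magnitude. This is a minimal sketch, not the notebook's code: the function name `tally_by_magnitude`, the run IDs, and the read counts are made up for illustration.

```python
import math
from collections import Counter


def tally_by_magnitude(libsizes):
    """Count runs per power-of-ten bin of library size.

    `libsizes` maps a run ID to its read count. A key of k in the result
    means 10**k <= reads < 10**(k + 1); runs with zero reads go in bin -1.
    """
    counts = Counter()
    for n_reads in libsizes.values():
        exponent = math.floor(math.log10(n_reads)) if n_reads > 0 else -1
        counts[exponent] += 1
    return dict(sorted(counts.items()))


# Toy values keyed by hypothetical run IDs (not real data).
toy_libsizes = {
    "run_a": 73,
    "run_b": 95_000,
    "run_c": 2_500_000,
    "run_d": 4_000_000,
}
tally = tally_by_magnitude(toy_libsizes)
for exponent, n_runs in tally.items():
    print(f"10^{exponent} reads: {n_runs} run(s)")
```

Bins below 10^5 correspond to runs that would fall under the 100,000-read cutoff.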

Questions and Tasks

[x] Plot the distribution of library sizes by SRR
[x] Develop a library size cutoff.
[x] How many SRRs fall below our library size cutoff?
- 2,102 runs have fewer than 100,000 reads.
- SRR1992316 had the smallest library size (73 reads).
[x] How many SRRs fall below our read length cutoff?
- 697 runs had an average read length of less than 30 bp.
- ERR297221 had the smallest average read length (~10.02 bp).
[x] If we combine libraries within an SRX does that increase the number of samples that reach our cutoff?
- There are 369 SRXs that have fewer than 100,000 total reads.
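The SRX-level check amounts to summing the library sizes of all runs in an experiment and comparing the total to the same cutoff. A minimal sketch, assuming each run is available as an (SRX, SRR, library size) tuple; the helper name `srx_below_cutoff` and the toy IDs and counts are hypothetical.

```python
from collections import defaultdict

LIBSIZE_CUTOFF = 100_000  # reads, the cutoff chosen in this issue


def srx_below_cutoff(runs, cutoff=LIBSIZE_CUTOFF):
    """Sum reads per experiment and return the experiments still under the cutoff.

    `runs` is an iterable of (srx, srr, libsize) tuples.
    """
    totals = defaultdict(int)
    for srx, _srr, libsize in runs:
        totals[srx] += libsize
    return {srx: total for srx, total in totals.items() if total < cutoff}


# Toy example with made-up IDs: both runs of exp_1 are individually below
# the cutoff, but their combined total passes; exp_2 stays below.
toy_runs = [
    ("exp_1", "run_a", 60_000),
    ("exp_1", "run_b", 70_000),
    ("exp_2", "run_c", 40_000),
    ("exp_2", "run_d", 30_000),
]
low_srx = srx_below_cutoff(toy_runs)
print(low_srx)
```

In the toy data, combining runs rescues exp_1 but not exp_2, which is the situation the question above is asking about.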

Definition of done

[x] Plot of the distribution of library sizes.
[x] Cutoff criteria for removing potentially problematic samples with low number of reads.
- < 100,000 reads
- < 30 bp
[x] Table with flags indicating which samples should or should not be removed due to library size criteria.
- This notebook outputs a table with flags: ../../output/libsize_downstream_analysis.pkl
  - libsize: The total number of reads (for PE, R1 only).
  - len: Average read length (for PE, R1 only).
  - flag_low_libsize: Flag is True when library size ≤ 100,000 reads.
  - flag_short_read_len: Flag is True when read length ≤ 30 bp.
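The flag columns could be derived along these lines. This is a dependency-free sketch, not the notebook's code: it uses plain dicts in place of the table, the helper name `add_flags` and the toy rows are assumptions, and it follows the ≤ comparisons from the column descriptions (the cutoff criteria above are written with <, so the boundary behaviour is a guess).

```python
def add_flags(record, libsize_cutoff=100_000, len_cutoff=30):
    """Return a copy of `record` with the two flag columns added.

    `record` needs the keys "libsize" (read count) and "len" (average
    read length in bp). Both comparisons are inclusive (<=), matching
    the column descriptions.
    """
    flagged = dict(record)
    flagged["flag_low_libsize"] = record["libsize"] <= libsize_cutoff
    flagged["flag_short_read_len"] = record["len"] <= len_cutoff
    return flagged


# Toy rows with made-up values (not real data).
toy_rows = [
    {"libsize": 73, "len": 50.0},          # tiny library, normal read length
    {"libsize": 2_500_000, "len": 10.02},  # large library, very short reads
    {"libsize": 100_000, "len": 30.0},     # exactly on both boundaries
]
flagged_rows = [add_flags(row) for row in toy_rows]
for row in flagged_rows:
    print(row)
```

A row exactly on a boundary is flagged here; switching both comparisons to `<` would instead match the cutoff list.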

jfear / ncbi_remap

Summarize library size #41
