jfear / ncbi_remap

This is the drosSRA project, where we are remapping all Drosophila melanogaster RNA-seq data to FlyBase release 6 and updating annotations.

Summarize library size #41

Closed jfear closed 7 years ago

jfear commented 7 years ago

Story

Sequencing technology has changed a lot in the past 5 years. There has been a substantial increase in both the number of reads produced by a single sequencing run and the number of bases that can be sequenced per read. During the pre-pre-alignment workflow I estimate library size and calculate the average read length. I would like to remove overtly low-quality samples, meaning those with a small number of reads or with very short reads.

Looking at these distributions, I can see that most samples have more than 1 million reads. A small number of samples (~2k) have fewer than 100,000 reads, and I have decided to flag these as having a low library size. If I further summarize to the SRX level, there are ~400 experiments with fewer than 100,000 total reads.
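As a rough illustration of the library size flag, here is a minimal sketch. It assumes that a "sample" is a single run (SRR) and that library sizes are available as `(srx, srr, libsize)` tuples. The function name, the record layout, and the example accessions are hypothetical and are not taken from the ncbi_remap code; only the 100,000-read threshold comes from the text above.

```python
from collections import defaultdict

# Threshold from the text above: fewer than 100,000 reads counts as "low".
LOW_LIBSIZE = 100_000


def flag_low_libsize(samples, threshold=LOW_LIBSIZE):
    """Flag runs and experiments whose read counts fall below `threshold`.

    `samples` is an iterable of (srx, srr, libsize) tuples. This layout is an
    assumption made for the sketch, not the project's actual data model.
    Returns (low_runs, low_experiments) as two sets of accessions.
    """
    low_runs = set()
    srx_totals = defaultdict(int)
    for srx, srr, libsize in samples:
        # Accumulate reads per experiment so SRX can be judged on its total.
        srx_totals[srx] += libsize
        if libsize < threshold:
            low_runs.add(srr)
    low_experiments = {
        srx for srx, total in srx_totals.items() if total < threshold
    }
    return low_runs, low_experiments


# Made-up records, for illustration only.
example = [
    ("SRX0000001", "SRR0000001", 60_000),
    ("SRX0000001", "SRR0000002", 70_000),
    ("SRX0000002", "SRR0000003", 5_000),
    ("SRX0000003", "SRR0000004", 2_500_000),
]
low_runs, low_experiments = flag_low_libsize(example)
```

In this sketch an experiment can pass the SRX-level check even when each of its runs is flagged individually, because the runs are summed first. That is one plausible reason the SRX count (~400) is lower than the sample count (~2k).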

In addition, I looked at the average read length and found that most samples have a read length of either 50 bp or 75-100 bp. There are ~700 samples with an average read length of less than 30 bp. I have flagged these samples as having a low read length. Note that the alignment pipeline throws out individual reads that are ≤20 bp.
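The read length flag could be sketched in the same way. This assumes average read lengths are held in a plain mapping from sample accession to mean length in bp. The function name and the example values are hypothetical; only the 30 bp cutoff comes from the text above.

```python
# Cutoff from the text above: an average read length below 30 bp is "low".
LOW_READ_LEN = 30


def flag_low_read_length(avg_lengths, threshold=LOW_READ_LEN):
    """Return the set of samples whose average read length is below `threshold`.

    `avg_lengths` maps a sample accession to its mean read length in bp.
    This mapping is an assumed layout for the sketch.
    """
    return {
        sample for sample, avg_len in avg_lengths.items() if avg_len < threshold
    }


# Made-up values, for illustration only.
example_lengths = {
    "SRR0000001": 25.4,
    "SRR0000002": 50.0,
    "SRR0000003": 30.0,  # exactly 30 bp is not "less than 30", so not flagged
    "SRR0000004": 100.0,
}
low_read_len = flag_low_read_length(example_lengths)
```

This sample-level flag is separate from the per-read filter in the alignment pipeline, which drops individual reads of ≤20 bp regardless of the sample's average.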

Questions and Tasks

Definition of done