shortRNAhub / shortRNA

short RNA-seq analysis package
GNU General Public License v3.0

Trimming and collapsing from R? #5

Open plger opened 6 years ago

plger commented 6 years ago

I'm not sure how cross-platform we can make this one, but ideally we'd like to enable the user to go all the way from raw fastq files to the shortRNAexp object entirely from R. This means:

dktanwar commented 6 years ago

Trimming is possible with the preprocessReads function from the QuasR package.

plger commented 6 years ago

The collapsing function should give a count matrix such as this one: seqs.counts.tar.gz (this one comes from the fastq in /IM/data/SSC/shortRNA/raw).

To reduce the number of useless sequences a bit, I normally exclude sequences that appear only once before merging the samples.
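The collapsing step described above can be sketched roughly as follows. This is a minimal, language-agnostic illustration in Python, not the package's R implementation: the function names (collapse_reads, merge_counts), the min_count parameter, and the toy reads are all hypothetical, and it assumes the singleton filter is applied per sample before merging.

```python
from collections import Counter


def collapse_reads(reads, min_count=2):
    """Count identical sequences in one sample and drop those seen
    fewer than min_count times (min_count=2 removes singletons)."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_count}


def merge_counts(per_sample):
    """Merge per-sample counts into a sequence-by-sample matrix,
    filling 0 where a sequence was absent (or filtered) in a sample."""
    samples = sorted(per_sample)
    seqs = sorted(set().union(*(c.keys() for c in per_sample.values())))
    matrix = [[per_sample[s].get(seq, 0) for s in samples] for seq in seqs]
    return seqs, samples, matrix


# Toy input: already-trimmed reads per sample (hypothetical data).
sample_reads = {
    "sampleA": ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "GGCC"],
    "sampleB": ["ACGT", "TTGA", "TTGA", "CCAA"],
}

collapsed = {name: collapse_reads(r) for name, r in sample_reads.items()}
seqs, samples, matrix = merge_counts(collapsed)

for seq, row in zip(seqs, matrix):
    print(seq, row)
```

One consequence of filtering before merging is that a sequence seen once in one sample and several times in another is recorded as 0 in the first sample. Whether that is acceptable depends on the downstream analysis.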

I also include here the alignment file for the same data, as well as the annotation. With these three files you have what's needed to create a shortRNAexp object (using the new version of the function). alignment_and_annotation.tar.gz

dktanwar commented 3 years ago

Hi @plger

Thoughts on Preprocessing of the fastq files:

Checking the Quality of the fastq files and generating a report

I have checked several tools for checking and plotting the quality of fastq files. These tools include:

In my opinion, the Rqc package is the one we can use for reporting the quality of the fastq files. Further, some of the plotting might be adapted from the fastqcr package.

Quality control of the fastq files (trimming N's, adapters and removing short reads)

I have checked several tools for quality control of the fastq files. These tools include:

But I think we have three problems here:

  1. Detection of adapter sequences.

    A solution to this problem could be to use plgINS::tryAdapters. However, it does not seem to be implemented for paired-end data (at least I didn't find it).

  2. Trimming based on quality scores.

    A solution to this is to use the code from page 5 of the ShortRead vignette. I think it would be easy to adapt.

  3. Trimming trailing and leading N's. I think this should be taken care of when we define the quality scores for trimming, but I am not sure.
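To make problems 2 and 3 concrete, here is a small sketch of quality-based tail trimming followed by removal of flanking N's and a minimum-length filter. It is written in Python purely for illustration. It is not the code from the ShortRead vignette, and the function names, the thresholds (min_q, min_len) and the Phred+33 offset are assumptions.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q.
    Assumes Phred+33 encoded quality strings."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def strip_flanking_n(seq, qual):
    """Remove leading and trailing N bases, keeping qualities in sync."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    if end <= start:  # read is empty or consists only of N's
        return "", ""
    return seq[start:end], qual[start:end]


def preprocess_read(seq, qual, min_q=20, min_len=15):
    """Quality-trim the tail, strip flanking N's, then apply a length filter.
    Returns None when the read ends up shorter than min_len."""
    seq, qual = trim_low_quality_tail(seq, qual, min_q=min_q)
    seq, qual = strip_flanking_n(seq, qual)
    if len(seq) < min_len:
        return None
    return seq, qual


# Toy read: N's at both ends with low quality ('#' is Q2, 'I' is Q40).
read = ("NNACGTACGTNN", "##IIIIIIII##")
tail_only = trim_low_quality_tail(*read)
cleaned = preprocess_read(*read, min_len=4)
print(tail_only)
print(cleaned)
```

In this toy case the tail trim removes the trailing N's because they carry low scores, but the leading N's survive it. That suggests a one-sided quality trim alone would not cover problem 3, so an explicit N-stripping step (or a two-sided quality trim) may still be needed.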

Now, here is what I think needs to be done:

  1. Adapt plgINS::tryAdapters for PE data.
  2. Adapt the preprocessReads function from QuasR and/or the adapter_filter function from FastqCleaner.
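As a rough idea of what adapter detection for paired-end data could look like, the sketch below screens a set of candidate adapters against each mate separately and reports the fraction of reads that contain the start of each adapter. It does not reproduce how plgINS::tryAdapters works. The function names, the seed_len parameter, the candidate labels and sequences, and the toy reads are all assumptions made for illustration, and the exact-match search ignores sequencing errors.

```python
def adapter_hit_fraction(reads, adapter, seed_len=12):
    """Fraction of reads containing the first seed_len bases of the adapter
    (exact match only; mismatches are not tolerated in this sketch)."""
    if not reads:
        return 0.0
    seed = adapter[:seed_len]
    return sum(seed in read for read in reads) / len(reads)


def screen_adapters(r1_reads, r2_reads, candidates, seed_len=12):
    """Return {adapter name: (fraction in mate 1, fraction in mate 2)}."""
    return {
        name: (
            adapter_hit_fraction(r1_reads, seq, seed_len),
            adapter_hit_fraction(r2_reads, seq, seed_len),
        )
        for name, seq in candidates.items()
    }


# Candidate adapter sequences, listed for illustration only;
# check them against the library preparation kit actually used.
candidates = {
    "candidate_1": "TGGAATTCTCGGGTGCCAAGG",
    "candidate_2": "AGATCGGAAGAGCACACGTCT",
}

# Toy paired-end reads (hypothetical data).
r1 = [
    "ACGTACGTTGGAATTCTCGGGTGCC",
    "TTTTGGGGTGGAATTCTCGGGT",
    "ACGTACGTACGTACGTACGT",
    "CCCCAAAATGGAATTCTCGG",
]
r2 = [
    "GGGGAGATCGGAAGAGCACA",
    "ACGTACGTACGT",
]

result = screen_adapters(r1, r2, candidates)
for name, (frac1, frac2) in result.items():
    print(name, frac1, frac2)
```

Reporting the two mates separately matters because the adapter read through at the 3' end of mate 2 is generally not the same sequence as the one in mate 1.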

Workflow of quality check and control

Quality check (with an HTML report) --> Check adapters (plot with plgINS::plotAdapterResults) --> Quality control --> Quality re-check (with an HTML report)


We can discuss it during our meeting!