plger commented 6 years ago

OK, I'm not sure how cross-platform we can make this, but ideally we'd like to enable the user to go all the way from raw fastq files to the shortRNAexp object entirely from R. This means:

[ ] Function to obtain the possible adapter sequence
[ ] trimming a list of raw fastq files for adapters (that's lowest priority - if there's no good solution we can skip that...)
[x] collapsing each fastq file (what collapse.sh currently does)
[x] combining them into a count matrix (what collapsed2countMatrix.sh does)
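
To make the "collapsing" step concrete, here is a minimal sketch in Python (for illustration only; the package itself would do this in R, and this is not a transcription of collapse.sh). The function name `collapse_fastq` is hypothetical, and the sketch assumes standard four-line FASTQ records, optionally gzipped.

```python
import gzip
from collections import Counter


def collapse_fastq(path):
    """Count how often each distinct read sequence occurs in one FASTQ file.

    Assumes every record spans exactly four lines (no wrapped sequences).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the 2nd line of each record
                counts[line.strip()] += 1
    return counts
```

The result is one sequence-to-count mapping per file, which is the input the merging step needs.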

dktanwar commented 6 years ago

Trimming is possible with the `preprocessReads` function from the QuasR package.

plger commented 6 years ago

The collapsing function should give a count matrix such as this one: seqs.counts.tar.gz (it comes from the fastq files in /IM/data/SSC/shortRNA/raw).

To reduce the number of uninformative sequences somewhat, I normally exclude sequences that appear only once before merging.
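
As an illustration of the merge with singleton filtering, here is a minimal Python sketch. The function name `counts_to_matrix` and the `min_count` parameter are hypothetical, and it assumes "appear only once" means a count of 1 within a given sample, which is one possible reading.

```python
from collections import Counter


def counts_to_matrix(sample_counts, min_count=2):
    """Merge per-sample sequence counts into a sequences x samples matrix.

    sample_counts: dict mapping sample name -> Counter of sequence -> count.
    Sequences with a count below min_count in a sample are dropped from
    that sample before merging (assumed meaning of "appear only once").
    """
    samples = sorted(sample_counts)
    filtered = {
        s: {seq: n for seq, n in sample_counts[s].items() if n >= min_count}
        for s in samples
    }
    # Union of all sequences that survived filtering in at least one sample.
    all_seqs = sorted(set().union(*(f.keys() for f in filtered.values())))
    # Rows are sequences, columns are samples; missing entries become 0.
    matrix = [[filtered[s].get(seq, 0) for s in samples] for seq in all_seqs]
    return all_seqs, samples, matrix
```

Filtering before the merge keeps the union of sequences small, which matters because singletons typically dominate the number of distinct sequences.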

I also include here the alignment file for the same data, as well as the annotation. With these three files you have what's needed to create a shortRNAexp object (using the new version of the function). alignment_and_annotation.tar.gz

dktanwar commented 3 years ago

Hi @plger

Thoughts on Preprocessing of the `fastq` files:

Checking the Quality of the `fastq` files and generating a report

I have checked several tools for checking and plotting the quality of fastq files. These tools include:

Rqc: It reads the fastq file and plots the results.
ShortRead: Plotting is missing.
fastqcr: Uses the results from the FastQC tool to make plots.

In my opinion, the Rqc package is the one we can use for reporting the quality of the fastq files. In addition, some of the plotting could be adapted from the fastqcr package.

Quality control of the `fastq` files (trimming N's, adapters and removing short reads)

I have checked several tools for quality control of fastq files. These tools include:

Rbowtie2: It is an R wrapper for Bowtie2 and AdapterRemoval, and it provides AdapterRemoval binaries for Mac, Linux and Windows. I am not sure whether we should use it, because it relies on a tool not written in R.
QuasR: This looks good to me.
ShortRead: It might take a bit more work to adapt and use for quality control.
FastqCleaner: It is a wrapper around ShortRead. So, also looks good to me.

But I think we have three problems here:

1. Detection of adapter sequences. A solution could be to use `plgINS::tryAdapters`, but it is not implemented for paired-end data (at least, I didn't find it).
2. Trimming based on quality scores. A solution is to use the code from page 5 of the ShortRead vignette. I think it would be easy to adapt.
3. Trimming trailing and leading N's. I think that should be taken care of when we define the quality scores for trimming, but I am not sure.
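
To illustrate how quality trimming and N trimming interact, here is a minimal Python sketch for a single read. It is a simple stand-in, not the code from the ShortRead vignette: the function name `trim_read` and all thresholds are hypothetical, and it assumes Phred+33 quality encoding. N calls often carry very low quality scores, so quality trimming will frequently remove them, but the sketch strips N's explicitly rather than relying on that.

```python
def trim_read(seq, qual, min_qual=20, min_length=15, phred_offset=33):
    """Trim leading/trailing N's and low-quality 3' bases from one read.

    Returns (seq, qual) after trimming, or None if the read ends up
    shorter than min_length. All thresholds are illustrative defaults.
    """
    # Strip N's from both ends, keeping the quality string aligned.
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    seq, qual = seq[start:end], qual[start:end]

    # Cut at the first base (5' to 3') whose Phred score is below min_qual.
    for i, ch in enumerate(qual):
        if ord(ch) - phred_offset < min_qual:
            seq, qual = seq[:i], qual[:i]
            break

    # Quality trimming can expose new trailing N's; strip them again.
    end = len(seq.rstrip("N"))
    seq, qual = seq[:end], qual[:end]

    return (seq, qual) if len(seq) >= min_length else None
```

Doing the N trimming as its own step makes the behaviour independent of how a given sequencer scores N calls.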

Now, here is what I think needs to be done:

Adapt `plgINS::tryAdapters` for PE data.
Adapt the `preprocessReads` function from QuasR and/or the `adapter_filter` function from FastqCleaner.
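
For the adapter check, the general idea can be sketched as follows in Python. This does not reproduce `plgINS::tryAdapters`, whose internals are not described in this thread; the function name `score_adapters` and the `min_overlap` parameter are hypothetical. The sketch scores each candidate adapter by the fraction of reads that contain an exact match to its first bases. For paired-end data, one assumed approach is to run it separately on each mate file with the adapter expected on that mate.

```python
def score_adapters(reads, candidates, min_overlap=10):
    """Return the fraction of reads containing the start of each adapter.

    reads: iterable of read sequences.
    candidates: dict mapping adapter name -> adapter sequence.
    Only the first min_overlap bases of each adapter are searched, as an
    exact match, which is a deliberate simplification.
    """
    reads = list(reads)
    hits = {name: 0 for name in candidates}
    for seq in reads:
        for name, adapter in candidates.items():
            if adapter[:min_overlap] in seq:
                hits[name] += 1
    n = len(reads)
    return {name: (hits[name] / n if n else 0.0) for name in hits}
```

Running this on a subsample of reads is usually enough to see which candidate, if any, dominates.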

Workflow of quality check and control

Quality check (with an HTML report) --> Check adapters (plot with `plgINS::plotAdapterResults`) --> Quality control --> Quality re-check (with an HTML report)

We can discuss it during our meeting!

shortRNAhub / shortRNA

Trimming and collapsing from R? #5
