tgirke / systemPipeR

Project Website:
http://girke.bioinformatics.ucr.edu/systemPipeR
52 stars 38 forks source link

RNAseq workflow mapping human genome #81

Closed antoine4ucsd closed 4 months ago

antoine4ucsd commented 9 months ago

Hello thank you for providing such an impressive platform I am new with RNAseq data processing. I have a dataset of 80 samples that need to be processed, mapped to human genome, and anlayses for DE between groups. I wanted to start witih a couple of samples but I encounter a few issue/limitations. I will try to give you details below

systemPipeRdata::genWorkenvir(workflow = "new", mydirname = "spr_project")
setwd("spr_project")

My data are saved in a specifc folder so I created my now targetPE file targetsPE.example.txt

targetspath <- "targetsPE.example.txt"
targets <- read.delim(targetspath, comment.char = "#")
targets[1:2, -c(5, 6)]

Next I start preparing the workflow

library(systemPipeR)
sal <- SPRproject()

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'GenomicFeatures", "BiocParallel", "DESeq2",
       "ape", "edgeR", "biomaRt", "pheatmap","ggplot2'\n"), sep = "', '")
###pre-end
appendStep(sal) <- LineWise(code = {
                library(systemPipeR)
                }, step_name = "load_SPR")

Read preprocessing

appendStep(sal) <- SYSargsList(
    step_name = "preprocessing",
    targets = "targetsPE.example.txt", dir = TRUE,
    wf_file = "preprocessReads/preprocessReads-pe.cwl",
    input_file = "preprocessReads/preprocessReads-pe.yml",
    dir_path = system.file("extdata/cwl", package = "systemPipeR"),
    inputvars = c(
        FileName1 = "_FASTQ_PATH1_",
        FileName2 = "_FASTQ_PATH2_",
        SampleName = "_SampleName_"
    ),
    dependency = c("load_SPR"))

# comment:  I add to manually add the filterFct function in the file ./param/docopt.R/preprocessReads/preprocessReads.doc.R
appendStep(sal) <- LineWise(
    code = {
        filterFct <- function(fq, cutoff = 20, Nexceptions = 0) {
            qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm = TRUE)
            # Retains reads where Phred scores are >= cutoff with N exceptions
            fq[qcount <= Nexceptions]
        }
        save(list = ls(), file = "param/customFCT.RData")
    },
    step_name = "custom_preprocessing_function",
    dependency = "preprocessing"
)

trimmomatic : while the script work with trimmomatic , I am trying without it first

quality report

this step is working. definitely not good output format when many format. would be better to save by sample IMO.

appendStep(sal) <- LineWise(code = {
                fastq <- getColumn(sal, step = "preprocessing", "targetsWF", column = 1)
                fqlist <- seeFastq(fastq = fastq, batchsize = 10000, klength = 8)
                png("./results/fastqReport.png", height = 162* length(fqlist)/2, width = 288 * length(fqlist))
                seeFastqPlot(fqlist)
                dev.off()
                }, step_name = "fastq_report", 
                dependency = "preprocessing")

Alignments

Read mapping with HISAT2

Here, HiSAT2 can work without error but I would need some guidance to align to hg. any chance you have a template? I do have each chromosome in fasta file and I can also download indexes here https://daehwankimlab.github.io/hisat2/download/

appendStep(sal) <- SYSargsList(
    step_name = "hisat2_index", 
    dir = FALSE, 
    targets=NULL, 
    wf_file = "hisat2/hisat2-index.cwl", 
    input_file="hisat2/hisat2-index.yml",
    dir_path="param/cwl", 
    dependency = "load_SPR"
)

appendStep(sal) <- SYSargsList(
    step_name = "hisat2_mapping",
    dir = TRUE, 
    targets ="preprocessing",
    wf_file = "workflow-hisat2/workflow_hisat2-pe.cwl",
    input_file = "workflow-hisat2/workflow_hisat2-pe.yml",
    dir_path = "param/cwl",
    inputvars = c(preprocessReads_1 = "_FASTQ_PATH1_", preprocessReads_2 = "_FASTQ_PATH2_", 
                  SampleName = "_SampleName_"),
    rm_targets_col = c("FileName1", "FileName2"), 
    dependency = c("preprocessing", "hisat2_index")
)

but my understanding is that it aligns to the ref in the default data folder (tair10.fasta)?

so my resulting alignemnt is as follow (~1% align)

read.table("results/alignStats.xls", header = TRUE)[1:4, ]

FileName Nreads2x Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary 1 D8_C1 67737666 7154 0.010561 7154 0.010561 2 D8_CPLUSI1 45082378 5210 0.011557 5210 0.011557

and after changing the ref genome , I would also need some guidance to adapt the cmdline below

```{r create_db, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
    code = {
        library(GenomicFeatures)
        txdb <- suppressWarnings(makeTxDbFromGFF(file="data/tair10.gff", format="gff", dataSource="TAIR", organism="Arabidopsis thaliana"))
        saveDb(txdb, file="./data/tair10.sqlite")
        }, 
    step_name = "create_db", 
    dependency = "hisat2_mapping")

I know this is a lot but I was hoping other may have the same kind of data to process....

lz100 commented 8 months ago

The major problem here is the mapping step. Example code is for tair10, you need to change it. For this step,

  1. Download human genome fasta (hg38 or hg19) to the data directory. Not the per-chromosome fasta, the whole genome fasta, unless you only want to align to a single chromosome.
  2. in workflow-hisat2/workflow_hisat2-pe.yml replace the tair reference path to your new ref genome path, like data/hg38.fasta. The index hg38.fasta.fai file should also be placed in the data.
  3. for the create_db step, similarly, download hg38.gff and replace the file, dataSource, and organism

Reading this page will help you to understand how the yml file is working in a workflow step: https://systempipe.org/sp/spr/cwl/cwl_and_spr/

@tgirke Would you mind giving additional guidance if any?

antoine4ucsd commented 8 months ago

thank you for taking the time to get back to me.

tgirke commented 4 months ago

closed