lcalviell / ORFquant

An R package for Splice-aware quantification of translation using Ribo-seq data
GNU General Public License v3.0
15 stars 13 forks source link

ORFquant preprint clarifications #6

Closed nh13 closed 4 years ago

nh13 commented 4 years ago

Reading your excellent ORFquant preprint, I had a few points I'd like clarified:

Ribo-seq reads were stripped of their adapters using cutadapt

Reads aligning to rRNA, snoRNAs and tRNA sequences were removed with Bowtie2

Unaligned reads were then mapped with STAR using the hg38 genome and the GENCODE 25 annotation in GTF format

nh13 commented 4 years ago

I'd also be curious about your thoughts on ignoring 5' and 3' ends of the open reading frame from this paper:

For ribosome profiling, frequent read pile-ups are observed at the 5'- and 3'-ends of an open reading frame which are largely uninformative to a gene's translational efficiency [4]. While these pile-ups can be indicative of true translation dynamics [56], current best-practices have more recently settled on ignoring these regions during read quantification and calculations of translation efficiency [3, 21]. By providing the --truncate argument during reference curation, the 5'- and 3'-ends of each coding region will be recursively trimmed until the specified amounts are removed from coding space. A recursive strategy is required here as GTF file formats split the CDS (coding sequence of a gene) record into regions separated by introns. By default, 45 nt (nucleotides) will be trimmed from the 5'-ends and 15 nt from the 3'-ends recursively until the full length is removed from coding space, as is the current convention within the ribosome profiling field [3]. The resulting output file will then be used to process ribosome footprint libraries and their corresponding bulk RNA-Seq libraries. If generating a GTF file for use solely with general bulk RNA-Seq datasets, this file should not be truncated.
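For concreteness, here is a minimal sketch of the recursive trimming idea described in that passage (this is my own illustration, not the paper's implementation): a CDS split across exons is represented as a list of half-open genomic intervals in 5'→3' order (plus strand assumed), and trimming consumes whole blocks until the requested amounts are removed.

```python
def trim_cds(blocks, trim5=45, trim3=15):
    """Trim nucleotides from the 5' and 3' ends of a CDS that is split
    across exons, the way GTF CDS records are.

    blocks: list of (start, end) half-open intervals in 5'->3' order
            (+ strand assumed, for simplicity of this sketch).
    Returns the trimmed block list; empty if the CDS is shorter than
    trim5 + trim3.
    """
    blocks = [list(b) for b in blocks]
    # 5' end: consume whole blocks until the trim amount is exhausted,
    # then shorten the first remaining block.
    while trim5 > 0 and blocks:
        start, end = blocks[0]
        if end - start <= trim5:
            trim5 -= end - start
            blocks.pop(0)          # whole block removed; continue with the next
        else:
            blocks[0][0] = start + trim5
            trim5 = 0
    # 3' end: same logic, walking in from the last block.
    while trim3 > 0 and blocks:
        start, end = blocks[-1]
        if end - start <= trim3:
            trim3 -= end - start
            blocks.pop()
        else:
            blocks[-1][1] = end - trim3
            trim3 = 0
    return [tuple(b) for b in blocks]

# Example: a 120 nt CDS in three blocks (30 + 40 + 50 nt);
# trimming 45 + 15 nt leaves 60 nt of coding space.
print(trim_cds([(100, 130), (200, 240), (300, 350)]))
```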

lcalviell commented 4 years ago

Hi, here is a summary of the preprocessing I used.

Jurkat: cutadapt 1.8 with Python 3.4.3, command line parameters:

```
-m 18 -a CTGTAGGCACCATCAAT -o trim_cut.fastq -
```

HeLa, U2OS, HepG2 and K562: cutadapt 1.8 with Python 3.4.3, command line parameters:

```
-m 18 --discard-untrimmed -a TGGAATTCTCGGGTGCCAAGG -g GTTCAGAGTTCTACAGTCCGACGATC -o trim_cut.fastq -
```

HEK293: the fastq files in the GEO accession should already be trimmed; please let me know if I'm wrong. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73136

Bowtie 2 version 2.3.4.1 in --local alignment mode using default parameters. Human rRNA: https://www.ncbi.nlm.nih.gov/nuccore/555853. snoRNA, miRNA and tRNA sequences were downloaded using the UCSC Table Browser. I can provide the FASTA files if you need them, but I suspect there might be better ways to get those sequences.

STAR 2.6.0a with the following parameters (single pass):

```
STAR --genomeDir --readFilesIn --runThreadN 2 --outFilterMismatchNmax 2 \
  --outFilterMultimapNmax 20 --chimScoreSeparation 10 --chimScoreMin 20 \
  --chimSegmentMin 15 --outSAMattributes NH HI AS nM NM MD XS \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignSJoverhangMin 500 \
  --outFileNamePrefix $nameexp"" --outReadsUnmapped Fastx \
  --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \
  --outWigType bedGraph --outWigNorm RPM --outSAMstrandField intronMotif \
  --outSAMmultNmax 1 --outMultimapperOrder Random \
  --limitBAMsortRAM 19000000000 --outFilterMismatchNoverLmax 0.04
```

hg38 genome with no scaffolds. GENCODE version 25 annotation, which merges the HAVANA and Ensembl annotations.

lcalviell commented 4 years ago

Not a fan of deleting regions. What I often do is count only reads above a certain mapping length (>25 nt), which is usually more than enough to remove artifacts, IMHO. However, there's no perfect way, and heuristics like that can be helpful in some datasets (e.g. high pile-ups at the +5 codons, etc.).
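The length-filter alternative described above can be sketched as a predicate applied at counting time (a toy illustration with made-up footprint lengths, not code from ORFquant; the 25 nt cutoff follows the comment above):

```python
MIN_LEN = 26  # keep only footprints strictly longer than 25 nt

def keep_read(read_length, min_len=MIN_LEN):
    """Length filter applied when counting reads, instead of trimming
    the annotation: short footprints, which tend to dominate artifactual
    pile-ups, are simply not counted."""
    return read_length >= min_len

# Toy footprint lengths; typical ribosome footprints are ~28-30 nt.
lengths = [20, 24, 26, 28, 29, 30, 31, 34]
kept = [l for l in lengths if keep_read(l)]
print(kept)  # the 20 and 24 nt reads are dropped
```

In practice this predicate would be evaluated per aligned read (e.g. on the query length reported in the BAM file) before incrementing any per-ORF counter.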