Closed nh13 closed 4 years ago
I'd also be curious about your thoughts on ignoring the 5′ and 3′ ends of the open reading frame, from this paper:
For ribosome profiling, frequent read pile-ups are observed at the 5′- and 3′-ends of an open reading frame which are largely uninformative to a gene's translational efficiency [4]. While these pile-ups can be indicative of true translation dynamics [56], current best-practices have more recently settled on ignoring these regions during read quantification and calculations of translation efficiency [3, 21]. By providing the --truncate argument during reference curation, the 5′- and 3′-ends of each coding region will be recursively trimmed until the specified amounts are removed from coding space. A recursive strategy is required here as GTF file formats split the CDS (coding sequence of a gene) record into regions separated by introns. By default, 45 nt (nucleotides) will be trimmed from the 5′-ends and 15 nt from the 3′-ends recursively until the full length is removed from coding space, as is the current convention within the ribosome profiling field [3]. The resulting output file will then be used to process ribosome footprint libraries and their corresponding bulk RNA-Seq libraries. If generating a GTF file for use solely with general bulk RNA-Seq datasets, this file should not be truncated.
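The recursive trimming the excerpt describes can be sketched in a few lines. This is a hypothetical illustration, not the tool's actual code: it assumes the CDS is given as half-open `(start, end)` exon intervals in 5′→3′ transcript order, and removes whole intervals (recursing into the next one) whenever an end interval is shorter than the amount left to trim.

```python
def truncate_cds(exons, trim5=45, trim3=15):
    """Trim `trim5` nt from the 5' end and `trim3` nt from the 3' end of a
    CDS that is split across exons by introns. Illustrative sketch only."""
    exons = list(exons)

    # Trim from the 5' end, dropping whole exons until the budget is spent.
    while trim5 > 0 and exons:
        s, e = exons[0]
        if e - s <= trim5:        # whole exon consumed: drop it, keep trimming
            trim5 -= e - s
            exons.pop(0)
        else:
            exons[0] = (s + trim5, e)
            trim5 = 0

    # Same procedure from the 3' end.
    while trim3 > 0 and exons:
        s, e = exons[-1]
        if e - s <= trim3:
            trim3 -= e - s
            exons.pop()
        else:
            exons[-1] = (s, e - trim3)
            trim3 = 0

    return exons

# A 30 nt first exon is entirely consumed by the 45 nt 5' trim, so the
# remaining 15 nt come off the second exon; the 15 nt 3' trim fits in the
# last exon.
print(truncate_cds([(0, 30), (100, 160), (200, 230)]))
# → [(115, 160), (200, 215)]
```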
Hi, here a summary for the preprocessing I used.
Jurkat: cutadapt 1.8 with Python 3.4.3. Command-line parameters: `-m 18 -a CTGTAGGCACCATCAAT -o trim_cut.fastq -`
HeLa, U2OS, HepG2 and K562: cutadapt 1.8 with Python 3.4.3. Command-line parameters: `-m 18 --discard-untrimmed -a TGGAATTCTCGGGTGCCAAGG -g GTTCAGAGTTCTACAGTCCGACGATC -o trim_cut.fastq -`
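For readers unfamiliar with these flags: `-a` cuts at a 3′ adapter, `-m 18` drops reads shorter than 18 nt after trimming, and `--discard-untrimmed` keeps only reads where an adapter was found. A deliberately naive Python stand-in (exact-match only; real cutadapt also handles partial and mismatched adapter hits) shows the core logic:

```python
def trim_3p_adapter(read, adapter="CTGTAGGCACCATCAAT", min_len=18):
    """Rough stand-in for `cutadapt -a ADAPTER -m 18`: cut the read at the
    first exact occurrence of the 3' adapter, then drop it if what remains
    is shorter than `min_len`. Returns the trimmed read or None."""
    i = read.find(adapter)
    if i != -1:
        read = read[:i]               # keep only the insert upstream of the adapter
    return read if len(read) >= min_len else None

# A 28 nt insert followed by the adapter survives trimming...
print(trim_3p_adapter("ACGT" * 7 + "CTGTAGGCACCATCAAT"))
# ...while an 8 nt insert is discarded by the minimum-length filter.
print(trim_3p_adapter("ACGTACGT" + "CTGTAGGCACCATCAAT"))  # → None
```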
HEK293: the FASTQ files in the GEO accession should already be trimmed; please let me know if I'm wrong. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73136
Bowtie 2 version 2.3.4.1 in `--local` alignment mode using default parameters. Human rRNA: https://www.ncbi.nlm.nih.gov/nuccore/555853. snoRNA, miRNA and tRNA sequences were downloaded using the UCSC Table Browser. I can provide the FASTA files if you need them, but I suspect there might be better ways to get those sequences.
STAR 2.6.0a with the following parameters (single pass):
STAR --genomeDir
hg38 genome with no scaffolds. GENCODE version 25 annotation, which merges the HAVANA and Ensembl annotations.
I'm not a fan of deleting regions. What I often do instead is count only reads over a certain mapping length (>25 nt), which is usually more than enough to remove artifacts, IMHO. There's no perfect way, though, and heuristics like that can be helpful in some datasets (e.g. high pileups at the +5 codons, etc.).
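The length-based alternative above amounts to a filter applied during quantification rather than a change to the annotation. A minimal sketch, assuming alignments have already been reduced to `(gene_id, read_length)` pairs (in practice you would pull these from a BAM file):

```python
from collections import Counter

def count_long_footprints(alignments, min_len=26):
    """Count reads per gene, keeping only footprints of at least `min_len`
    nt (i.e. the ">25 nt" heuristic). `alignments` is an iterable of
    (gene_id, read_length) pairs; a stand-in for a real BAM parser."""
    counts = Counter()
    for gene, length in alignments:
        if length >= min_len:         # short reads are treated as artifacts
            counts[gene] += 1
    return counts

# The 20 nt read on GENE_A is excluded from its count.
print(count_long_footprints([("GENE_A", 28), ("GENE_A", 20), ("GENE_B", 30)]))
# → Counter({'GENE_A': 1, 'GENE_B': 1})
```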
Reading your excellent ORFquant preprint, I had a few clarifications: what `cutadapt` command did you use (version too)?