Open swinterbourne opened 3 years ago
The adapters used were: CTGTAGGCACCATCAAT. Some confusion as to whether there are UMI present due the methods being based on the Ingolia et al 2012 Nature protocols and some modifications on the rRNA removal and sample purification procedures which are described here: https://github.com/GCA-VH-lab/RiboSeqPy and https://github.com/GCA-VH-lab/Lab_protocols
As per @ewallace suggestions – troubleshooting to find out whether there are UMIs present or not:
CTGTAGGCACCATCAAT
adapter and no random nucleotides before that, create a config.yaml for that.I ran the Kasari dataset with a config.yaml where umi_regexp: null.
, assuming no UMIs present. It seemed to produce alignments that made sense when comparing the graphs that were produced with the PDFs produced from Weinberg et al (https://github.com/riboviz/example-datasets/issues/20#issuecomment-778411965).
However, for the 3ntframe_propbygene graph the proportions are very similar between frame 0 and 2.
PDFs for the run complete without UMIs are linked here: 3nt_periodicity.pdf 3ntframe_propbygene.pdf codon_ribodens.pdf features.pdf pos_sp_rpf_norm_reads.pdf read_lengths.pdf startcodon_ribogrid.pdf startcodon_ribogridbar.pdf
After a discussion with @FlicAnderson yesterday we have decided to leave this dataset on pause and move on to the next dataset that needs to be added (#55). If further investigation of the outputs are necessary (such as UMI troubleshooting) this will be completed in tandem with any other datasets that require similar investigation.
The full dataset does run to completion, but I think I've found some of the issues @swinterbourne discovered previously when running this to review the PR #63.
I can try and have another look at the methods info in the paper, but I think we're possibly reaching the "email the authors" stage.
I'd advise not closing this PR just yet until we resolve/confirm.
It'd probably also be helpful to re-run using the .html output (riboviz/#193) to allow easier comparison of outputs between samples?
Looking again at this (see comment on pull request) I decided to look over the code provided by the Hauryliuk & Atkinson labs.
Their pipeline part 1 includes alternately commented code in the cutadapt call, described lines 333-340:
#eEF3
CutAdapt = ["cutadapt","--discard-untrimmed", "-j", Params['cpu'], "-a","CTGTAGGCACCATCAAT","-o","2-Trimmed/","1-Raw/"]
CutAdapt[7] += iX + ".fastq.gz"
CutAdapt[8] += iX + ".fastq.gz"
#New1
#CutAdapt = ["cutadapt", "--discard-untrimmed", "-j", Params['cpu'], "-u", "3","-a", "NNNNCTGTAGGCACCATCAAT","-o", "2-Trimmed/", "1-Raw/"]
#CutAdapt[9] += iX + ".fastq.gz"
#CutAdapt[10] += iX + ".fastq.gz"
This probably means that some of the samples have a 4nt UMI while others do not. The riboviz pipeline is not flexible to using different adapters for different samples in the same run (which I do not view as a problem).
I have emailed the senior authors Gemma Atkinson and Vasili Hairyliuk to ask for clarification.
The pull request associated has been closed (branch Kasari-Sc-2019-config-48
is left live, but UNMERGED) as this dataset needs further work to identify the issues we've documented here.
Add the Kasari et al 2019 dataset
The Kasari et al 2019 dataset is from: Ribosome profiling analysis of eEF3-depleted Saccharomyces cerevisiae.
Link to paper: https://www.nature.com/articles/s41598-019-39403-y Link to Ribosome Protected Fragment reads (S1-S4) and RNA-seq reads (S5-S8), E-MTAB-6938: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6938/samples/?s_sortby=col_28&s_sortorder=ascending