Troubleshoot and Add Kasari et al 2019 Saccharomyces cerevisiae dataset

swinterbourne commented 3 years ago

Add the Kasari et al 2019 dataset

The Kasari et al 2019 dataset is from: Ribosome profiling analysis of eEF3-depleted Saccharomyces cerevisiae.

Link to paper: https://www.nature.com/articles/s41598-019-39403-y Link to Ribosome Protected Fragment reads (S1-S4) and RNA-seq reads (S5-S8), E-MTAB-6938: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6938/samples/?s_sortby=col_28&s_sortorder=ascending

swinterbourne commented 3 years ago

The adapters used were: CTGTAGGCACCATCAAT. Some confusion as to whether there are UMI present due the methods being based on the Ingolia et al 2012 Nature protocols and some modifications on the rRNA removal and sample purification procedures which are described here: https://github.com/GCA-VH-lab/RiboSeqPy and https://github.com/GCA-VH-lab/Lab_protocols

As per @ewallace suggestions – troubleshooting to find out whether there are UMIs present or not:

Assume the dataset just has the CTGTAGGCACCATCAAT adapter and no random nucleotides before that, create a config.yaml for that.
Downsample a dataset and run through riboviz, see if we get sensible alignments. If random nucleotides are present in the read that were not removed, then this would prevent the alignment of almost all reads (to either rRNA or mRNA), leading to the problem of no aligned reads experienced in Weinberg et al (https://github.com/riboviz/example-datasets/issues/20)
If there are no sensible alignments: find the overrepresented sequences in a FastQC on the input fastq, those are likely to be yeast rRNA. Can align to the yeast genome by BLAST and check if there's a read structure rRNAfragment-randomthing-CTATACCCACCATACATT. Then the random thing is going to be the length of the UMI. Remember you can use the read id to pair entries in the fastq and the .bam if that helps.
If there are no sensible alignments try UMIs: put in the best-guess length of UMI to the config.yaml (in another branch if helpful). Run through riboviz on a downsampled dataset.
If there are plenty of sensible alignments and everything looks good: scale up to the full dataset.
If it still does not make sense: email authors.

swinterbourne commented 3 years ago

I ran the Kasari dataset with a config.yaml where umi_regexp: null., assuming no UMIs present. It seemed to produce alignments that made sense when comparing the graphs that were produced with the PDFs produced from Weinberg et al (https://github.com/riboviz/example-datasets/issues/20#issuecomment-778411965).

However, for the 3ntframe_propbygene graph the proportions are very similar between frame 0 and 2.

PDFs for the run complete without UMIs are linked here: 3nt_periodicity.pdf 3ntframe_propbygene.pdf codon_ribodens.pdf features.pdf pos_sp_rpf_norm_reads.pdf read_lengths.pdf startcodon_ribogrid.pdf startcodon_ribogridbar.pdf

swinterbourne commented 3 years ago

After a discussion with @FlicAnderson yesterday we have decided to leave this dataset on pause and move on to the next dataset that needs to be added (#55). If further investigation of the outputs are necessary (such as UMI troubleshooting) this will be completed in tandem with any other datasets that require similar investigation.

FlicAnderson commented 3 years ago

The full dataset does run to completion, but I think I've found some of the issues @swinterbourne discovered previously when running this to review the PR #63.

I can try and have another look at the methods info in the paper, but I think we're possibly reaching the "email the authors" stage.

I'd advise not closing this PR just yet until we resolve/confirm.

It'd probably also be helpful to re-run using the .html output (riboviz/#193) to allow easier comparison of outputs between samples?

ewallace commented 3 years ago

Looking again at this (see comment on pull request) I decided to look over the code provided by the Hauryliuk & Atkinson labs.

Their pipeline part 1 includes alternately commented code in the cutadapt call, described lines 333-340:

#eEF3
CutAdapt     = ["cutadapt","--discard-untrimmed", "-j", Params['cpu'], "-a","CTGTAGGCACCATCAAT","-o","2-Trimmed/","1-Raw/"]
CutAdapt[7] += iX + ".fastq.gz"
CutAdapt[8] += iX + ".fastq.gz"
#New1
#CutAdapt = ["cutadapt", "--discard-untrimmed", "-j", Params['cpu'], "-u", "3","-a", "NNNNCTGTAGGCACCATCAAT","-o", "2-Trimmed/", "1-Raw/"]
#CutAdapt[9] += iX + ".fastq.gz"
#CutAdapt[10] += iX + ".fastq.gz"

This probably means that some of the samples have a 4nt UMI while others do not. The riboviz pipeline is not flexible to using different adapters for different samples in the same run (which I do not view as a problem).

I have emailed the senior authors Gemma Atkinson and Vasili Hairyliuk to ask for clarification.

FlicAnderson commented 3 years ago

The pull request associated has been closed (branch Kasari-Sc-2019-config-48 is left live, but UNMERGED) as this dataset needs further work to identify the issues we've documented here.

riboviz / example-datasets

Troubleshoot and Add Kasari et al 2019 Saccharomyces cerevisiae dataset #48