Open ewallace opened 4 years ago
Meeting with Isabel organised for Friday 02 Oct ~10:30am to discuss the dataset, and ask/answer any basic set-up questions.
Edited initial comment to fix the check-boxes, since they're super-handy.
I've created branch drosophila-aspden-10
and added some file structure within that:
example-datasets/
Within contaminants, I've added: 1) Drosophila_melanogaster_fly_rRNA.fasta This file was created as follows: The Drosophila melanogaster 'D_mel_rRNA.fasta' file for rRNA was downloaded from Ribogalaxy Data Library Original Filename: D_mel_rRNA.fasta Downloaded from: https://ribogalaxy.ucc.ie/library_common/download_dataset_from_folder?library_id=03501d7626bd192f&cntrller=library&use_panels=False&id=52e496b945151ee8 File Information: https://ribogalaxy.ucc.ie/library_common/ldda_info?library_id=03501d7626bd192f&show_deleted=False&cntrller=library&folder_id=03501d7626bd192f&use_panels=False&id=52e496b945151ee8 The file was then moved and renamed. Current Filename: Drosophila_melanogaster_fly_rRNA.fasta
2) Drosophila_melanogaster_fly_tRNA_r6-35.fasta This file was created as follows: The Drosophila melanogaster genome release 'dmel_r6.35' file for tRNA was downloaded from Flybase.net Original Filename:dmel-all-tRNA-r6.35.fasta.gz Downloaded from: ftp://ftp.flybase.net/releases/FB2020_04/dmel_r6.35/fasta/dmel-all-tRNA-r6.35.fasta.gz The file was then unzipped and the fasta file moved and renamed. Current Filename: Drosophila_melanogaster_fly_tRNA_r6-35.fasta
3) 2x _provenance.txt files; one for each of those files, containing the information I've added here.
:question: Given that riboviz takes a single contaminants file, I guess that there might be a bit of work involved in concatonating both files into one? Is this correct @ewallace ? Hopefully their formats will be similar, but that's something I can check.
:question: I'll definitely check in with Isabel tomorrow to see if these are the best files to use for rRNA
and tRNA
, or if I've pulled out the wrong ones.
:question: There's an overwhelming amount of different options for fasta/gff files on the genome release on flybase.net, but not rRNA - did I overlook one? Have got the Ribogalaxy data library version, but I'll need to do more work to find out where that came from.
@isabelbirds Tagging you in - hope you get this!
Notes from chat:
[x] tRNA and rRNA correct files.
riboviz is transcriptome/canonical ORFs only at the moment.
Does have an option to include RNA-seq, not tested recently. This may be helpful for resolving multi-mapping reads, allows to check overlap of splice sites. Other software tackles this by removing ORFs that have a given % only supported by multimapping coverage.
❓ @FlicAnderson by buffer region do you mean 3' and 5' UTR lengths? If so, solution could be to use the 3/5'UTR fastas from Flybase to get buffer length?
ORFeome files:
gff
fasta file
Non ATG starts:
Dicistronic transcripts:
Ribo/RNA-seq data:
To do:
Update - Aspden dataset doesn't contain UMIs.
@IsabelBirds Thanks for checking the UMI stuff and getting those other details back to me. Super useful!
RE: buffer region, we describe it as "Length of flanking region around the CDS" and we've treated this as UTR length, yep. But, we've been presently using 1 value for this (E.g. using '250' for S.cerevisiae to match our current 'best guess' annotation files.)
In terms of using the 3' and 5' UTR lengths from the Flybase annotation files, that's a good shout but I don't think that would work in our code as-is, since we're expecting this one value (applying it on both sides, and for everything!).
I'm not 100% clear what our existing plans are for using specific UTR lengths for each gene, as opposed to uniformly applying this one generic value, which is where we currently are. @ewallace is this something you have any ideas on?
@FlicAnderson I was thinking that using 3' and 5' UTR annotations might actually work as good generic fix, not necessarily to get specific lengths for each transcript- unless it's a value that doesn't actually make that much difference to results.
i.e. for a given species take these files, plot distribution of UTR lengths to check for any weirdness/outliers, and take the overall average for the buffer length value. Removes some of the guessing if you're using a public dataset?
@IsabelBirds Oooh, that's a good idea. I was thinking about doing that if we needed a sort of temporary number to work with in the short term.
I'm not 100% clear what our existing plans are for using specific UTR lengths for each gene, as opposed to uniformly applying this one generic value, which is where we currently are. @ewallace is this something you have any ideas on?
Riboviz itself doesn't care how you choose UTR lengths, whether biologically realistic or not. It currently will quantify one ORF per sequence supplied in the "transcript fasta file", and the ORF positions are described in the gff. This could be a canonical ORF with annotated UTRs from data, or any ORF you want with any buffer size you want, e.g. minimally a 18nt-padding at each and to keep track of fragments at start and stop codons. So which makes sense here depends on ease of setup and the scientific question.
I'd suggest first getting it working on Drosophila for just canonical ORFs with fixed-length padding if that is easier. If you have an annotation with "real" UTR lengths and it's easier to do that, it's of course better. Then after we run on full-length real transcripts, we can figure out how to adjust the code to quantify multiple ORFs per transcript, which is issue riboviz#225.
Tagging in the issue riboviz/#83 because this is a key thing to be able to communicate to users :) @ewallace will consider if there's a better naming option
Also a reminder to self to create some downsampled data to work with FIRST, to get things working asap and help troubleshooting.
Concatenated the rRNA and tRNA files into one contaminants file: Drosophila_melanogaster_fly_rRNA_tRNA.fasta
and added Drosophila_melanogaster_fly_rRNA_tRNA_provenance.txt
with details on how it was created and which rRNA and tRNA files were used.
@FlicAnderson to make a plan on how to approach getting the GFF and Fasta files into riboviz format.
Also worth me setting this dataset up on Eddie so I can run things on there, rather than working from my local machine.
Worth noting that running check_fasta_gff.py
locally took a long time, which is why I had started setting up to do this on Eddie, freeing up my comp for other things. I ran wget
to download both files into a folder on my drive, since I don't expect them to increase in size just from running check_fasta_gff.py
:grin: I'll obviously put them into a sensible place when I attempt to run them.
I don't know if it's worthwhile committing the fasta
and gff
files to the example-datasets
repo until I know more about the formatting and how it works or doesn't work with riboviz, because presumably that would add size to the repo unnecessarily.
I also need to subsample the fastq
file to get something smaller to run on. This is something that I could probably do reasonably quickly once I concentrate on doing that. This might be a task for Tuesday afternoon before/after the dev meeting (03rd Nov).
Planning to work on the fasta and gff formats on afternoon Wednesday 04th Nov, ahead of next meeting with Isabel on Thurs 05th at 1600.
@FlicAnderson from what I can tell, the current gff formatting script creates buffers directly up and downstream of the CDS, i.e. regardless of intron/exon structure. Is this approach ok or do I need to ensure that I'm only pulling buffer from exon sequence? It is a lot more straightforward to carry on like this, but not sure how this would affect the rest of the pipeline.
Also following on from what @ewallace said, for the majority of ORFs it is easier to pull the canonical UTRs and add buffer for any ORFs without UTRs/with particularly short UTRs. However some annotations (gencode 👀) don't label which is 3 and which is 5 prime so only true for most annotations.
Also also, do we want to start with a script that pulls one ORF per transcript, or start with the goal of all canonical ORFs? Can always filter the result so you have something to work on for now?
Hi @IsabelBirds, I think if you're able to get anything together that I can start testing, that would be great. Maybe start with one ORF per transcript for now? and in terms of buffer, whatever is easiest to grab currently is the best place to start.
There'll be a lot of downstream tinkering required I think, so if you're able to get a transcript-centric fasta and gff file to me as soon as you can, I can crack on with finding the next set of issues in the pipeline, and then go back and figure out the 'best' way to set these files up once we know what else will be required to get them running.
Does this help answer your questions at all?
From: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1477470 I downloaded the "small polysome footprints" RPF sample via wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/006/SRR1548656/SRR1548656.fastq.gz
then unzipped it.
I ran subsample_bioseqfile.py
(from subsample-fastq-38
branch, currently a PR) on it to get a subsampled version with only 100,000 reads (instead of the full ~100 million) via :
$ python ~/riboviz/riboviz/pyscripts/subsample_bioseqfile.py -in SRR1548656_copy.fastq -prob 0.001 -out SRR1548656_subsampled.fastq -ftype fastq
subsampling complete; read 99973197 records from SRR1548656_copy.fastq, wrote 100042 records to SRR1548656_subsampled.fastq
... and after checking it contained ~100000 reads, I renamed it as "SRR1548656_100k_reads_subsampled.fastq".
Just pushed a slight update to the config .yaml
, including the subsampled fastq
file so it's ready to try running once the custom fasta
and gff
files are in place. Also pushed commit where I'd added the adapter in, since I realised this hadn't been pushed yet.
Have another meeting with @IsabelBirds tomorrow (Thursday 03rd) after a week where I've been focussing on other tasks and not progressed this one yet.
Hi @FlicAnderson, have uploaded:
All stashed here
Also note - have used standard five_prime_UTR and three_prime_UTR notation, not 3UTR/5UTR
Next meeting with @IsabelBirds is on Thurs 07th Jan (4PM UK time), and I'll be looking through the files she's uploaded and trying to run riboviz on them in the morning beforehand to be able to ask any questions or double-check my understanding on things.
I can also potentially also compare these files and their riboviz results to Drosophila files generated by the .Rmd
file sent to @FlicAnderson by @john-s-f on 16th Dec 2020 following one of the NSF team meetings, but I haven't done anything with that file just yet due to other priorities and festive holidays.
Update:
Using @IsabelBirds' Drosophila melanogaster filtered transcript-centric annotation files and subsampled fastq file, I ran the dataset on Eddie via nextflow, and it completed. Plot outputs a bit odd, some blank, and the number of reads retained to the end (e.g check read_counts.tsv) was quite low, but ran completely and successfully!
Ran full-size fastq file and filtered transcript-centric annotation files on Eddie via nextflow, and job ran to completion!
Run command:
# run nextflow:
nextflow run prep_riboviz.nf -params-file A-Dm_2014/Aspden_2014_RPF_fullsize_config.yaml -work-dir A-Dm_2014/work -ansi-log false
Yaml used:
[fanders6@login01(eddie) riboviz]$ cat ~/riboviz/riboviz/20210107_fullsize_A-Dm_2014/Aspden_2014_RPF_fullsize_config.yaml
provenance:
yaml authors:
- author: Felicity Anderson
email: Felicity.Anderson@ed.ac.uk
- author: Isabel Birds
email:
website:
date run: 2021-01-07
riboviz-version: 2.0 | COMMIT 0f4e932d8c7de032d66f68048989b182730e7d49
GEO: GSE60384
reference: Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq, by Aspden et al. 2014
DOI: https://doi.org/10.7554/eLife.03528
notes:
adapters: CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # Illumina sequencing adapter(s) to remove
aligner: hisat2 # Short read aligner to use. Currently only hisat2 works
asite_disp_length_file: ../../riboviz/riboviz/data/yeast_standard_asite_disp_length.txt # Table of fixed A-site positions by read length
buffer: 250 # Length of flanking region around the CDS
build_indices: TRUE # Build indices for aligner? if TRUE, remake indices from fasta files
codon_positions_file: NULL
count_reads: TRUE # Scan input, temporary and output files and produce counts of reads in each FASTQ, SAM, and BAM file processed?
count_threshold: 64 # Remove genes with a read count below this threshold, when generating statistics and figures
cmd_file: run_riboviz_A-Dm_2014.sh # Bash commands file
dataset: 20210107_fullsize_A-Dm_2014 # Dataset name
dedup_stats: FALSE # Output UMI deduplication statistics?
dedup_umis: FALSE # Extract UMIs and deduplicate reads if TRUE
dir_index: 20210107_fullsize_A-Dm_2014/index # Built indices directory
dir_in: 20210107_fullsize_A-Dm_2014/input # Input directory
dir_logs: 20210107_fullsize_A-Dm_2014/logs # Log files directory
dir_out: 20210107_fullsize_A-Dm_2014/output # Output directory
dir_tmp: 20210107_fullsize_A-Dm_2014/tmp # Intermediate files directory
do_pos_sp_nt_freq: FALSE # Calculate position-specific nucleotide frequency?
extract_umis: FALSE # Extract UMIs if TRUE
features_file: NULL # Features to correlate with ORFs
fq_files: # fastq files to be processed
#subsampledSmlPolysmFtpt: SRR1548656_100k_reads_subsampled.fastq
SmlPolysmFtpt: SRR1548656.fastq.gz # GSM1477470: small polysome footprint
group_umis: FALSE # Summarise UMI groups before and after deduplication, if TRUE
is_riboviz_gff: TRUE # Does the GFF file contain 3 elements per gene - UTR5, CDS, and UTR3
is_test_run: FALSE # Is this a test run
make_bedgraph: TRUE # Output bedgraph files, as TSV, in addition to h5?
max_read_length: 50 # Maximum read length in H5 output
min_read_length: 10 # Minimum read length in H5 output
multiplex_fq_files: NULL # Multiplexed fastq files to be processed, relative to dir_in
num_processes: 8 # Number of processes to parallelize over
orf_fasta_file: ../../riboviz/example-datasets/animalia/drosophila/annotation/Dmel_filtered.fasta # ORF file to align to
orf_gff_file: ../../riboviz/example-datasets/animalia/drosophila/annotation/Dmel_filtered_UTRreplace.gff3 # GFF2/GFF3 file for ORFs
orf_index_prefix: Dmel_filtered # ORF index file prefix, relative to dir_index
primary_id: Name # Primary gene IDs to access the data (YAL001C, YAL003W, etc.)
rpf: TRUE # Is the dataset an RPF or mRNA dataset?
rrna_fasta_file: ../../riboviz/example-datasets/animalia/drosophila/contaminants/Drosophila_melanogaster_fly_rRNA_tRNA.fasta # rRNA file to avoid aligning to
rrna_index_prefix: A-Dm_2014_rRNA # rRNA index file prefix, relative to dir_index
sample_sheet: NULL # Sample sheet, TSV file with, at least, SampleID and TagRead (barcode) columns
secondary_id: NULL # Secondary gene IDs to access the data (COX1, EFB1, etc.)
stop_in_cds: TRUE # Are stop codons part of the CDS annotations in GFF?
t_rna_file: NULL ######### # tRNA estimates (.tsv)
umi_regexp: NULL # UMI-tools-compliant regular expression to extract UMIs
Job Submitted: A-Dm_2014; fullsize fastq.gz; stop_in_cds:TRUE; sharedmem 32; h_vmem 256; ringfenced submission area.
[fanders6@login01(eddie) ~]$ qsub -pe sharedmem 32 -l h_vmem=256G -q *@@ringfence_wallace LogfileStargate/run_A-Dm_2014_fullsize.sh
Your job 9216790 ("A-Dm_32-256_full_stpcdsTRU") has been submitted
Checked the .o
output log for the job: looks good.
[fanders6@login01(eddie) ~]$ cat A-Dm_32-256_full_stpcdsTRU-9216790-node3g10.ecdf.ed.ac.uk.o
Running riboviz on dataset: 20210107_fullsize_A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
N E X T F L O W ~ version 20.04.1
Launching `prep_riboviz.nf` [intergalactic_swartz] - revision: e13128e34f
Validating configuration only
Validated configuration
N E X T F L O W ~ version 20.04.1
Launching `prep_riboviz.nf` [fervent_meitner] - revision: e13128e34f
[93/bb82c4] Submitted process > buildIndicesORF (Dmel_filtered)
[fc/281946] Submitted process > buildIndicesrRNA (A-Dm_2014_rRNA)
[95/54e3e7] Submitted process > cutAdapters (SmlPolysmFtpt)
[29/17a4de] Submitted process > hisat2rRNA (SmlPolysmFtpt)
[80/db67d3] Submitted process > hisat2ORF (SmlPolysmFtpt)
[1f/29f3f2] Submitted process > trim5pMismatches (SmlPolysmFtpt)
[45/6954a8] Submitted process > samViewSort (SmlPolysmFtpt)
[bf/2d4ff7] Submitted process > outputBams (SmlPolysmFtpt)
[59/886dc1] Submitted process > makeBedgraphs (SmlPolysmFtpt)
[c4/ee363d] Submitted process > bamToH5 (SmlPolysmFtpt)
[1d/575777] Submitted process > generateStatsFigs (SmlPolysmFtpt)
Finished processing sample: SmlPolysmFtpt
[dd/1bb9d0] Submitted process > renameTpms (SmlPolysmFtpt)
[ee/c73dcd] Submitted process > collateTpms (SmlPolysmFtpt)
[0d/786ed1] Submitted process > countReads
Workflow finished! (OK)
WARN: Can't update history file: .nextflow/history
nextflow riboviz 20210107_fullsize_A-Dm_2014 data run complete
Job completion email info (has handy cluster run/allocation details):
Job 9216790 (A-Dm_32-256_full_stpcdsTRU) Complete User = fanders6 Queue = eddie@node3g10.ecdf.ed.ac.uk Host = node3g10.ecdf.ed.ac.uk Start Time = 01/07/2021 21:05:05.016 End Time = 01/07/2021 22:13:40.450 User Time = 02:45:17 System Time = 00:42:42 Wallclock Time = 01:08:35 CPU = 03:27:59 Max vmem = 75.287G Max rss = NA Exit Status = 0
Checked disk usage of the dataset folder:
[fanders6@login01(eddie) 20210107_fullsize_A-Dm_2014]$ du -h
273M ./output/SmlPolysmFtpt
273M ./output
0 ./tmp/SmlPolysmFtpt
0 ./tmp
4.0K ./index
4.5G ./input
3.7G ./work/80/db67d34d5d589b38d02e75099ac12a
3.7G ./work/80
26M ./work/59/886dc1490f707559803c0e3c7964f9
26M ./work/59
640K ./work/dd/1bb9d096037877bb09d271c7d3ea53
640K ./work/dd
119M ./work/c4/ee363d489808ba17facbf572a82f25
119M ./work/c4
2.8M ./work/1d/575777f3ce4706136afc668107adac
2.8M ./work/1d
7.7M ./work/fc/281946968014d2fa06c5b3c7c114a6
7.7M ./work/fc
640K ./work/0d/786ed11d2ccabd6f1ff3c1beb1232a
640K ./work/0d
20G ./work/29/17a4de00affb392cc51dc8b38bf6d1
20G ./work/29
128M ./work/bf/2d4ff78cff7a2af56e2008e6d06126
128M ./work/bf
128M ./work/45/6954a86fb10f5229a46f123d5a1d67
128M ./work/45
1.2G ./work/1f/29f3f259686332bde97f0aba7a261f
1.2G ./work/1f
640K ./work/ee/c73dcd81af3f415c00d53940efd049
640K ./work/ee
12G ./work/95/54e3e7677f93be621dea6f6b07daf3
12G ./work/95
157M ./work/93/bb82c40b555344d895b77c724bb943
157M ./work/93
37G ./work
42G .
273Mb output doesn't seem very big tho? :s
Checked read_counts.tsv
:
[fanders6@login01(eddie) output]$ cat read_counts.tsv
# Created by: RiboViz
# Date: 2021-01-07 22:13:36.720781
# Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/count_reads.py
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/count_reads.py
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 07:14:12-08:00
SampleName Program File NumReads Description
SmlPolysmFtpt input /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/input/SRR1548656.fastq.gz 99973197 input
SmlPolysmFtpt cutadapt /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/trim.fq 98802201 Reads after removal of sequencing library adapters
SmlPolysmFtpt hisat2 /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/nonrRNA.fq 24023066 rRNA or other contaminating reads removed by alignment to rRNA index files
SmlPolysmFtpt hisat2 /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/rRNA_map.sam 98802201 Reads with rRNA and other contaminating reads removed by alignment to rRNA index files
SmlPolysmFtpt hisat2 /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/unaligned.fq 19366056 Unaligned reads removed by alignment of remaining reads to ORFs index files
SmlPolysmFtpt hisat2 /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/orf_map.sam 6333028 Reads aligned to ORFs index files
SmlPolysmFtpt riboviz.tools.trim_5p_mismatch /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/orf_map_clean.sam 6332948 Reads after trimming of 5' mismatches and removal of those with more than 2 mismatches
100 million -> 6 million... big reduction!
TPMs collated:
[fanders6@login01(eddie) output]$ head TPMs_collated.tsv
# Created by: RiboViz
# Date: 2021-01-07 21:37:54
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/rscripts/collate_tpms.R
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 15:14:12 GMT
ORF SmlPolysmFtpt
128up-RA 63.9
14-3-3epsilon-RA 2049.7
14-3-3zeta-RD 1402.6
140up-RA 2.8
18w-RA 42.6
[fanders6@login01(eddie) output]$ tail TPMs_collated.tsv
zf30C-RA 62.4
zfh1-RE 155.5
zfh2-RB 0.2
zip-RB 0
zld-RD 0.3
zormin-RE 0
zpg-RA 0
zuc-RA 8.6
zyd-RI 0
zye-RA 2.5
tpms.tsv
from output/SmlPolysmFtpt
:
[fanders6@login01(eddie) SmlPolysmFtpt]$ head -n 20 tpms.tsv
# Created by: RiboViz
# Date: 2021-01-07 21:37:51
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/rscripts/generate_stats_figs.R
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 15:14:12 GMT
ORF readcount rpb tpm
128up-RA 224 0.194107452339688 63.9383798200976
14-3-3epsilon-RA 5202 6.22248803827751 2049.66784542179
14-3-3zeta-RD 3381 4.25818639798489 1402.63310850463
140up-RA 7 0.00840336134453781 2.76804060295591
18w-RA 544 0.129369797859691 42.6140015393469
26-29-p-RA 343 0.202121390689452 66.5781457223637
2mit-RA 47 0.0135329686150302 4.45771698598687
312-RA 82 0.10608020698577 34.9424840926827
4E-T-RB 0 0 0
5-HT1A-RB 8 0.00313479623824451 1.03259194906506
5-HT1B-RA 1 0.000526038926880589 0.173275555892558
5-HT2A-RA 0 0 0
5-HT2B-RE 11 0.00380491179522657 1.25332589044251
5-HT7-RA 0 0 0
5PtaseI-RB 16 0.0128 4.21627944642244
A huge step forward!
There does seem to be an issue in the outputs though: the 0 vs highest peaks of read counts aren't matching, so there's potentially an off-by- error, which may be due to stopinCDS being set to TRUE, as opposed to FALSE, which we usually run? Need to investigate this further!
Need to arrange a new meeting time with @IsabelBirds to chat through the output and pick it apart and see if it's what's expected!
@amandamok re: hackathon day 2 meeting request:
Was it just the .SAM
files you needed from this dataset run?
I can send you:
orf_map.sam
- 1.2 GB - (reads aligned to ORFs index files)orf_map_clean.sam
- 1.2 GB - (reads after trimming of 5' mismatches and removal of those with more than 2 mismatches)
... depending on what you need?@FlicAnderson I took a look at footprint densities around the start and stop codons to determine A site offset rules.
Output looks quite different to what I normally expect from yeast datasets (below):
Diagnostic plots generated from orf_map.sam
(similar for orf_map_clean.sam
) below:
I suspect the difference in footprint lengths is due to a difference in the experimental protocol: "We purified mRNAs in small polysomes, away from monosomes (80S), ribosomal subunits (40S, 60S), and large polysomes." (taken from Methods section of Aspden et al. (2014).
P site assignment in Apsden et al. (2014) was performed as: "For a given open reading frame, the corresponding P-site position of filtered RPF reads (28–32 nt) was designated as follows: +12 offset for 28 and 29 nt, +13 for 30 to 31 nt, and +14 for 32 nt RPF (Chew et al., 2013; Bazzini et al., 2014)." This would correspond to A site assignments as +15 for 28 and 29 nt, +16 for 30 to 31 nt, and +17 for 32 nt RPF; and also correspondingly the distance between read A sites and 3' ends would be: 10 nt for 28 nt RPFs, 11 nt for 29 and 30 nt RPFs, and 12 for 31 and 32 nt RPFs. However, I don't see much support for those assignment rules at the start and stop codons. Not sure how to proceed with A site assignment, given these diagnostic plots ...
These are cool plots! Definitely has me wondering if there's a problem with the adapter sequence or trimming.
I think it's time to set up a call with @IsabelBirds and Julie to discuss.
These are the PDF output plots riboviz generates for the dataset: 3ntframe_propbygene.pdf
We met with Julie, @IsabelBirds , and @FlicAnderson on 15th Jan. Julie agrees that these outputs look consistent with her previous analyses, reflecting an early ribosome profiling dataset with poor frame information.
check_gff_fasta.py
If that all looks good, we can close this issue ticket and move on to a new issue, with a recent and better Drosophila dataset, to compare framing analysis.
Merged in pull request #44 to add Isabel's provenance file to the drosophila-aspden-10
branch.
Still to do: Double-check .fa/.gff3 with check_gff_fasta.py Check gene-wise outputs (TPMs) against a read density file from Julie, by scatter plot
Julie has sent me the relevant file for checking the outputs, so it remains to do the plotting & checking Also I can run the shiny new check-fasta-gff code (https://github.com/riboviz/riboviz/pull/270) over these .fa & .gff3 files.
Following on from @ewallace 's request to allow him to compare TPMs to read density information from Julie Aspden (email of 18th Jan 2021, also sent to @ewallace), these are the TPMs requested. Drosophila_tpms.zip
This looks good. I compared tpms from riboviz analysis with rpkms (differently normalized but same idea) from GEO:GSE60384, because the version on GEO already had the same ORF/gene ids.
There's a lot of agreement - most genes have highly correlated values across 4 orders of magnitude.
It would be nice to spot-check ORFs with big differences and see if there's any obvious systematic reason for differential counting (read length, mitochondrial genes, etc). @IsabelBirds could you have a look?
The code used to join and generate the plots, in R markdown format, is attached Drosophila_ORF_output_check_smallpolysomes.Rmd.txt. That should be useful for follow-up work (change file extension back to .txt to run it in Rstudio).
I think we can close this issue, it's successful.
As a final check, here's the output for the latest version of check_fasta_gff
from develop
branch:
[fanders6@login01(eddie) ~]$ cat gff_fa_A-Dm_2014-9920713-node3g10.ecdf.ed.ac.uk.o
Running check_fasta_gff on transcriptome files from dataset: A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
Created by: RiboViz
Date: 2021-02-18 19:31:08.055514
Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
Version: commit 7e62c9f8da4f66f9c18aca30394f44fa4a7fa1ef date 2021-01-20 10:18:00-08:00
Configuration:
fasta_file A-Dm_2014/Dmel_filtered.fasta
gff_file A-Dm_2014/Dmel_filtered_UTRreplace.gff3
start_codons ['ATG']
Issue summary:
Issue Count
NoStopCodon 6953
NoStartCodon 6948
InternalStopCodon 6596
IncompleteFeature 33
SequenceNotInGFF 0
SequenceNotInFASTA 0
MultipleCDS 0
DuplicateFeatureId 0
NoIdName 0
DuplicateFeatureIds 0
check_fasta_gff for A-Dm_2014 complete
And the .tsv output file with further details (saved as .txt to allow upload here): features_issues.txt
I've not checked what the total number of genes is 'properly' by loading the .gff into anything, but if we say that each gene has ~ 1 x each of 5' UTR, CDS & 3' UTR, and we subtract the first 3 rows of comments in the gff file and remove this from total number of lines in the file (41,487), then this gives about ~41,484
features of any kind from ~ 13,828
genes.
Given that 41k estimate of the number of features, does having ~7k errors/issues for each of the codon-related issues seem reasonable or high? @IsabelBirds @ewallace
(I realise you might not have a chance to check this if you're busy @IsabelBirds, just tagging you in in case you have time for a quick opinion, and also for future info)
Knowing about this (check_fasta_gff.py
) might also be useful for @3mma-mack since she's putting together a transcriptome for a different dataset.
@FlicAnderson I'm so glad you checked that. Yes, we should run check_fasta_gff.py
on all new transcriptome sets.
It looks like all the transcripts on the minus strand are wrong in that they've been included in reverse complement. I came to this conclusion because:
features_issues.txt
file, Ir21a-RB
and Cda5-RA
are on the minus strand, have TTA
as start codon which is reverse complement of TAA
, and CAT
as stop which is reverse complement of ATG
So we need to fix this by making a new transcriptome fasta / gff that appropriately takes the reverse complement of transcripts on the minus strand.
Arranged a meeting with Isabel (26th Mar) where we'll come up with a plan for next-steps on fixing the transcriptome files.
@FlicAnderson I think I've fixed this - the transcripts on the minus strand should all be reverse complemented now. https://github.com/IsabelBirds/Stuff_to_share/tree/main/Riboviz_work/2_Processed_data/Dmel_genomes - filtered versions for one CDS per gene.
I'll update the provenance file once we're happy it's sorted :)
Hi @IsabelBirds - sorry it's taken so long to check this, I've run these new files through check-fasta-gff again and here's the new output:
[fanders6@login02(eddie) ~]$ cat gff_fa_A-Dm_2014-11333517-node3g10.ecdf.ed.ac.uk.o
Running check_fasta_gff on transcriptome files from dataset: A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
Created by: RiboViz
Date: 2021-04-19 14:07:55.957977
Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
Version: commit 5f23a69d0dee1a2917440ce9eec86eadcd195eda date 2021-04-15 09:11:14-07:00
Configuration:
fasta_file A-Dm_2014/Dmel_filtered.fasta
gff_file A-Dm_2014/Dmel_filtered_UTRreplace.gff3
start_codons ['ATG']
Issue summary:
Issue Count
SequenceNotInFASTA 4086
InternalStopCodon 55
NoStopCodon 27
IncompleteFeature 24
NoStartCodon 11
NoIdName 0
SequenceNotInGFF 0
DuplicateFeatureId 0
DuplicateFeatureIds 0
MultipleCDS 0
check_fasta_gff for A-Dm_2014 complete
There's still quite a lot of sequences in the GFF which don't seem to be in the FASTA (if I'm correctly interpreting SequenceNotInFASTA
), but in terms of the number of sequences with issues (e.g. no start/stop codon), it's far far lower and I think this is looking much better.
Do you have any ideas on why SequenceNotInFASTA
might be quite high?
If you think this is an acceptable level of issues, I can proceed by re-running the full dataset with these files so we can double-check what the output plots look like now.
Thanks @FlicAnderson - I'll take a look this week and see why we're getting so many SequenceNotInFasta - not sure why that's coming up!
We would like to add Drosophila melanogaster data to work with Julie Aspden and lab, starting with data from Extensive translation of small Open Reading Frames, data in GEO:GSE60384.
Requires some steps:
drosophila-aspden-10
, and work in that.- [ ] figure out how to deal with bicistronic transcripts.[EDIT: agreed in Jan 2021 this should be approached on another issue ticket!]Requesting help from @IsabelBirds in the Aspden lab.