Add Aspden lab Drosophila example dataset

ewallace commented 4 years ago

We would like to add Drosophila melanogaster data to work with Julie Aspden and lab, starting with data from Extensive translation of small Open Reading Frames, data in GEO:GSE60384.

Requires some steps:

[x] create new branch drosophila-aspden-10, and work in that.
[x] add contaminants fasta file with rRNA and tRNA from flybase/ribogalaxy
[ ] create transcriptome (or ORFeome) gff and fasta annotation files, details. This is likely to be the pain point.
[x] create a subsampled example fastq file for faster debugging, using riboviz.subsample_bioseqfile.py
[x] write a config.yaml file with all parameters needed including adapter and multiplexing information
[x] test and troubleshoot on subsampled fastq file
[ ] test on full-size
[ ] compare QC and substantial outputs

~~- [ ] figure out how to deal with bicistronic transcripts.~~ [EDIT: agreed in Jan 2021 this should be approached on another issue ticket!]

Requesting help from @IsabelBirds in the Aspden lab.

FlicAnderson commented 4 years ago

Meeting with Isabel organised for Friday 02 Oct ~10:30am to discuss the dataset, and ask/answer any basic set-up questions.

FlicAnderson commented 4 years ago

Edited initial comment to fix the check-boxes, since they're super-handy.

FlicAnderson commented 4 years ago

I've created branch drosophila-aspden-10 and added some file structure within that:

example-datasets/

animalia/
- drosophila/
  - contaminants/
  - annotation/

Within contaminants, I've added: 1) Drosophila_melanogaster_fly_rRNA.fasta This file was created as follows: The Drosophila melanogaster 'D_mel_rRNA.fasta' file for rRNA was downloaded from Ribogalaxy Data Library Original Filename: D_mel_rRNA.fasta Downloaded from: https://ribogalaxy.ucc.ie/library_common/download_dataset_from_folder?library_id=03501d7626bd192f&cntrller=library&use_panels=False&id=52e496b945151ee8 File Information: https://ribogalaxy.ucc.ie/library_common/ldda_info?library_id=03501d7626bd192f&show_deleted=False&cntrller=library&folder_id=03501d7626bd192f&use_panels=False&id=52e496b945151ee8 The file was then moved and renamed. Current Filename: Drosophila_melanogaster_fly_rRNA.fasta

2) Drosophila_melanogaster_fly_tRNA_r6-35.fasta This file was created as follows: The Drosophila melanogaster genome release 'dmel_r6.35' file for tRNA was downloaded from Flybase.net Original Filename:dmel-all-tRNA-r6.35.fasta.gz Downloaded from: ftp://ftp.flybase.net/releases/FB2020_04/dmel_r6.35/fasta/dmel-all-tRNA-r6.35.fasta.gz The file was then unzipped and the fasta file moved and renamed. Current Filename: Drosophila_melanogaster_fly_tRNA_r6-35.fasta

3) 2x _provenance.txt files; one for each of those files, containing the information I've added here.

:question: Given that riboviz takes a single contaminants file, I guess that there might be a bit of work involved in concatonating both files into one? Is this correct @ewallace ? Hopefully their formats will be similar, but that's something I can check.

:question: I'll definitely check in with Isabel tomorrow to see if these are the best files to use for rRNA and tRNA, or if I've pulled out the wrong ones.

:question: There's an overwhelming amount of different options for fasta/gff files on the genome release on flybase.net, but not rRNA - did I overlook one? Have got the Ribogalaxy data library version, but I'll need to do more work to find out where that came from.

FlicAnderson commented 4 years ago

@isabelbirds Tagging you in - hope you get this!

IsabelBirds commented 4 years ago

Notes from chat:

[x] tRNA and rRNA correct files.
riboviz is transcriptome/canonical ORFs only at the moment.
Does have an option to include RNA-seq, not tested recently. This may be helpful for resolving multi-mapping reads, allows to check overlap of splice sites. Other software tackles this by removing ORFs that have a given % only supported by multimapping coverage.

❓ @FlicAnderson by buffer region do you mean 3' and 5' UTR lengths? If so, solution could be to use the 3/5'UTR fastas from Flybase to get buffer length?

ORFeome files:

gff
- ftp://ftp.flybase.net/releases/FB2020_04/dmel_r6.35/gff/dmel-all-r6.35.gff.gz
- Contains all chromosome arms.
fasta file
- ftp://ftp.flybase.net/releases/FB2020_04/dmel_r6.35/fasta/dmel-all-CDS-r6.35.fasta.gz
- The contiguous protein coding sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

Non ATG starts:

From FlyBase wiki - “Non-ATG starts and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data. There are currently 11 genes annotated with a non-ATG start and one with a translational frameshift (Oda). To date, all non-canonical starts vary from ‘AUG’ by one nucleotide.”

Dicistronic transcripts:

385 annotated in current D. melanogaster genome.
J Aspden previously validated 3 of these, also found using these data.

Ribo/RNA-seq data:

28–34 nt ribosome footprints
50–80 nt mRNA fragments
Gel purified and prepared as previously described (Ingolia et al., 2009, 2011, 2012) for Next Generation Sequencing. Libraries were sequenced on Illumina HiSeq2000 and MiSeq machines with 50 bp SingleEnd read protocol.
Adapter sequence - CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

To do:

[x] Check barcode and UMI q.

IsabelBirds commented 4 years ago

Update - Aspden dataset doesn't contain UMIs.

FlicAnderson commented 4 years ago

@IsabelBirds Thanks for checking the UMI stuff and getting those other details back to me. Super useful!

RE: buffer region, we describe it as "Length of flanking region around the CDS" and we've treated this as UTR length, yep. But, we've been presently using 1 value for this (E.g. using '250' for S.cerevisiae to match our current 'best guess' annotation files.)

In terms of using the 3' and 5' UTR lengths from the Flybase annotation files, that's a good shout but I don't think that would work in our code as-is, since we're expecting this one value (applying it on both sides, and for everything!).

I'm not 100% clear what our existing plans are for using specific UTR lengths for each gene, as opposed to uniformly applying this one generic value, which is where we currently are. @ewallace is this something you have any ideas on?

IsabelBirds commented 4 years ago

@FlicAnderson I was thinking that using 3' and 5' UTR annotations might actually work as good generic fix, not necessarily to get specific lengths for each transcript- unless it's a value that doesn't actually make that much difference to results.

i.e. for a given species take these files, plot distribution of UTR lengths to check for any weirdness/outliers, and take the overall average for the buffer length value. Removes some of the guessing if you're using a public dataset?

FlicAnderson commented 4 years ago

@IsabelBirds Oooh, that's a good idea. I was thinking about doing that if we needed a sort of temporary number to work with in the short term.

ewallace commented 4 years ago

I'm not 100% clear what our existing plans are for using specific UTR lengths for each gene, as opposed to uniformly applying this one generic value, which is where we currently are. @ewallace is this something you have any ideas on?

Riboviz itself doesn't care how you choose UTR lengths, whether biologically realistic or not. It currently will quantify one ORF per sequence supplied in the "transcript fasta file", and the ORF positions are described in the gff. This could be a canonical ORF with annotated UTRs from data, or any ORF you want with any buffer size you want, e.g. minimally a 18nt-padding at each and to keep track of fragments at start and stop codons. So which makes sense here depends on ease of setup and the scientific question.

I'd suggest first getting it working on Drosophila for just canonical ORFs with fixed-length padding if that is easier. If you have an annotation with "real" UTR lengths and it's easier to do that, it's of course better. Then after we run on full-length real transcripts, we can figure out how to adjust the code to quantify multiple ORFs per transcript, which is issue riboviz#225.

FlicAnderson commented 4 years ago

Tagging in the issue riboviz/#83 because this is a key thing to be able to communicate to users :) @ewallace will consider if there's a better naming option

FlicAnderson commented 4 years ago

Also a reminder to self to create some downsampled data to work with FIRST, to get things working asap and help troubleshooting.

FlicAnderson commented 4 years ago

Concatenated the rRNA and tRNA files into one contaminants file: Drosophila_melanogaster_fly_rRNA_tRNA.fasta and added Drosophila_melanogaster_fly_rRNA_tRNA_provenance.txt with details on how it was created and which rRNA and tRNA files were used.

FlicAnderson commented 3 years ago

@FlicAnderson to make a plan on how to approach getting the GFF and Fasta files into riboviz format.

Also worth me setting this dataset up on Eddie so I can run things on there, rather than working from my local machine.

FlicAnderson commented 3 years ago

Worth noting that running check_fasta_gff.py locally took a long time, which is why I had started setting up to do this on Eddie, freeing up my comp for other things. I ran wget to download both files into a folder on my drive, since I don't expect them to increase in size just from running check_fasta_gff.py :grin: I'll obviously put them into a sensible place when I attempt to run them.

I don't know if it's worthwhile committing the fasta and gff files to the example-datasets repo until I know more about the formatting and how it works or doesn't work with riboviz, because presumably that would add size to the repo unnecessarily.

I also need to subsample the fastq file to get something smaller to run on. This is something that I could probably do reasonably quickly once I concentrate on doing that. This might be a task for Tuesday afternoon before/after the dev meeting (03rd Nov).

Planning to work on the fasta and gff formats on afternoon Wednesday 04th Nov, ahead of next meeting with Isabel on Thurs 05th at 1600.

IsabelBirds commented 3 years ago

@FlicAnderson from what I can tell, the current gff formatting script creates buffers directly up and downstream of the CDS, i.e. regardless of intron/exon structure. Is this approach ok or do I need to ensure that I'm only pulling buffer from exon sequence? It is a lot more straightforward to carry on like this, but not sure how this would affect the rest of the pipeline.

IsabelBirds commented 3 years ago

Also following on from what @ewallace said, for the majority of ORFs it is easier to pull the canonical UTRs and add buffer for any ORFs without UTRs/with particularly short UTRs. However some annotations (gencode 👀) don't label which is 3 and which is 5 prime so only true for most annotations.

Also also, do we want to start with a script that pulls one ORF per transcript, or start with the goal of all canonical ORFs? Can always filter the result so you have something to work on for now?

FlicAnderson commented 3 years ago

Hi @IsabelBirds, I think if you're able to get anything together that I can start testing, that would be great. Maybe start with one ORF per transcript for now? and in terms of buffer, whatever is easiest to grab currently is the best place to start.

There'll be a lot of downstream tinkering required I think, so if you're able to get a transcript-centric fasta and gff file to me as soon as you can, I can crack on with finding the next set of issues in the pipeline, and then go back and figure out the 'best' way to set these files up once we know what else will be required to get them running.

Does this help answer your questions at all?

FlicAnderson commented 3 years ago

Created Downsampled Fastq File

From: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1477470 I downloaded the "small polysome footprints" RPF sample via wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/006/SRR1548656/SRR1548656.fastq.gz then unzipped it.

I ran subsample_bioseqfile.py (from subsample-fastq-38 branch, currently a PR) on it to get a subsampled version with only 100,000 reads (instead of the full ~100 million) via :

$ python ~/riboviz/riboviz/pyscripts/subsample_bioseqfile.py -in SRR1548656_copy.fastq -prob  0.001 -out SRR1548656_subsampled.fastq -ftype fastq
subsampling complete; read 99973197 records from SRR1548656_copy.fastq, wrote 100042 records to SRR1548656_subsampled.fastq

... and after checking it contained ~100000 reads, I renamed it as "SRR1548656_100k_reads_subsampled.fastq".

FlicAnderson commented 3 years ago

Just pushed a slight update to the config .yaml, including the subsampled fastq file so it's ready to try running once the custom fasta and gff files are in place. Also pushed commit where I'd added the adapter in, since I realised this hadn't been pushed yet.

FlicAnderson commented 3 years ago

Have another meeting with @IsabelBirds tomorrow (Thursday 03rd) after a week where I've been focussing on other tasks and not progressed this one yet.

IsabelBirds commented 3 years ago

Hi @FlicAnderson, have uploaded:

D.mel ORFeome GFF and fasta files (these have been updated - had some rogue spaces!)
D.mel ORFeome GFF and fasta files filtered to one CDS per gene
D.mel GFF and fasta prep and filtering scripts. This also works for the mouse gencode annotation w/o any changes. It isn't general enough to work with S288C yet - needs fasta ID editing and genome padding steps.

All stashed here

Also note - have used standard five_prime_UTR and three_prime_UTR notation, not 3UTR/5UTR

FlicAnderson commented 3 years ago

Next meeting with @IsabelBirds is on Thurs 07th Jan (4PM UK time), and I'll be looking through the files she's uploaded and trying to run riboviz on them in the morning beforehand to be able to ask any questions or double-check my understanding on things.

I can also potentially also compare these files and their riboviz results to Drosophila files generated by the .Rmd file sent to @FlicAnderson by @john-s-f on 16th Dec 2020 following one of the NSF team meetings, but I haven't done anything with that file just yet due to other priorities and festive holidays.

FlicAnderson commented 3 years ago

Update:

[x] test and troubleshoot on subsampled fastq file

Using @IsabelBirds' Drosophila melanogaster filtered transcript-centric annotation files and subsampled fastq file, I ran the dataset on Eddie via nextflow, and it completed. Plot outputs a bit odd, some blank, and the number of reads retained to the end (e.g check read_counts.tsv) was quite low, but ran completely and successfully!

[x] test on full-size

Ran full-size fastq file and filtered transcript-centric annotation files on Eddie via nextflow, and job ran to completion!

Run command:

# run nextflow: 
nextflow run prep_riboviz.nf -params-file A-Dm_2014/Aspden_2014_RPF_fullsize_config.yaml -work-dir A-Dm_2014/work -ansi-log false

Yaml used:

[fanders6@login01(eddie) riboviz]$ cat ~/riboviz/riboviz/20210107_fullsize_A-Dm_2014/Aspden_2014_RPF_fullsize_config.yaml 
provenance:
  yaml authors:
  - author: Felicity Anderson
    email: Felicity.Anderson@ed.ac.uk
  - author: Isabel Birds
    email: 
  website:
  date run: 2021-01-07
  riboviz-version: 2.0 | COMMIT 0f4e932d8c7de032d66f68048989b182730e7d49
  GEO: GSE60384
  reference: Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq, by Aspden et al. 2014
  DOI: https://doi.org/10.7554/eLife.03528
  notes:

adapters: CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # Illumina sequencing adapter(s) to remove
aligner: hisat2 # Short read aligner to use. Currently only hisat2 works
asite_disp_length_file: ../../riboviz/riboviz/data/yeast_standard_asite_disp_length.txt # Table of fixed A-site positions by read length
buffer: 250 # Length of flanking region around the CDS
build_indices: TRUE # Build indices for aligner? if TRUE, remake indices from fasta files
codon_positions_file: NULL
count_reads: TRUE # Scan input, temporary and output files and produce counts of reads in each FASTQ, SAM, and BAM file processed?
count_threshold: 64 # Remove genes with a read count below this threshold, when generating statistics and figures
cmd_file: run_riboviz_A-Dm_2014.sh # Bash commands file
dataset: 20210107_fullsize_A-Dm_2014 # Dataset name
dedup_stats: FALSE # Output UMI deduplication statistics?
dedup_umis: FALSE # Extract UMIs and deduplicate reads if TRUE
dir_index: 20210107_fullsize_A-Dm_2014/index # Built indices directory
dir_in: 20210107_fullsize_A-Dm_2014/input # Input directory
dir_logs: 20210107_fullsize_A-Dm_2014/logs # Log files directory
dir_out: 20210107_fullsize_A-Dm_2014/output # Output directory
dir_tmp: 20210107_fullsize_A-Dm_2014/tmp # Intermediate files directory
do_pos_sp_nt_freq: FALSE # Calculate position-specific nucleotide frequency?
extract_umis: FALSE # Extract UMIs if TRUE
features_file: NULL # Features to correlate with ORFs
fq_files: # fastq files to be processed
 #subsampledSmlPolysmFtpt: SRR1548656_100k_reads_subsampled.fastq
 SmlPolysmFtpt: SRR1548656.fastq.gz  # GSM1477470: small polysome footprint
group_umis: FALSE # Summarise UMI groups before and after deduplication, if TRUE
is_riboviz_gff: TRUE # Does the GFF file contain 3 elements per gene - UTR5, CDS, and UTR3
is_test_run: FALSE # Is this a test run
make_bedgraph: TRUE # Output bedgraph files, as TSV, in addition to h5?
max_read_length: 50 # Maximum read length in H5 output
min_read_length: 10 # Minimum read length in H5 output
multiplex_fq_files: NULL # Multiplexed fastq files to be processed, relative to dir_in
num_processes: 8 # Number of processes to parallelize over
orf_fasta_file: ../../riboviz/example-datasets/animalia/drosophila/annotation/Dmel_filtered.fasta # ORF file to align to
orf_gff_file: ../../riboviz/example-datasets/animalia/drosophila/annotation/Dmel_filtered_UTRreplace.gff3 # GFF2/GFF3 file for ORFs
orf_index_prefix: Dmel_filtered   # ORF index file prefix, relative to dir_index
primary_id: Name # Primary gene IDs to access the data (YAL001C, YAL003W, etc.)
rpf: TRUE # Is the dataset an RPF or mRNA dataset?
rrna_fasta_file: ../../riboviz/example-datasets/animalia/drosophila/contaminants/Drosophila_melanogaster_fly_rRNA_tRNA.fasta # rRNA file to avoid aligning to
rrna_index_prefix: A-Dm_2014_rRNA # rRNA index file prefix, relative to dir_index
sample_sheet: NULL # Sample sheet, TSV file with, at least, SampleID and TagRead (barcode) columns
secondary_id: NULL # Secondary gene IDs to access the data (COX1, EFB1, etc.)
stop_in_cds: TRUE # Are stop codons part of the CDS annotations in GFF?
t_rna_file: NULL ######### # tRNA estimates (.tsv)
umi_regexp: NULL # UMI-tools-compliant regular expression to extract UMIs

Job Submitted: A-Dm_2014; fullsize fastq.gz; stop_in_cds:TRUE; sharedmem 32; h_vmem 256; ringfenced submission area.

[fanders6@login01(eddie) ~]$ qsub -pe sharedmem 32 -l h_vmem=256G -q *@@ringfence_wallace LogfileStargate/run_A-Dm_2014_fullsize.sh 
Your job 9216790 ("A-Dm_32-256_full_stpcdsTRU") has been submitted

Checked the .o output log for the job: looks good.

[fanders6@login01(eddie) ~]$ cat A-Dm_32-256_full_stpcdsTRU-9216790-node3g10.ecdf.ed.ac.uk.o
Running riboviz on dataset: 20210107_fullsize_A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [intergalactic_swartz] - revision: e13128e34f
Validating configuration only
Validated configuration
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [fervent_meitner] - revision: e13128e34f
[93/bb82c4] Submitted process > buildIndicesORF (Dmel_filtered)
[fc/281946] Submitted process > buildIndicesrRNA (A-Dm_2014_rRNA)
[95/54e3e7] Submitted process > cutAdapters (SmlPolysmFtpt)
[29/17a4de] Submitted process > hisat2rRNA (SmlPolysmFtpt)
[80/db67d3] Submitted process > hisat2ORF (SmlPolysmFtpt)
[1f/29f3f2] Submitted process > trim5pMismatches (SmlPolysmFtpt)
[45/6954a8] Submitted process > samViewSort (SmlPolysmFtpt)
[bf/2d4ff7] Submitted process > outputBams (SmlPolysmFtpt)
[59/886dc1] Submitted process > makeBedgraphs (SmlPolysmFtpt)
[c4/ee363d] Submitted process > bamToH5 (SmlPolysmFtpt)
[1d/575777] Submitted process > generateStatsFigs (SmlPolysmFtpt)
Finished processing sample: SmlPolysmFtpt
[dd/1bb9d0] Submitted process > renameTpms (SmlPolysmFtpt)
[ee/c73dcd] Submitted process > collateTpms (SmlPolysmFtpt)
[0d/786ed1] Submitted process > countReads
Workflow finished! (OK)
WARN: Can't update history file: .nextflow/history
nextflow riboviz 20210107_fullsize_A-Dm_2014 data run complete

Job completion email info (has handy cluster run/allocation details):

Job 9216790 (A-Dm_32-256_full_stpcdsTRU) Complete User = fanders6 Queue = eddie@node3g10.ecdf.ed.ac.uk Host = node3g10.ecdf.ed.ac.uk Start Time = 01/07/2021 21:05:05.016 End Time = 01/07/2021 22:13:40.450 User Time = 02:45:17 System Time = 00:42:42 Wallclock Time = 01:08:35 CPU = 03:27:59 Max vmem = 75.287G Max rss = NA Exit Status = 0

Checked disk usage of the dataset folder:

[fanders6@login01(eddie) 20210107_fullsize_A-Dm_2014]$ du -h
273M    ./output/SmlPolysmFtpt
273M    ./output
0   ./tmp/SmlPolysmFtpt
0   ./tmp
4.0K    ./index
4.5G    ./input
3.7G    ./work/80/db67d34d5d589b38d02e75099ac12a
3.7G    ./work/80
26M ./work/59/886dc1490f707559803c0e3c7964f9
26M ./work/59
640K    ./work/dd/1bb9d096037877bb09d271c7d3ea53
640K    ./work/dd
119M    ./work/c4/ee363d489808ba17facbf572a82f25
119M    ./work/c4
2.8M    ./work/1d/575777f3ce4706136afc668107adac
2.8M    ./work/1d
7.7M    ./work/fc/281946968014d2fa06c5b3c7c114a6
7.7M    ./work/fc
640K    ./work/0d/786ed11d2ccabd6f1ff3c1beb1232a
640K    ./work/0d
20G ./work/29/17a4de00affb392cc51dc8b38bf6d1
20G ./work/29
128M    ./work/bf/2d4ff78cff7a2af56e2008e6d06126
128M    ./work/bf
128M    ./work/45/6954a86fb10f5229a46f123d5a1d67
128M    ./work/45
1.2G    ./work/1f/29f3f259686332bde97f0aba7a261f
1.2G    ./work/1f
640K    ./work/ee/c73dcd81af3f415c00d53940efd049
640K    ./work/ee
12G ./work/95/54e3e7677f93be621dea6f6b07daf3
12G ./work/95
157M    ./work/93/bb82c40b555344d895b77c724bb943
157M    ./work/93
37G ./work
42G .

273Mb output doesn't seem very big tho? :s

Checked read_counts.tsv:

[fanders6@login01(eddie) output]$ cat read_counts.tsv 
# Created by: RiboViz
# Date: 2021-01-07 22:13:36.720781
# Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/count_reads.py
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/count_reads.py
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 07:14:12-08:00
SampleName  Program File    NumReads    Description
SmlPolysmFtpt   input   /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/input/SRR1548656.fastq.gz  99973197    input
SmlPolysmFtpt   cutadapt    /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/trim.fq  98802201    Reads after removal of sequencing library adapters
SmlPolysmFtpt   hisat2  /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/nonrRNA.fq   24023066    rRNA or other contaminating reads removed by alignment to rRNA index files
SmlPolysmFtpt   hisat2  /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/rRNA_map.sam 98802201    Reads with rRNA and other contaminating reads removed by alignment to rRNA index files
SmlPolysmFtpt   hisat2  /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/unaligned.fq 19366056    Unaligned reads removed by alignment of remaining reads to ORFs index files
SmlPolysmFtpt   hisat2  /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/orf_map.sam  6333028 Reads aligned to ORFs index files
SmlPolysmFtpt   riboviz.tools.trim_5p_mismatch  /exports/eddie3_homes_local/fanders6/riboviz/riboviz/20210107_fullsize_A-Dm_2014/tmp/SmlPolysmFtpt/orf_map_clean.sam    6332948 Reads after trimming of 5' mismatches and removal of those with more than 2 mismatches

100 million -> 6 million... big reduction!

TPMs collated:

[fanders6@login01(eddie) output]$ head TPMs_collated.tsv 
# Created by: RiboViz
# Date: 2021-01-07 21:37:54
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/rscripts/collate_tpms.R
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 15:14:12 GMT
ORF SmlPolysmFtpt
128up-RA    63.9
14-3-3epsilon-RA    2049.7
14-3-3zeta-RD   1402.6
140up-RA    2.8
18w-RA  42.6

[fanders6@login01(eddie) output]$ tail TPMs_collated.tsv 
zf30C-RA    62.4
zfh1-RE 155.5
zfh2-RB 0.2
zip-RB  0
zld-RD  0.3
zormin-RE   0
zpg-RA  0
zuc-RA  8.6
zyd-RI  0
zye-RA  2.5

tpms.tsv from output/SmlPolysmFtpt:

[fanders6@login01(eddie) SmlPolysmFtpt]$ head -n 20 tpms.tsv 
# Created by: RiboViz
# Date: 2021-01-07 21:37:51
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/rscripts/generate_stats_figs.R
# Version: commit 35ee466e45a415f5bee23e90b2d864f972fe7285 date 2020-11-03 15:14:12 GMT
ORF readcount   rpb tpm
128up-RA    224 0.194107452339688   63.9383798200976
14-3-3epsilon-RA    5202    6.22248803827751    2049.66784542179
14-3-3zeta-RD   3381    4.25818639798489    1402.63310850463
140up-RA    7   0.00840336134453781 2.76804060295591
18w-RA  544 0.129369797859691   42.6140015393469
26-29-p-RA  343 0.202121390689452   66.5781457223637
2mit-RA 47  0.0135329686150302  4.45771698598687
312-RA  82  0.10608020698577    34.9424840926827
4E-T-RB 0   0   0
5-HT1A-RB   8   0.00313479623824451 1.03259194906506
5-HT1B-RA   1   0.000526038926880589    0.173275555892558
5-HT2A-RA   0   0   0
5-HT2B-RE   11  0.00380491179522657 1.25332589044251
5-HT7-RA    0   0   0
5PtaseI-RB  16  0.0128  4.21627944642244

A huge step forward!

There does seem to be an issue in the outputs though: the 0 vs highest peaks of read counts aren't matching, so there's potentially an off-by- error, which may be due to stopinCDS being set to TRUE, as opposed to FALSE, which we usually run? Need to investigate this further!

Need to arrange a new meeting time with @IsabelBirds to chat through the output and pick it apart and see if it's what's expected!

FlicAnderson commented 3 years ago

@amandamok re: hackathon day 2 meeting request: Was it just the .SAM files you needed from this dataset run?
I can send you:

orf_map.sam - 1.2 GB - (reads aligned to ORFs index files)
and/or orf_map_clean.sam - 1.2 GB - (reads after trimming of 5' mismatches and removal of those with more than 2 mismatches) ... depending on what you need?

amandamok commented 3 years ago

@FlicAnderson I took a look at footprint densities around the start and stop codons to determine A site offset rules.

Output looks quite different to what I normally expect from yeast datasets (below):

footprint lengths: 27-31 nt
distance from read 5' end to A site: 14-18 nt
distance from read A site to 3' end: 9-11 nt

Diagnostic plots generated from orf_map.sam (similar for orf_map_clean.sam) below:

I suspect the difference in footprint lengths is due to a difference in the experimental protocol: "We purified mRNAs in small polysomes, away from monosomes (80S), ribosomal subunits (40S, 60S), and large polysomes." (taken from Methods section of Aspden et al. (2014).

P site assignment in Apsden et al. (2014) was performed as: "For a given open reading frame, the corresponding P-site position of filtered RPF reads (28–32 nt) was designated as follows: +12 offset for 28 and 29 nt, +13 for 30 to 31 nt, and +14 for 32 nt RPF (Chew et al., 2013; Bazzini et al., 2014)." This would correspond to A site assignments as +15 for 28 and 29 nt, +16 for 30 to 31 nt, and +17 for 32 nt RPF; and also correspondingly the distance between read A sites and 3' ends would be: 10 nt for 28 nt RPFs, 11 nt for 29 and 30 nt RPFs, and 12 for 31 and 32 nt RPFs. However, I don't see much support for those assignment rules at the start and stop codons. Not sure how to proceed with A site assignment, given these diagnostic plots ...

ewallace commented 3 years ago

These are cool plots! Definitely has me wondering if there's a problem with the adapter sequence or trimming.

I think it's time to set up a call with @IsabelBirds and Julie to discuss.

FlicAnderson commented 3 years ago

These are the PDF output plots riboviz generates for the dataset: 3ntframe_propbygene.pdf

3ntframe_propbygene

3nt_periodicity.pdf

pos_sp_rpf_norm_reads.pdf

pos_sp_rpf_norm_reads

read_lengths.pdf

read_lengths

startcodon_ribogrid.pdf

startcodon_ribogrid

startcodon_ribogridbar.pdf

startcodon_ribogridbar

ewallace commented 3 years ago

We met with Julie, @IsabelBirds , and @FlicAnderson on 15th Jan. Julie agrees that these outputs look consistent with her previous analyses, reflecting an early ribosome profiling dataset with poor frame information.

Request @IsabelBirds to update provenance file for annotation
Double-check .fa/.gff3 with check_gff_fasta.py
Check gene-wise outputs (TPMs) against a read density file from Julie, by scatter plot.

If that all looks good, we can close this issue ticket and move on to a new issue, with a recent and better Drosophila dataset, to compare framing analysis.

FlicAnderson commented 3 years ago

Merged in pull request #44 to add Isabel's provenance file to the drosophila-aspden-10 branch.

Still to do: Double-check .fa/.gff3 with check_gff_fasta.py Check gene-wise outputs (TPMs) against a read density file from Julie, by scatter plot

Julie has sent me the relevant file for checking the outputs, so it remains to do the plotting & checking Also I can run the shiny new check-fasta-gff code (https://github.com/riboviz/riboviz/pull/270) over these .fa & .gff3 files.

FlicAnderson commented 3 years ago

Following on from @ewallace 's request to allow him to compare TPMs to read density information from Julie Aspden (email of 18th Jan 2021, also sent to @ewallace), these are the TPMs requested. Drosophila_tpms.zip

ewallace commented 3 years ago

This looks good. I compared tpms from riboviz analysis with rpkms (differently normalized but same idea) from GEO:GSE60384, because the version on GEO already had the same ORF/gene ids.

small_polysome_comparison_2021-04-02

There's a lot of agreement - most genes have highly correlated values across 4 orders of magnitude.

It would be nice to spot-check ORFs with big differences and see if there's any obvious systematic reason for differential counting (read length, mitochondrial genes, etc). @IsabelBirds could you have a look?

The code used to join and generate the plots, in R markdown format, is attached Drosophila_ORF_output_check_smallpolysomes.Rmd.txt. That should be useful for follow-up work (change file extension back to .txt to run it in Rstudio).

I think we can close this issue, it's successful.

FlicAnderson commented 3 years ago

As a final check, here's the output for the latest version of check_fasta_gff from develop branch:

[fanders6@login01(eddie) ~]$ cat gff_fa_A-Dm_2014-9920713-node3g10.ecdf.ed.ac.uk.o
Running check_fasta_gff on transcriptome files from dataset: A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
Created by: RiboViz
Date: 2021-02-18 19:31:08.055514
Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
Version: commit 7e62c9f8da4f66f9c18aca30394f44fa4a7fa1ef date 2021-01-20 10:18:00-08:00

Configuration:
fasta_file  A-Dm_2014/Dmel_filtered.fasta
gff_file    A-Dm_2014/Dmel_filtered_UTRreplace.gff3
start_codons    ['ATG']

Issue summary:
Issue   Count
NoStopCodon 6953
NoStartCodon    6948
InternalStopCodon   6596
IncompleteFeature   33
SequenceNotInGFF    0
SequenceNotInFASTA  0
MultipleCDS 0
DuplicateFeatureId  0
NoIdName    0
DuplicateFeatureIds 0
check_fasta_gff for A-Dm_2014 complete

And the .tsv output file with further details (saved as .txt to allow upload here): features_issues.txt

I've not checked what the total number of genes is 'properly' by loading the .gff into anything, but if we say that each gene has ~ 1 x each of 5' UTR, CDS & 3' UTR, and we subtract the first 3 rows of comments in the gff file and remove this from total number of lines in the file (41,487), then this gives about ~41,484 features of any kind from ~ 13,828 genes.

Given that 41k estimate of the number of features, does having ~7k errors/issues for each of the codon-related issues seem reasonable or high? @IsabelBirds @ewallace

(I realise you might not have a chance to check this if you're busy @IsabelBirds, just tagging you in in case you have time for a quick opinion, and also for future info)

Knowing about this (check_fasta_gff.py) might also be useful for @3mma-mack since she's putting together a transcriptome for a different dataset.

ewallace commented 3 years ago

@FlicAnderson I'm so glad you checked that. Yes, we should run check_fasta_gff.py on all new transcriptome sets.

It looks like all the transcripts on the minus strand are wrong in that they've been included in reverse complement. I came to this conclusion because:

About half of the transcripts have a problem with "no start codon" or "no stop codon"
Spot-checking the top of the features_issues.txt file, Ir21a-RB and Cda5-RA are on the minus strand, have TTA as start codon which is reverse complement of TAA, and CAT as stop which is reverse complement of ATG

So we need to fix this by making a new transcriptome fasta / gff that appropriately takes the reverse complement of transcripts on the minus strand.

FlicAnderson commented 3 years ago

Arranged a meeting with Isabel (26th Mar) where we'll come up with a plan for next-steps on fixing the transcriptome files.

IsabelBirds commented 3 years ago

@FlicAnderson I think I've fixed this - the transcripts on the minus strand should all be reverse complemented now. https://github.com/IsabelBirds/Stuff_to_share/tree/main/Riboviz_work/2_Processed_data/Dmel_genomes - filtered versions for one CDS per gene.

I'll update the provenance file once we're happy it's sorted :)

FlicAnderson commented 3 years ago

Hi @IsabelBirds - sorry it's taken so long to check this, I've run these new files through check-fasta-gff again and here's the new output:

[fanders6@login02(eddie) ~]$ cat gff_fa_A-Dm_2014-11333517-node3g10.ecdf.ed.ac.uk.o
Running check_fasta_gff on transcriptome files from dataset: A-Dm_2014
now in folder: /home/fanders6/riboviz/riboviz ready to run
Created by: RiboViz
Date: 2021-04-19 14:07:55.957977
Command-line tool: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/riboviz/tools/check_fasta_gff.py
Version: commit 5f23a69d0dee1a2917440ce9eec86eadcd195eda date 2021-04-15 09:11:14-07:00

Configuration:
fasta_file  A-Dm_2014/Dmel_filtered.fasta
gff_file    A-Dm_2014/Dmel_filtered_UTRreplace.gff3
start_codons    ['ATG']

Issue summary:
Issue   Count
SequenceNotInFASTA  4086
InternalStopCodon   55
NoStopCodon 27
IncompleteFeature   24
NoStartCodon    11
NoIdName    0
SequenceNotInGFF    0
DuplicateFeatureId  0
DuplicateFeatureIds 0
MultipleCDS 0
check_fasta_gff for A-Dm_2014 complete

There's still quite a lot of sequences in the GFF which don't seem to be in the FASTA (if I'm correctly interpreting SequenceNotInFASTA), but in terms of the number of sequences with issues (e.g. no start/stop codon), it's far far lower and I think this is looking much better.

Do you have any ideas on why SequenceNotInFASTA might be quite high?

If you think this is an acceptable level of issues, I can proceed by re-running the full dataset with these files so we can double-check what the output plots look like now.

IsabelBirds commented 3 years ago

Thanks @FlicAnderson - I'll take a look this week and see why we're getting so many SequenceNotInFasta - not sure why that's coming up!

IsabelBirds commented 2 years ago

Coming back to this! I've updated my repo with the D.mel annotations and started testing locally.

Got errors at static html step, Flic and I are looking into this.

To Do:

[ ] Update provenance files
[ ] Test with Mics data set
[ ] Talk to Elton

riboviz / example-datasets

Add Aspden lab Drosophila example dataset #10

Created Downsampled Fastq File