Add new dataset to look at the effect of the RNA-binding protein Bfr1 on translation in S. cerevisiae. Note that the data were shared privately by Gunter Kramer, and are not on SRA as far as I know.

[x] Create a new branch of example-datasets, Castells-Bfr1-107
[x] Identify paper or data source - Translational Regulation of Pmt1 and Pmt2 by Bfr1 Affects Unfolded Protein O-Mannosylation, Castells-Ballester et al 2019, PMID: 31835530
[x] Identify the species and strain used - S. cerevisiae wild type (BY4741) and bfr1Δ. example-datasets has these.
[x] Identify the ribosome profiling samples from the dataset - yes, no matched RNA-seq I think.
[x] Identify adapter sequence - provide sequence.
[x] Confirm or deny presence of UMIs and barcodes if used - describe if present.
[x] If UMIs are present, create UMI regular expression.
[x] Using information gathered, create config file.
[x] Download sample data - shared by Gunter Kramer, they are on lab datastore at wallace_rna/bigdata/2021/Bfr1_ribosomeprofiling_Kramer. We will need to copy over to Eddie, collate the files and remove the spaces from filenames.
[x] (optional) Create downsampled data and fast test run on that.
[x] Test run of full sized dataset.
[x] Look at results - check for 3nt periodicity in coding regions, most common read lengths being 28-32 nt, and clear start and stop profiles.
[x] Troubleshoot as necessary and discuss on issue ticket.
[x] Update genus-level README.md and provenance section of config file.
[ ] Put in pull request to add to repository.

The paper refers extensively to a published protocol. The linker sequences below are taken from that protocol, "Selective ribosome profiling to study interactions of translating ribosomes in yeast":

Linker 3-L1 with 5′ adenylation and 3′ dideoxy-Cytidine, unique molecular identifiers (’NN…’) (IDT, RNase-free HPLC purification): 5′-/5rApp/NNNNNATCGTAGATCGGAAGAGCACACGTCTGAA/3ddC/-3′

Linker reverse transcription L(rt) with 5′ phosphorylated, unique molecular identifiers (’NN…’) (IDT, RNase-free HPLC purification): 5′-/5Phos/NNAGATCGGAAGAGCGTCGTGTAGGGAAAGAG/iSp18/GTGACTGGAGTTCAGACGTGTGCTC-3′

That means:

adapter is ATCGTAGATCGGAAGAGCACACGTCTGAA (Note: the excel file sent by the authors specifies a longer adaptor that is identical at the start: ATCGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC. Using the shorter one should be fine?)
UMI has 2 parts, 2nt at 5' end, then 5nt at 3' end: ^(?P<umi_1>.{2}).+(?P<umi_2>.{5})$
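
As a sanity check on the adapter and the UMI layout, here is a minimal sketch applied to the first read printed from wtRep1 further down this ticket. It is not what cutadapt or the riboviz extractUmis step actually runs: the trimming is a plain string match on the first 12 nt of the adapter, and the `sed -E` patterns are an assumed POSIX equivalent of the named-group regular expression above. The variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: toy adapter trimming and UMI splitting on a single read.
short_adapter="ATCGTAGATCGGAAGAGCACACGTCTGAA"
long_adapter="ATCGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Is the short adapter a prefix of the longer one from the authors' spreadsheet?
case "$long_adapter" in
  "$short_adapter"*) adapter_is_prefix=yes ;;
  *) adapter_is_prefix=no ;;
esac

# First read of Bfr_data_wtRep1.fastq.gz, as printed in this ticket.
read_seq="CGCCCCAAAATGGTTTTAGTTCNAGATTTATTGCTTTGGATCGTAGATCGG"

# Toy trimming: drop everything from the first 12 nt of the adapter onwards.
trimmed="${read_seq%%ATCGTAGATCGG*}"

# Split into 2 nt UMI, insert, 5 nt UMI (same layout as the regex above).
umi_1=$(printf '%s' "$trimmed" | sed -E 's/^(.{2}).+(.{5})$/\1/')
umi_2=$(printf '%s' "$trimmed" | sed -E 's/^(.{2}).+(.{5})$/\2/')
insert=$(printf '%s' "$trimmed" | sed -E 's/^.{2}(.+).{5}$/\1/')

echo "short adapter is prefix of long adapter: $adapter_is_prefix"
echo "umi_1=$umi_1 umi_2=$umi_2 insert=$insert (${#insert} nt)"
```

For this one read the insert length falls in the range expected for ribosome footprints, which is at least consistent with the adapter and UMI layout being right.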

Download sample data

Data were shared by Gunter Kramer, they are on lab datastore at wallace_rna/bigdata/2021/Bfr1_ribosomeprofiling_Kramer. We will need to copy over to Eddie, collate the files and remove the spaces from filenames.
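
Removing spaces from filenames could be done with a small loop like the one below. This is a sketch run on a throwaway directory: the filename with a space is hypothetical (modelled on the Excel lock file in the listing), and nothing here touches the datastore.

```shell
#!/bin/sh
# Sketch only: replace spaces with underscores in filenames, in a temp dir.
workdir=$(mktemp -d)

# Hypothetical filenames, one with a space and one without.
touch "$workdir/Bfr data_wtRep1.fastq.gz"
touch "$workdir/Bfr_data_wtRep2.fastq.gz"

# Loop only over names that contain a space.
for f in "$workdir"/*" "*; do
  [ -e "$f" ] || continue          # glob matched nothing
  base=$(basename "$f")
  new_base=$(printf '%s' "$base" | tr ' ' '_')
  mv -- "$f" "$workdir/$new_base"
done

ls "$workdir"
```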

First we logged in to an Eddie staging node to download the data.

$ fastq_dir=/exports/csce/datastore/biology/groups/wallace_rna/bigdata/2021/Bfr1_ribosomeprofiling_Kramer

$ ls -l  ${fastq_dir}
total 4984193
-rw------- 1 ewallac2 Domain Users        165 May 11 11:05 ~$Bfr data_Bfr1_RP.xlsx
-rw------- 1 ewallac2 Domain Users       8648 Jun 24  2021 Bfr_data_Bfr1_RP.xlsx
-rw------- 1 ewallac2 Domain Users  761579861 Jun 24  2021 Bfr_data_DeltaBfrRep1.fastq.gz
-rw------- 1 ewallac2 Domain Users  388571886 Jun 24  2021 Bfr_data_DeltaBfrRep2_1.fastq.gz
-rw------- 1 ewallac2 Domain Users  461166456 Jun 24  2021 Bfr_data_DeltaBfrRep2_2fastq.gz
-rw------- 1 ewallac2 Domain Users  554229827 Jun 25  2021 Bfr_data_DeltaBfrRep2_3.fastq.gz
-rw------- 1 ewallac2 Domain Users  600514575 Jun 25  2021 Bfr_data_DeltaBfrRep2_4.fastq.gz
-rw------- 1 ewallac2 Domain Users  784814528 Jun 24  2021 Bfr_data_DeltaBfrRep2.fastq.gz
-rw------- 1 ewallac2 Domain Users  467822708 Jun 24  2021 Bfr_data_wtRep1.fastq.gz
-rw------- 1 ewallac2 Domain Users 1084257521 Jun 24  2021 Bfr_data_wtRep2.fastq.gz
-rw------- 1 ewallac2 Domain Users        449 Jun 25  2021 md5sums.txt

Note that, as Bfr_data_Bfr1_RP.xlsx says of sample DeltaBfrRep2, "This sample was sequenced multiple times to get more reads, so you have to merge these files." For now I will NOT merge these files, because I don't think an initial analysis needs 2.5GB of reads rather than 0.7GB. We could revisit that if more depth is needed for an analysis later.
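
If we do revisit merging, gzip files can be concatenated directly, because a file made of several gzip members decompresses to the concatenation of their contents. A minimal sketch on toy files (the filenames are hypothetical, and we have not run this on the real DeltaBfrRep2 files):

```shell
#!/bin/sh
# Sketch only: merge two gzipped fastq files by concatenation and count reads.
workdir=$(mktemp -d)

# Toy inputs: one read in the first file, two reads in the second.
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$workdir/rep2_run1.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n@r3\nGGCA\n+\nIIII\n' | gzip > "$workdir/rep2_run2.fastq.gz"

# Concatenating gzip members gives a valid gzip file.
cat "$workdir/rep2_run1.fastq.gz" "$workdir/rep2_run2.fastq.gz" \
  > "$workdir/rep2_merged.fastq.gz"

# Each fastq read takes 4 lines.
n_lines=$(gzip -dc "$workdir/rep2_merged.fastq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "merged file has $n_reads reads"
```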

To copy these to the fastq datafiles in the group storage space on eddie, we adapted Flic's standard abbreviation CB-Sc-Bfr1_2019 (for Castells-Ballester, S. cerevisiae, Bfr1, 2019)

$ fastq_dir_eddie=/exports/csce/eddie/biology/groups/wallace_rna/fastq-datafiles/CB-Sc-Bfr1_2019
$ mkdir ${fastq_dir_eddie}
$ cp ${fastq_dir}/*Rep1.fastq.gz ${fastq_dir_eddie}
$ cp ${fastq_dir}/*Rep2.fastq.gz ${fastq_dir_eddie}
$ ls -l ${fastq_dir_eddie}

total 3026176
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  761579861 May 11 11:29 Bfr_data_DeltaBfrRep1.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  784814528 May 11 11:30 Bfr_data_DeltaBfrRep2.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  467822708 May 11 11:29 Bfr_data_wtRep1.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna 1084257521 May 11 11:30 Bfr_data_wtRep2.fastq.gz
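
We did not verify the copies against md5sums.txt. If that file is in the standard `md5sum` output format (an assumption, we have not looked inside it), a check could look like this sketch, shown here on a toy file in temporary directories:

```shell
#!/bin/sh
# Sketch only: verify a copied file against a checksum list with md5sum -c.
src=$(mktemp -d)
dest=$(mktemp -d)

# Toy source file and its checksum list (hypothetical names).
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$src/sample.fastq.gz"
(cd "$src" && md5sum sample.fastq.gz > md5sums.txt)

cp "$src/sample.fastq.gz" "$dest/"

# md5sum -c resolves filenames relative to the current directory,
# so run it from the destination directory.
if (cd "$dest" && md5sum -c "$src/md5sums.txt" > /dev/null); then
  copy_ok=yes
else
  copy_ok=no
fi
echo "copy verified: $copy_ok"
```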

Next, create a downsampled dataset to use for troubleshooting. Let's take the initial 100,000 (10^5) reads from wtRep1. This means we decompress, take the initial 400,000 lines of the file (because each read takes 4 lines in a fastq file, Amy), then compress again. Before that we print the first 5 reads (20 lines) to the terminal for a reality check, confirming that the file really is in fastq format and that the adapter sequence is present.

$ gzip -dc ${fastq_dir_eddie}/Bfr_data_wtRep1.fastq.gz | head -n 20
@NB551333:149:H7LJGBGXB:1:11101:5518:1077 1:N:0:CGTACG
CGCCCCAAAATGGTTTTAGTTCNAGATTTATTGCTTTGGATCGTAGATCGG
+
6AAAAEEEEEEEEEEEEEEEAE#EEEEEEEE6EA//EEEAEEEEEE/EEEE
@NB551333:149:H7LJGBGXB:1:11101:22122:1080 1:N:0:CGTACG
CCGTATGGAATCTAAACCATAGTTATGACGATTGCTCTTGGTAATCGTAGA
+
AAAAAAEEEEAEAEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB551333:149:H7LJGBGXB:1:11101:14056:1080 1:N:0:CGTACG
CGTTTTCCACGTTCTAGCATTCAAGGTCCCTTAGCATCGTAGATCGGAAGA
+
AAAAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEAEEEEE
@NB551333:149:H7LJGBGXB:1:11101:21014:1083 1:N:0:CGTACG
AGTACGCGAAACTCAGGTGCTGCAATCTGTAGAATCGTAGATCGGAAGAGC
+
AAAAAEEEEEEE/EEE/EEEEEEEEEEAEEEEEEEE/AEEEE<EAEEEEE<
@NB551333:149:H7LJGBGXB:1:11101:11242:1083 1:N:0:CGTACG
CCATCGGGTTATGCGTGTGTTACATGAACTTAGTGGATAATCGTAGATCGG
+
AAAAAEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

$ gzip -dc ${fastq_dir_eddie}/Bfr_data_wtRep1.fastq.gz | head -n 400000 | gzip > ${fastq_dir_eddie}/Bfr_data_wtRep1_init100000.fastq.gz

$ ls -l ${fastq_dir_eddie}
total 3028864
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  761579861 May 11 11:29 Bfr_data_DeltaBfrRep1.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  784814528 May 11 11:30 Bfr_data_DeltaBfrRep2.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  467822708 May 11 11:29 Bfr_data_wtRep1.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna    2670212 May 11 11:53 Bfr_data_wtRep1_init100000.fastq.gz
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna 1084257521 May 11 11:30 Bfr_data_wtRep2.fastq.gz
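
To double-check that a downsampled file holds the number of reads we intended, we can count lines and divide by 4. The sketch below does the same downsampling on a toy 5-read file and checks the result; the file names and the `n_reads` variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: take the first n reads of a gzipped fastq and count them.
workdir=$(mktemp -d)

# Toy fastq with 5 reads.
i=1
while [ "$i" -le 5 ]; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
  i=$((i + 1))
done | gzip > "$workdir/full.fastq.gz"

# First n reads = first 4*n lines.
n_reads=3
gzip -dc "$workdir/full.fastq.gz" \
  | head -n $((4 * n_reads)) \
  | gzip > "$workdir/init${n_reads}.fastq.gz"

# Count reads in the downsampled file.
n_lines=$(gzip -dc "$workdir/init${n_reads}.fastq.gz" | wc -l)
n_sub=$((n_lines / 4))
echo "downsampled file has $n_sub reads"
```

With `n_reads=100000` the same arithmetic gives the 400,000 lines used above.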

Running on test dataset with 100,000 reads

Logged on freshly to an Eddie interactive session with

$ qlogin -pe interactivemem 4 -l h_vmem=4G
...
(base) [ewallac2@node1h20(eddie) ~]$

This logged us in to node1h20, where @asnewell previously had problems? If anything weird happens, we can try to login to a different interactive node.

Next, set up the riboviz environment and check that the example-datasets repository is up to date and on the right branch.

$ source set-riboviz-env.sh
$ cd riboviz/example-datasets/
$ git pull
$ git checkout Castells-Bfr1-107
Branch Castells-Bfr1-107 set up to track remote branch Castells-Bfr1-107 from origin.
Switched to a new branch 'Castells-Bfr1-107'

Next we try to construct a command to run on the subsampled data. (Note: we are running on a version of CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml at commit 9b2b997 that has the full-size datasets commented out and only processes the sample sub: Bfr_data_wtRep1_init100000.fastq.gz.)

First we navigate to riboviz/riboviz/ and symlink the config file.

$ cd /home/$USER/riboviz/riboviz
$ ln -s /home/$USER/riboviz/example-datasets/fungi/saccharomyces/CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml 

$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false --validate_only

N E X T F L O W  ~  version 20.04.1

Launching `prep_riboviz.nf` [backstabbing_mahavira] - revision: acd7535f8d
Validating configuration only
samples_dir: .
organisms_dir: .
data_dir: .
No such directory (dir_in): ./input

That failed because we didn't specify the input directory, so the workflow looked for ./input relative to the current directory. The next step is to specify the correct input directories. We will first try environment variables, as described in "environment variables and configuration tokens".

We create a RIBOVIZ_SAMPLES directory, and then symlink the fastq file directory to ${RIBOVIZ_SAMPLES}/input

$ RIBOVIZ_SAMPLES=/exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
$ mkdir ${RIBOVIZ_SAMPLES}
$ cd ${RIBOVIZ_SAMPLES}
$ ln -s /exports/csce/eddie/biology/groups/wallace_rna/fastq-datafiles/CB-Sc-Bfr1_2019 input

$ cd /home/$USER/riboviz/riboviz
$ ls ${RIBOVIZ_SAMPLES}/input
Bfr_data_DeltaBfrRep1.fastq.gz  Bfr_data_wtRep1.fastq.gz             Bfr_data_wtRep2.fastq.gz
Bfr_data_DeltaBfrRep2.fastq.gz  Bfr_data_wtRep1_init100000.fastq.gz

$ export RIBOVIZ_SAMPLES=/exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
$ export RIBOVIZ_ORGANISMS=/home/${USER}/riboviz/example-datasets/fungi/saccharomyces
$ export RIBOVIZ_DATA=/home/${USER}/riboviz/riboviz/data

$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false --validate_only

N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [exotic_colden] - revision: acd7535f8d
Validating configuration only
samples_dir: /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
organisms_dir: /home/ewallac2/riboviz/example-datasets/fungi/saccharomyces
data_dir: /home/ewallac2/riboviz/riboviz/data
Validated configuration
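
A small pre-flight check before calling nextflow might catch the "No such directory" error earlier. The sketch below is an assumption about what is worth checking, not part of riboviz: `check_dir` is a hypothetical helper, and the directories are throwaway stand-ins for the real RIBOVIZ_SAMPLES layout with its symlinked input directory.

```shell
#!/bin/sh
# Sketch only: check that a directory (or a symlink to one) exists.
check_dir() {
  # $1 = label, $2 = path
  if [ -d "$2" ]; then
    echo "OK: $1 -> $2"
  else
    echo "MISSING: $1 -> $2" >&2
    return 1
  fi
}

# Throwaway layout mimicking the samples directory with a symlinked input.
demo=$(mktemp -d)
mkdir "$demo/fastq" "$demo/samples"
ln -s "$demo/fastq" "$demo/samples/input"

RIBOVIZ_SAMPLES="$demo/samples"
check_dir "samples" "${RIBOVIZ_SAMPLES}"
check_dir "input" "${RIBOVIZ_SAMPLES}/input"
```

`[ -d ... ]` follows symlinks, so the symlinked input directory passes the check as long as its target exists.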

Victory! We can find all the files. What will happen when we try to run the actual dataset? Same command as above but without --validate_only, same exported variables.

$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false 

N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [adoring_stone] - revision: acd7535f8d
samples_dir: /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
organisms_dir: /home/ewallac2/riboviz/example-datasets/fungi/saccharomyces
data_dir: /home/ewallac2/riboviz/riboviz/data
[4e/99de94] Submitted process > buildIndicesrRNA (yeast_rRNA)
[ba/add629] Submitted process > cutAdapters (sub)
[9b/5cfd9c] Submitted process > buildIndicesORF (yeast_CDS_w_250)
[e2/efbf15] Submitted process > createVizParamsConfigFile
[26/276bc0] Submitted process > createInteractiveVizParamsConfigFile
[f6/0d2405] Submitted process > extractUmis (sub)
[87/294e1e] Submitted process > hisat2rRNA (sub)
[ed/69c0b5] Submitted process > hisat2ORF (sub)
[cf/95701b] Submitted process > trim5pMismatches (sub)
[a6/9f4835] Submitted process > samViewSort (sub)
[45/36b2f8] Submitted process > outputBams (sub)
[f2/2912c1] Submitted process > makeBedgraphs (sub)
[23/6174c2] Submitted process > bamToH5 (sub)
[56/991285] Submitted process > generateStatsFigs (sub)
...

This ran and looked like it was working, but then bamToH5 (sub) was very slow (30mins?) and generateStatsFigs (sub) was even slower (45mins+). We suspect it got stuck. Abandoning for now. Will try again on a different node, probably. I checked that the output files from earlier parts looked about right with:

$ ls -l ${RIBOVIZ_SAMPLES}/output/sub
total 53888
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna   135951 May 11 16:11 minus.bedgraph
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  1212439 May 11 16:11 plus.bedgraph
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  1660424 May 11 16:11 sub.bam
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna   320040 May 11 16:11 sub.bam.bai
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna   278439 May 11 16:24 sub.h5
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna 51008296 May 11 16:24 sub.h5.1

Tried again from screen (remote session). Again logged in to node1h20

$ source set-riboviz-env.sh
$ cd /home/$USER/riboviz/riboviz
$ export RIBOVIZ_SAMPLES=/exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
$ export RIBOVIZ_ORGANISMS=/home/${USER}/riboviz/example-datasets/fungi/saccharomyces
$ export RIBOVIZ_DATA=/home/${USER}/riboviz/riboviz/data
$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [happy_albattani] - revision: acd7535f8d
samples_dir: /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
organisms_dir: /home/ewallac2/riboviz/example-datasets/fungi/saccharomyces
data_dir: /home/ewallac2/riboviz/riboviz/data
[7d/aa260c] Submitted process > buildIndicesORF (yeast_CDS_w_250)
[fb/b90f10] Submitted process > cutAdapters (sub)
[07/3d600c] Submitted process > buildIndicesrRNA (yeast_rRNA)
[2e/3646ae] Submitted process > createVizParamsConfigFile
[df/96b320] Submitted process > createInteractiveVizParamsConfigFile
[8b/2a3ec3] Submitted process > extractUmis (sub)
[89/562dd9] Submitted process > hisat2rRNA (sub)
[cf/8ef719] Submitted process > hisat2ORF (sub)
[62/ac3981] Submitted process > trim5pMismatches (sub)
[04/8376f6] Submitted process > samViewSort (sub)
[eb/d73e08] Submitted process > outputBams (sub)
[ed/6c635e] Submitted process > makeBedgraphs (sub)
[98/15bc0c] Submitted process > bamToH5 (sub)
... # did not copy generateStatsFigs line properly here.
Finished processing sample: sub
[bc/a5f6ed] Submitted process > staticHTML (sub)
[cd/7fd80e] Submitted process > renameTpms (sub)
[e1/dcf2da] Submitted process > collateTpms (sub)
Finished visualising sample: sub
[c0/50af69] Submitted process > countReads
Workflow finished! (OK)

Again, sprinted through the first steps and slowed down on bamToH5. Left running ... ... about 4h later, checked in, finished, looks good.

$ ls -l ${RIBOVIZ_SAMPLES}/output/
total 260
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna   342 May 11 17:38 interactive_viz_config.yaml
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna  1601 May 11 18:58 read_counts_per_file.tsv
drwx--S--- 2 ewallac2 datastore_biology_groups_wallace_rna  2048 May 11 18:58 sub
-rw------- 1 ewallac2 datastore_biology_groups_wallace_rna 70225 May 11 18:57 TPMs_all_CDS_all_samples.tsv

$ more ${RIBOVIZ_SAMPLES}/output/read_counts_per_file.tsv 
# Created by: riboviz
# Date: 2022-05-11 18:58:12.267001
# Command-line tool: /exports/eddie3_homes_local/ewallac2/riboviz/riboviz/riboviz/tools/count_reads.py
# File: /exports/eddie3_homes_local/ewallac2/riboviz/riboviz/riboviz/count_reads.py
# Version: commit cc97e742686617dea1d34d2387fa0e4d63a5f9d5 date 2022-05-09 23:38:36+02:00
SampleName  Program File    NumReads    Description
sub input   /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/input/Bfr_data_wtRep1_init100000.fastq.gz  100000  input
sub cutadapt    /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/trim.fq    99993   Reads after removal of sequencing library adapters
sub hisat2  /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/nonrRNA.fq 56844   Reads that did not align to rRNA or other contaminating reads in rRNA index files
sub hisat2  /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/rRNA_map.sam   43126   Reads aligned to rRNA and other contaminating reads in rRNA index files
sub hisat2  /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/unaligned.fq   21205   Unaligned reads removed by alignment of remaining reads to ORFs index files
sub hisat2  /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/orf_map.sam    41930   Reads aligned to ORFs index files
sub riboviz.tools.trim_5p_mismatch  /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/tmp/sub/orf_map_clean.sam  41930   Reads after trimming of 5' mismatches and removal of those with more than 2 mismatches
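
For a quick feel for these numbers, the sketch below turns the NumReads values above into percentages of adapter-trimmed reads. The `pct` helper is made up for illustration, and it assumes that the NumReads values can be compared directly across rows, which may not hold exactly for every step, so treat the percentages as rough.

```shell
#!/bin/sh
# Sketch only: percentages from the NumReads column of read_counts_per_file.tsv.
trimmed=99993     # cutadapt, trim.fq
non_rrna=56844    # hisat2, nonrRNA.fq
rrna=43126        # hisat2, rRNA_map.sam

# Hypothetical helper: a as a percentage of b, to one decimal place.
pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", 100 * a / b }'
}

pct_rrna=$(pct "$rrna" "$trimmed")
pct_non_rrna=$(pct "$non_rrna" "$trimmed")

echo "rRNA and other contaminants: ${pct_rrna}% of trimmed reads"
echo "non-rRNA:                    ${pct_non_rrna}% of trimmed reads"
```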

I moved the subsampled output to a new directory, output_sub_init100000 - and deleted tmp

$ mv ${RIBOVIZ_SAMPLES}/output ${RIBOVIZ_SAMPLES}/output_sub_init100000
$ rm -rf ${RIBOVIZ_SAMPLES}/tmp

Run on dataset with all 4 samples

Updated config file in commit 8998b67, to process all samples with num_processes = 4.

Trying again on Eddie interactive node.

$ screen
$ qlogin -pe interactivemem 4 -l h_vmem=6G
...
Your interactive job 19545239 has been successfully scheduled.
(base) [ewallac2@node1h21(eddie) ~]$ source set-riboviz-env.sh
(riboviz) (base) [ewallac2@node1h21(eddie) ~]$ export PS1="$ " # just shorten command prompt
$ cd /home/$USER/riboviz/riboviz
$ export RIBOVIZ_SAMPLES=/exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
$ export RIBOVIZ_ORGANISMS=/home/${USER}/riboviz/example-datasets/fungi/saccharomyces
$ export RIBOVIZ_DATA=/home/${USER}/riboviz/riboviz/data
$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false --validate_only
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [golden_meucci] - revision: acd7535f8d
Validating configuration only
samples_dir: /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
organisms_dir: /home/ewallac2/riboviz/example-datasets/fungi/saccharomyces
data_dir: /home/ewallac2/riboviz/riboviz/data
Validated configuration
$ nextflow run prep_riboviz.nf \
  -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml \
  -work-dir /exports/eddie/scratch/${USER}/work \
  -ansi-log false
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [big_sax] - revision: acd7535f8d
samples_dir: /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
organisms_dir: /home/ewallac2/riboviz/example-datasets/fungi/saccharomyces
data_dir: /home/ewallac2/riboviz/riboviz/data
[8b/f05bb4] Submitted process > cutAdapters (DeltaBfr_Rep1)
[0d/e6fe6b] Submitted process > cutAdapters (WT_Rep2)
[95/e9202a] Submitted process > createVizParamsConfigFile
[1a/a3ddee] Submitted process > cutAdapters (DeltaBfr_Rep2)
[0c/375d74] Submitted process > buildIndicesrRNA (yeast_rRNA)
[a9/650ef4] Submitted process > buildIndicesORF (yeast_CDS_w_250)
[18/b318e8] Submitted process > cutAdapters (WT_Rep1)
[4d/bfcd68] Submitted process > createInteractiveVizParamsConfigFile
[a0/2c58b8] Submitted process > extractUmis (WT_Rep1)
[ef/cec40a] Submitted process > extractUmis (DeltaBfr_Rep1)
[a5/59383e] Submitted process > extractUmis (DeltaBfr_Rep2)
[7f/9f74ed] Submitted process > hisat2rRNA (WT_Rep1)
[29/b7bb27] Submitted process > extractUmis (WT_Rep2)
[2f/d70409] Submitted process > hisat2ORF (WT_Rep1)
[d9/bfc870] Submitted process > trim5pMismatches (WT_Rep1)
[dc/f06db8] Submitted process > samViewSort (WT_Rep1)
[e7/81a626] Submitted process > outputBams (WT_Rep1)
[8e/75ce5a] Submitted process > makeBedgraphs (WT_Rep1)
[1f/934ca5] Submitted process > bamToH5 (WT_Rep1)
[aa/9e6477] Submitted process > hisat2rRNA (DeltaBfr_Rep1)
[51/92d0ed] Submitted process > hisat2rRNA (DeltaBfr_Rep2)
[a8/f3aa7a] Submitted process > hisat2ORF (DeltaBfr_Rep1)
[76/369ef4] Submitted process > hisat2ORF (DeltaBfr_Rep2)
[98/5fac9d] Submitted process > trim5pMismatches (DeltaBfr_Rep1)
[61/4c81df] Submitted process > trim5pMismatches (DeltaBfr_Rep2)
[7b/ab79f7] Submitted process > samViewSort (DeltaBfr_Rep2)
[97/ba7c78] Submitted process > samViewSort (DeltaBfr_Rep1)
[0d/268c39] Submitted process > outputBams (DeltaBfr_Rep2)
[99/7b5df7] Submitted process > bamToH5 (DeltaBfr_Rep2)
[aa/87ad5e] Submitted process > makeBedgraphs (DeltaBfr_Rep2)
[1d/e17d58] Submitted process > generateStatsFigs (WT_Rep1)
[c3/701527] Submitted process > outputBams (DeltaBfr_Rep1)
[eb/daf151] Submitted process > makeBedgraphs (DeltaBfr_Rep1)
[8c/7c7cbe] Submitted process > bamToH5 (DeltaBfr_Rep1)
[4f/d2692a] Submitted process > generateStatsFigs (DeltaBfr_Rep2)
[6a/7cde88] Submitted process > hisat2rRNA (WT_Rep2)
[ff/999eeb] Submitted process > generateStatsFigs (DeltaBfr_Rep1)
[55/7847ad] Submitted process > hisat2ORF (WT_Rep2)
[4e/0526f3] Submitted process > trim5pMismatches (WT_Rep2)
[dd/f6946d] Submitted process > samViewSort (WT_Rep2)
[d5/a79501] Submitted process > outputBams (WT_Rep2)
[1f/a0d76d] Submitted process > bamToH5 (WT_Rep2)
[c1/d2f700] Submitted process > makeBedgraphs (WT_Rep2)
[40/fbf219] Submitted process > generateStatsFigs (WT_Rep2)
Finished processing sample: DeltaBfr_Rep2
[3a/b10216] Submitted process > renameTpms (DeltaBfr_Rep2)
[3f/56bbf1] Submitted process > staticHTML (DeltaBfr_Rep2)
Finished visualising sample: DeltaBfr_Rep2
Finished processing sample: DeltaBfr_Rep1
[a1/b5cbf5] Submitted process > renameTpms (DeltaBfr_Rep1)
[4a/6a697c] Submitted process > staticHTML (DeltaBfr_Rep1)
Finished visualising sample: DeltaBfr_Rep1
Finished processing sample: WT_Rep1
[97/3ede1e] Submitted process > renameTpms (WT_Rep1)
[d7/2b3a6f] Submitted process > staticHTML (WT_Rep1)
Finished visualising sample: WT_Rep1
Finished processing sample: WT_Rep2
[19/d0c3ea] Submitted process > renameTpms (WT_Rep2)
[77/d71ba3] Submitted process > staticHTML (WT_Rep2)
[97/3ab52b] Submitted process > collateTpms (DeltaBfr_Rep2, DeltaBfr_Rep1, WT_Rep1, WT_Rep2)
Finished visualising sample: WT_Rep2
[42/e8957b] Submitted process > countReads
Workflow finished! (OK)
$ nextflow log
...
2022-05-11 22:40:13     7h 45s          big_sax                 OK      acd7535f8d      357d19ec-f24e-483b-97f2-367ad679fb78    nextflow run prep_riboviz.nf -params-file CastellsBallester_2019_Bfr1_4samples_Scerevisiae.yaml -work-dir /exports/eddie/scratch/ewallac2/work -ansi-log false

Ran overnight in the end: about 7 hours (nextflow log reports a duration of 7h 45s). Quite long!

Next: check the output.

Post-mortem of reads

We discussed how to view the output of a nextflow run. Unfortunately, we realised that we had forgotten to request the HTML report with the -with-report option. Still, we checked the logs:

$ nextflow log big_sax -f 'process,exit,hash,duration'
cutAdapters 0   8b/f05bb4   6m 51s
cutAdapters 0   0d/e6fe6b   11m 8s
createVizParamsConfigFile   0   95/e9202a   140ms
cutAdapters 0   1a/a3ddee   7m 15s
buildIndicesrRNA    0   0c/375d74   6.9s
buildIndicesORF 0   a9/650ef4   6.6s
cutAdapters 0   18/b318e8   3m 45s
createInteractiveVizParamsConfigFile    0   4d/bfcd68   216ms
extractUmis 0   a0/2c58b8   7m 7s
extractUmis 0   ef/cec40a   13m 9s
extractUmis 0   a5/59383e   14m 20s
hisat2rRNA  0   7f/9f74ed   3m 52s
extractUmis 0   29/b7bb27   36m 6s
hisat2ORF   0   2f/d70409   2m 16s
trim5pMismatches    0   d9/bfc870   1m 15s
samViewSort 0   dc/f06db8   1m 5s
outputBams  0   e7/81a626   2.2s
makeBedgraphs   0   8e/75ce5a   28.5s
bamToH5 0   1f/934ca5   17m 8s
hisat2rRNA  0   aa/9e6477   10m 21s
hisat2rRNA  0   51/92d0ed   13m 2s
hisat2ORF   0   a8/f3aa7a   4m 43s
hisat2ORF   0   76/369ef4   1m 26s
trim5pMismatches    0   98/5fac9d   1m 26s
trim5pMismatches    0   61/4c81df   32.2s
samViewSort 0   7b/ab79f7   18s
samViewSort 0   97/ba7c78   1m 18s
outputBams  0   0d/268c39   385ms
bamToH5 0   99/7b5df7   9m 36s
makeBedgraphs   0   aa/87ad5e   8.8s
generateStatsFigs   0   1d/e17d58   5h 38m 25s
outputBams  0   c3/701527   1.8s
makeBedgraphs   0   eb/daf151   54.2s
bamToH5 0   8c/7c7cbe   10m 59s
generateStatsFigs   0   4f/d2692a   4h 30m 8s
hisat2rRNA  0   6a/7cde88   8m 26s
generateStatsFigs   0   ff/999eeb   5h 21m 56s
hisat2ORF   0   55/7847ad   4m 28s
trim5pMismatches    0   4e/0526f3   1m 57s
samViewSort 0   dd/f6946d   1m 37s
outputBams  0   d5/a79501   1.9s
bamToH5 0   1f/a0d76d   14m 22s
makeBedgraphs   0   c1/d2f700   1m 6s
generateStatsFigs   0   40/fbf219   5h 28m 15s
renameTpms  0   3a/b10216   161ms
staticHTML  0   3f/56bbf1   42.9s
renameTpms  0   a1/b5cbf5   161ms
staticHTML  0   4a/6a697c   23.4s
renameTpms  0   97/3ede1e   50ms
staticHTML  0   d7/2b3a6f   17.1s
renameTpms  0   19/d0c3ea   57ms
staticHTML  0   77/d71ba3   17.5s
collateTpms 0   97/3ab52b   1.3s
countReads  0   42/e8957b   12m 58s
Exception in thread "main" java.lang.OutOfMemoryError: Metaspace
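
To compare steps, it helps to convert the durations in this log to seconds. The sketch below is a hypothetical helper, `to_seconds`, that handles only the units seen above (h, m, s, ms); any other unit is silently ignored, so it is not a general parser for nextflow durations.

```shell
#!/bin/sh
# Sketch only: convert durations like "5h 38m 25s" or "140ms" to seconds.
to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
      # Test "ms" before "m" and "s", since "140ms" also ends in "s".
      if ($i ~ /ms$/)     total += substr($i, 1, length($i) - 2) / 1000
      else if ($i ~ /h$/) total += substr($i, 1, length($i) - 1) * 3600
      else if ($i ~ /m$/) total += substr($i, 1, length($i) - 1) * 60
      else if ($i ~ /s$/) total += substr($i, 1, length($i) - 1)
    }
    print total
  }'
}

# Values copied from the log above, for sample WT_Rep1.
stats_figs=$(to_seconds "5h 38m 25s")   # generateStatsFigs
bam_to_h5=$(to_seconds "17m 8s")        # bamToH5
echo "generateStatsFigs: ${stats_figs} s, bamToH5: ${bam_to_h5} s"
```

For WT_Rep1 this shows generateStatsFigs taking roughly twenty times as long as bamToH5, so generateStatsFigs accounts for most of the overnight run time.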

Download results and check the output.

$ export RIBOVIZ_SAMPLES=/exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019
$ OUTPUT_DATASTORE=/exports/csce/datastore/biology/groups/wallace_rna/data/2022/05-May/Edward/Bfr1_riboviz_2022-05-12
$ ls -l ${RIBOVIZ_SAMPLES}
total 4
drwx--S--- 2 ewallac2 datastore_biology_groups_wallace_rna 2048 May 11 22:40 index
lrwxrwxrwx 1 ewallac2 datastore_biology_groups_wallace_rna   78 May 11 15:52 input -> /exports/csce/eddie/biology/groups/wallace_rna/fastq-datafiles/CB-Sc-Bfr1_2019
drwx--S--- 6 ewallac2 datastore_biology_groups_wallace_rna  512 May 12 05:40 output
drwx--S--- 3 ewallac2 datastore_biology_groups_wallace_rna  512 May 11 18:58 output_sub_init100000
drwx--S--- 6 ewallac2 datastore_biology_groups_wallace_rna  512 May 11 22:51 tmp
$ du -sh ${RIBOVIZ_SAMPLES}/
1.3G    /exports/csce/eddie/biology/groups/wallace_rna/20221105_CB-Sc-Bfr1_2019/
$ cp -r ${RIBOVIZ_SAMPLES}/output_sub_init100000 ${OUTPUT_DATASTORE}
$ cp -r ${RIBOVIZ_SAMPLES}/output ${OUTPUT_DATASTORE}

This means we are done with Eddie for a bit and will inspect the output files locally.

We checked the outputs.

Need to push branch to example-datasets.

riboviz / example-datasets

Add new dataset Castells-Ballester et al 2019 Bfr1 S. cerevisiae #107
