Add Cryptococcus neoformans H99 dataflow from Wallace 2020 #3

ewallace commented 4 years ago

I will add the riboviz v1 config.yaml file for this in a branch, CryptococcusH99-3, and upload the organism-specific annotation files to use.

@XUEXUEXUE0 then needs to:

run the config.yaml through the updating script, see documentation
download the H99 ribosome profiling files to Eddie using fastq-dump or something else
test running riboviz on this data on Eddie
troubleshoot as necessary
review the riboviz outputs with me (@ewallace) and @FlicAnderson

Then we can try a similar approach with other datasets.

ewallace commented 4 years ago

Commit ed33cbe adds the necessary files in fungi/cryptococcus.

The Riboviz 1.1 config.yaml file is Wallace_2020_H99_RIBOVIZ_1p0_NEEDSUPDATING_config.yaml, which will need to be put through the updating script and have provenance information added. I tried to edit this to be minimally helpful:

removing absolute filepaths from the bifx cluster upon which I originally ran riboviz.
supplying the relevant SRA

Known problems:

unhelpful or inconsistent filenames
possibly unhelpful directory names
no provenance information
this has not been tested since October 2018, apparently

It will take some work to fix these. I suggest consulting the updates to README.md and the fungi-brar-2012-5 branch / #11 discussion for more clarification.

This commit also includes the JEC21 data for #4, including completely unedited (except for filename) Riboviz 1.1 config file Wallace_2020_JEC21_NEEDSCOMPLETEOVERHAUL_config.yaml. I suggest first getting the H99 data to work and nicely curated. Then fixing the JEC21 data.

XUEXUEXUE0 commented 4 years ago

Commits 88b26a2 and bc96424 add the provenance information, supply the relevant SRA, and update the filenames and filepaths.

Still need to add provenance information for the annotation files and contaminants file.

ewallace commented 4 years ago

I tried to review this and test running the config.yaml, but failed at the point of downloading SRA files. @XUEXUEXUE0 how did you get the files from SRA to Eddie? I put up a new issue, #13, to document fastq-dump use in general. The question of downloading with fastq-dump (or aspera connect, or whatever) on Eddie is specific to Edinburgh users on that cluster, so it is a separate issue.
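A minimal sketch of the download step being asked about, assuming the SRA Toolkit's `prefetch` and `fastq-dump` are on the PATH. It only builds the command lines and does not run them; the accession and output directory are placeholders, and whether these commands work from an Eddie node is exactly the open question in this comment.

```python
import shlex


def sra_download_commands(accessions, outdir):
    """Build (but do not run) prefetch and fastq-dump command lines."""
    commands = []
    for acc in accessions:
        # prefetch fetches the .sra archive for the accession.
        commands.append(["prefetch", acc])
        # fastq-dump converts it to FASTQ; --gzip compresses the output.
        commands.append(["fastq-dump", "--gzip", "--outdir", outdir, acc])
    return commands


if __name__ == "__main__":
    # Placeholder accession and directory; substitute the dataset's own.
    for cmd in sra_download_commands(["SRR0000000"], "input"):
        print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the accessions and target directory are confirmed.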

If the config.yamls are updated, then we need to change the names to reflect this, i.e. remove NEEDSCOMPLETEOVERHAUL and so on from filenames.
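As a rough illustration of the renaming, the sketch below strips the marker tokens from a filename. The function name and the marker list are assumptions made for this example; the final filenames chosen for the repository may well differ (for instance, the version token might change too).

```python
import re

# Marker tokens seen in the filenames in this thread.
MARKERS = ("NEEDSUPDATING", "NEEDSCOMPLETEOVERHAUL")


def strip_markers(filename, markers=MARKERS):
    """Remove '_MARKER' tokens from a filename, leaving the rest intact."""
    for marker in markers:
        filename = re.sub(r"_?" + re.escape(marker), "", filename)
    return filename


if __name__ == "__main__":
    print(strip_markers("Wallace_2020_JEC21_NEEDSCOMPLETEOVERHAUL_config.yaml"))
```

The returned name could then be applied with `git mv` so the rename is tracked in the branch history.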

(Yes, I still need to add provenance information for the annotation & contaminants.)

FlicAnderson commented 3 years ago

I've attempted to run the riboviz config upgrade script, upgrade_config_file.py, on the Cryptococcus yamls (H99 AND JEC21 #4) to make sure they contain all the right parameters in the right order. @XUEXUEXUE0 mentioned she'd done this by hand, and I noticed a few parameters were missing, which made it difficult to troubleshoot the Nextflow error I've been having while running these (riboviz/#202). I ran into an error when trying to use the upgrade_config_file tool, and have created issue riboviz/#203 to look into this.
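While the upgrade script is not working, a hand check of this kind can be sketched as below. This is not what upgrade_config_file.py does; it is a minimal stand-in that assumes the config has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`) and that a reference list of parameter names in the expected order is available. The key names in the usage example are illustrative only, not a full list of riboviz parameters.

```python
def check_config(config, reference_keys):
    """Return (missing, ordered) for a config dict against a reference key order."""
    missing = [key for key in reference_keys if key not in config]
    # Known keys first, in reference order, then any extra keys in their original order.
    ordered = {key: config[key] for key in reference_keys if key in config}
    ordered.update({key: value for key, value in config.items() if key not in ordered})
    return missing, ordered


if __name__ == "__main__":
    # Illustrative key names and values only.
    reference = ["dir_in", "dir_out", "fq_files"]
    config = {"fq_files": {}, "dir_in": "Wallace_2020_H99/input"}
    missing, ordered = check_config(config, reference)
    print("missing:", missing)
    print("order:", list(ordered))
```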

I did update the yaml file names and committed the change (d9648061), but this branch is still missing *_annotation_provenance.txt and *_contaminant_provenance.txt files to add details for the files (for example, see this example from Saccharomyces).

I think once we've got the .yaml configs updated, can run the dataset via Nextflow, and have those provenance.txt files in place, it's ready for review (ideally by running it with Nextflow on Eddie) and then a pull request.
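A quick pre-review check for the provenance files could look like the sketch below. The two glob patterns come from this thread; the function name and the directory passed in are assumptions for the example.

```python
from pathlib import Path

REQUIRED_PATTERNS = ("*_annotation_provenance.txt", "*_contaminant_provenance.txt")


def missing_provenance(directory, patterns=REQUIRED_PATTERNS):
    """Return the patterns that match no file anywhere under directory."""
    root = Path(directory)
    return [pattern for pattern in patterns if not any(root.rglob(pattern))]


if __name__ == "__main__":
    # Run from the repository root; the path is the folder named in this thread.
    print(missing_provenance("fungi/cryptococcus"))
```

An empty list means at least one file of each kind exists; it says nothing about whether the files are fully populated.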

ewallace commented 3 years ago

I haven't got back to this yet. @FlicAnderson have you run the dataset through yourself, with all the data files in place? Could you try it this week, or should I?

FlicAnderson commented 3 years ago

created the *_annotation_provenance.txt and *_contaminant_provenance.txt files, although I haven't populated them with complete info, as I need to find out more from @ewallace and/or the publication about how these files were obtained, how the transcriptome file was created etc. I've added some placeholder information with gaps to fill in, so this should be easier and should follow the info we have for the Brar 2012 Saccharomyces files too.
updated the .yaml paths to follow example-datasets/fungi/cryptococcus/annotations or /contaminants for the relevant .fa / .gff files, and added a note about creating an /input folder inside a folder named after the dataset (e.g. Wallace_2020_JEC21) for the input .fastq files, as this is currently what's expected by the yaml and riboviz. This can be edited, and I'm not sure it's the easiest thing for users? Perhaps better to create the Wallace_2020_JEC21/input folder and leave it (empty), and direct users to download files to there, to make sure it'll match the yaml nicely? Thoughts welcome.

Next Steps: Tomorrow I'll try this on Eddie, and read through your updates re: comments on issue #59 again and check out your info for issue riboviz/#207 on how Nextflow might be able to do some of the SRA file downloading stuff. I'll also consider file locations best practices on Eddie and make notes for documentation updating.

FlicAnderson commented 3 years ago

Just a note to clarify that this work on issue #3 includes #4 as the files are in the same cryptococcus folder, with separate .yaml files pointing to their respective annotation/contaminant files, so progress on #3 also progresses #4.

kavousan commented 3 years ago

Thanks for your help with this, @XUEXUEXUE0 and good luck in the future!

FlicAnderson commented 3 years ago

H99 job completed successfully on Eddie!

Job start info:

Job 7139269 (W-Cn-H99_2020) Started
 User       = fanders6
 Queue      = eddie
 Host       = node3g12.ecdf.ed.ac.uk
 Start Time = 09/26/2020 02:39:07.376

Job finish info:

Job 7139269 (W-Cn-H99_2020) Complete
 User             = fanders6
 Queue            = eddie@node3g12.ecdf.ed.ac.uk
 Host             = node3g12.ecdf.ed.ac.uk
 Start Time       = 09/26/2020 02:39:07.838
 End Time         = 09/26/2020 11:20:53.966
 User Time        = 1:07:00:41
 System Time      = 1:13:37:11
 Wallclock Time   = 08:41:46
 CPU              = 2:20:37:53
 Max vmem         = 93.495G
 Max rss          = NA
 Exit Status      = 0
# YASSSSS

Output of my submission script.output:

[fanders6@login02(eddie) ~]$ cat W-Cn-H99_2020-7139269-node3g12.ecdf.ed.ac.uk.o
Running riboviz on dataset: Wallace_2020_H99
/exports/eddie/scratch/fanders6/Wallace_2020_H99/input

2020-09-26T01:39:39 prefetch.2.10.8: 2) 'SRR9336391' is found locally

2020-09-26T01:39:54 prefetch.2.10.8: 3) 'SRR9336393' is found locally

2020-09-26T01:39:55 prefetch.2.10.8: 4) 'SRR9336395' is found locally
hopefully downloaded and pigz'd the files into /exports/eddie/scratch/fanders6/Wallace_2020_H99/input
moved to /home/fanders6/riboviz/riboviz
now in folder: /home/fanders6/riboviz/riboviz ready to run
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [exotic_shockley] - revision: e273db8045
Validating configuration only
Validated configuration
N E X T F L O W  ~  version 20.04.1
Launching `prep_riboviz.nf` [clever_celsius] - revision: e273db8045
[21/24983d] Submitted process > buildIndicesORF (H99_CDS_with_120bputrs)
[9a/384065] Submitted process > cutAdapters (HdAGO1)
[02/1deaed] Submitted process > cutAdapters (H99r2)
[1c/0ad4fa] Submitted process > cutAdapters (H99r1)
[a6/383b57] Submitted process > cutAdapters (HdGWO1)
[86/b1aeb2] Submitted process > buildIndicesrRNA (H99_rRNA)
[be/542177] Submitted process > hisat2rRNA (H99r1)
[aa/33543f] Submitted process > hisat2rRNA (H99r2)
[f4/716d63] Submitted process > hisat2rRNA (HdAGO1)
[f0/951c25] Submitted process > hisat2rRNA (HdGWO1)
[e9/3624b7] Submitted process > hisat2ORF (H99r2)
[41/00fa37] Submitted process > hisat2ORF (H99r1)
[08/f1cc24] Submitted process > hisat2ORF (HdAGO1)
[31/6723f7] Submitted process > hisat2ORF (HdGWO1)
[e4/5ec5bc] Submitted process > trim5pMismatches (H99r2)
[f7/5f627c] Submitted process > trim5pMismatches (H99r1)
[b6/176e36] Submitted process > samViewSort (H99r1)
[8b/f7f0ce] Submitted process > samViewSort (H99r2)
[53/441985] Submitted process > outputBams (H99r1)
[71/1dfa27] Submitted process > makeBedgraphs (H99r1)
[e0/d155be] Submitted process > bamToH5 (H99r1)
[d5/7c904e] Submitted process > trim5pMismatches (HdAGO1)
[f7/789ac3] Submitted process > trim5pMismatches (HdGWO1)
[05/0db017] Submitted process > outputBams (H99r2)
[8a/a7a281] Submitted process > bamToH5 (H99r2)
[c7/26e9a0] Submitted process > makeBedgraphs (H99r2)
[14/0859e0] Submitted process > generateStatsFigs (H99r1)
[5a/734d6f] Submitted process > generateStatsFigs (H99r2)
[74/ea199a] Submitted process > samViewSort (HdAGO1)
[35/772599] Submitted process > samViewSort (HdGWO1)
Finished processing sample: H99r1
[93/8ca92f] Submitted process > renameTpms (H99r1)
Finished processing sample: H99r2
[79/011487] Submitted process > renameTpms (H99r2)
[16/b847d6] Submitted process > outputBams (HdAGO1)
[a6/6b7e91] Submitted process > makeBedgraphs (HdAGO1)
[b9/e118a1] Submitted process > bamToH5 (HdAGO1)
[1b/ccc023] Submitted process > outputBams (HdGWO1)
[71/ca53b8] Submitted process > makeBedgraphs (HdGWO1)
[8d/0a05cd] Submitted process > bamToH5 (HdGWO1)
[db/fa1a34] Submitted process > generateStatsFigs (HdAGO1)
[b0/ca2e4f] Submitted process > generateStatsFigs (HdGWO1)
Finished processing sample: HdAGO1
[9d/f20658] Submitted process > renameTpms (HdAGO1)
Finished processing sample: HdGWO1
[e8/29be67] Submitted process > renameTpms (HdGWO1)
[a0/e320a7] Submitted process > collateTpms (H99r1, H99r2, HdAGO1, HdGWO1)
[76/65967b] Submitted process > countReads
Workflow finished! (OK)
nextflow riboviz Wallace_2020_H99 data run complete

Output of submission script .error

[fanders6@login02(eddie) ~]$ cat W-Cn-H99_2020-7139269-node3g12.ecdf.ed.ac.uk.e
WARNING: If you use conda to create environments, your home directory may fill up. Please see our documentation at 
 https://www.wiki.ed.ac.uk/display/ResearchServices/Anaconda for advice.
2020-09-26T01:39:36 prefetch.2.10.8 int: connection not found while validating within network system module - cannot open remote file: https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos1/sra-pub-run-16/SRR9336391/SRR9336391.1
2020-09-26T01:39:50 prefetch.2.10.8 int: connection not found while validating within network system module - cannot open remote file: https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009117/SRR9336393
spots read      : 138,051,092
reads read      : 138,051,092
reads written   : 138,051,092
spots read      : 172,330,085
reads read      : 172,330,085
reads written   : 172,330,085
spots read      : 160,189,934
reads read      : 160,189,934
reads written   : 160,189,934
spots read      : 169,471,324
reads read      : 169,471,324
reads written   : 169,471,324

quota results AFTER run:

[fanders6@login02(eddie) ~]$ quota
----------------------------------------------------------------
                 DISK QUOTA AND USAGE                              
/home/fanders6:
     1251.62 MB of 10240.00 MB (12.22%) used 

/exports/eddie/scratch/fanders6:
     813.28 GB of 2048.00 GB (39.71%) used 

/exports/csce/eddie/biology/groups/wallace_rna:
     48.97 GB of 200.00 GB (24.48%) used 
----------------------------------------------------------------

Disk usage outputs for the run folder (on my scratch space):

[fanders6@login02(eddie) ~]$ du -ch /exports/eddie/scratch/fanders6/Wallace_2020_H99/
1.9G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/H99r1
2.9G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/HdGWO1
2.3G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/H99r2
2.5G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/HdAGO1
9.4G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output
77G /exports/eddie/scratch/fanders6/Wallace_2020_H99/tmp/H99r1
104G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/tmp/HdGWO1
94G /exports/eddie/scratch/fanders6/Wallace_2020_H99/tmp/H99r2
100G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/tmp/HdAGO1
373G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/tmp
80M /exports/eddie/scratch/fanders6/Wallace_2020_H99/index
27G /exports/eddie/scratch/fanders6/Wallace_2020_H99/input
134M    /exports/eddie/scratch/fanders6/Wallace_2020_H99/work/c7/26e9a0237235208a4a9d9a0f1fae98
# ... lots more nextflow work folders...
2.1M    /exports/eddie/scratch/fanders6/Wallace_2020_H99/work/db
382G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/work
791G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/
791G    total

:exclamation: 791GB?!?!? :interrobang: This has escalated A LOT from 27GB input files...

Checking input size vs just output folder:

[fanders6@login02(eddie) ~]$ du -h /exports/eddie/scratch/fanders6/Wallace_2020_H99/input/
27G /exports/eddie/scratch/fanders6/Wallace_2020_H99/input/

[fanders6@login02(eddie) ~]$ du -ch /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/
1.9G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/H99r1
2.9G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/HdGWO1
2.3G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/H99r2
2.5G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/HdAGO1
9.4G    /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/
9.4G    total

27GB input + 9.4GB output isn't too bad, as long as it's safe to just ignore everything else... :see_no_evil:
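To put a number on that, here is a small sketch that converts the `du -h` style sizes from the listing above into bytes and works out what share of the run folder is input plus output. The function names are made up for this example, and it assumes `du -h` units are powers of 1024. Whether tmp and work are safe to delete is not settled here; note that Nextflow's work folder is what `-resume` relies on.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `du -h` style size such as '9.4G' or '80M' to bytes."""
    text = text.strip()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def keep_fraction(sizes, keep=("input", "output")):
    """Fraction of the summed sizes taken up by the folders named in keep."""
    total = sum(parse_size(size) for size in sizes.values())
    kept = sum(parse_size(sizes[name]) for name in keep)
    return kept / total


# Top-level folder sizes copied from the du listing above.
sizes = {"output": "9.4G", "tmp": "373G", "index": "80M", "input": "27G", "work": "382G"}
print(f"{keep_fraction(sizes):.1%} of the run folder is input + output")
```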

I compared the outputs to some from Siyin which had been created previously, and they seem to match (just different order).

# checked the data already in the folder from Siyin's run: 
[fanders6@login02(eddie) ~]$ ls /exports/csce/eddie/biology/groups/wallace_rna/Wallace_2020_H99/
H99r1  H99r2  HdAGO1  HdGWO1  TPMs_collated.tsv
[fanders6@login02(eddie) ~]$ head /exports/csce/eddie/biology/groups/wallace_rna/Wallace_2020_H99/TPMs_collated.tsv 
# Created by: RiboViz
# Date: 2020-06-21 11:41:34
# File: /exports/eddie/scratch/s1919303/riboviz/rscripts/collate_tpms.R
# Version: commit 931e95449d3017992d6c07f5a0c156605c9c6ece date 2020-06-04 14:57:22 GMT
ORF     H99r1   HdGWO1  H99r2   HdAGO1
CNAG_00002  67.3    62.2    53  59.9
CNAG_00003  0.2 0.1 0.1 0.1
CNAG_00004  6.9 6.1 6.4 7
CNAG_00005  8.6 7.5 8.6 9.7
CNAG_00006  166.6   172.3   184.8   187.5

# checked against my run. Same numbers, different sample order. Not sure exactly why, but the same numbers for the same samples are reassuring.
[fanders6@login02(eddie) ~]$ head /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/TPMs_collated.tsv 
# Created by: RiboViz
# Date: 2020-09-26 07:06:50
# File: /exports/eddie3_homes_local/fanders6/riboviz/riboviz/rscripts/collate_tpms.R
# Version: commit 0f4e932d8c7de032d66f68048989b182730e7d49 date 2020-09-22 15:50:23 GMT
ORF     H99r1   H99r2   HdAGO1  HdGWO1
CNAG_00002  67.3    53  59.9    62.2
CNAG_00003  0.2 0.1 0.1 0.1
CNAG_00004  6.9 6.4 7   6.1
CNAG_00005  8.6 8.6 9.7 7.5
CNAG_00006  166.6   184.8   187.5   172.3
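The by-eye comparison of the two `head` outputs could be extended to the whole file with something like the sketch below. It assumes the collated file is tab-separated, has a header row whose first column is `ORF`, and marks its provenance lines with a leading `#`, as in the output shown here; the function names are invented for the example.

```python
import csv
import io


def read_tpms(text):
    """Parse a collated TPM table into {sample: {orf: value}}, skipping '#' lines."""
    lines = [line for line in text.splitlines() if line and not line.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    table = {}
    for row in reader:
        orf = row.pop("ORF")
        for sample, value in row.items():
            table.setdefault(sample, {})[orf] = float(value)
    return table


def same_tpms(text_a, text_b):
    """True if both tables hold the same values per sample, whatever the column order."""
    return read_tpms(text_a) == read_tpms(text_b)
```

Reading each TPMs_collated.tsv with `Path(...).read_text()` and passing both strings to `same_tpms` would confirm whether every ORF matches, not just the first five rows.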

I've copied the output folder across to the Wallace Lab group folder on Eddie:

# copied the data across to group drive: 
[fanders6@login02(eddie) ~]$ cp -r /exports/eddie/scratch/fanders6/Wallace_2020_H99/output/ /exports/csce/eddie/biology/groups/wallace_rna/20200925_W-Cn-H99_2020

YAY :fireworks:
