Update: I did not find a fastq-dump module on Eddie. I asked on the Edinburgh bioinformatics Slack what others do to get data from SRA or ENA onto Eddie.
Actually, Eddie does have this tool. Load the module with this command:
$ module load igmm/apps/sratoolkit/2.8.2-1
Then you can use fastq-dump. But I found fastq-dump too slow for downloading datasets like H99 and JEC21, which are around 50GB before compression. Even with the --gzip option, which downloads the .gz file directly, it is still too slow.
I think the fastest way to download datasets is to use fasterq-dump and the Aspera client. The version of the SRA Toolkit on Eddie is old and does not include fasterq-dump. So I first installed the latest version of the SRA Toolkit following the documentation at https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit, then used the following commands to get the dataset. It is really fast and takes only around 7 min to get a 50GB file:
#prefetch with Aspera client
$ prefetch SRR9620588
$ fasterq-dump SRR9620588
But fasterq-dump does not have a --gzip option, so you have to compress the datasets after you download them. I use pigz to compress these datasets.
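The compression step could look something like the sketch below. It is only an illustration: the file name SRR_example.fastq and the temporary directory are hypothetical stand-ins for real fasterq-dump output, the thread count of 4 is an arbitrary choice, and the fallback to gzip is an addition for machines where pigz is not installed.

```shell
set -eu

# Hypothetical stand-in for a FASTQ file written by fasterq-dump.
workdir="$(mktemp -d)"
fastq="$workdir/SRR_example.fastq"
printf '@read1\nACGT\n+\nIIII\n' > "$fastq"

# Prefer pigz (parallel gzip) when installed; otherwise fall back to gzip.
if command -v pigz >/dev/null 2>&1; then
    pigz -p 4 "$fastq"   # -p sets the number of compression threads (4 is arbitrary)
else
    gzip "$fastq"
fi

# Either tool replaces the original file with a .gz file of the same name.
ls "$workdir"
```

On Eddie the number of threads would presumably need to match the cores requested for the job.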
So when you ran the Cryptococcus example datasets on Eddie for #3, what exactly did you do? Did you change the input path in the config.yaml? Did you make symlinks between a Wallace_2020_H99/input directory and the fastq files on scratch space?
This is an important practical consideration, related to riboviz#59.
The config.yaml I used when I ran the datasets is exactly the same as the one I committed to GitHub.
In this case, first create Wallace_2020_H99 in /exports/eddie/scratch/s1919303/riboviz/.
Create input in Wallace_2020_H99 and put the SRR files in input.
Create contaminants in Wallace_2020_H99 and put the contaminants files in contaminants.
Create annotation in Wallace_2020_H99 and put the annotation files in annotation.
Put your config.yaml in Wallace_2020_H99.
Finally, run python -m riboviz.tools.prep_riboviz -c Wallace_2020_H99/<Your config.yaml> in the riboviz directory.
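As a rough sketch, the layout described above could be created like this. The temporary directory is a hypothetical stand-in for the scratch area, and the empty config.yaml is only a placeholder for the real file; copying in the actual reads, contaminants and annotation files is not shown.

```shell
set -eu

# Hypothetical stand-in for the scratch area (/exports/eddie/scratch/<user>/riboviz/).
scratch="$(mktemp -d)"
dataset="$scratch/Wallace_2020_H99"

# One subdirectory each for the reads, the contaminants files and the annotation files.
mkdir -p "$dataset/input" "$dataset/contaminants" "$dataset/annotation"

# Placeholder for the real config.yaml, which sits alongside the subdirectories.
touch "$dataset/config.yaml"

ls "$dataset"
```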
Moved to To do pending @ewallace review of #59; @FlicAnderson to also test this.
Just pinging this to myself @FlicAnderson re: example-datasets/#3 work still to do, to remind me what @XUEXUEXUE0 did previously.
At the 9 Sept dev meeting, @mikej888 suggested using relative paths in the config.yaml file, and also creating symlinks that point to the absolute paths. I don't know how to do this, but it sounds like a good idea, so we should test and document it. To discuss with @FlicAnderson next week.
I tried a simple example on my local machine. When I originally tried this, it revealed a bug in generate_stats_figs.R and how it imports other R scripts (see "generate_stats_figs.R cannot find R scripts to import if used with Nextflow and non-local work directory", riboviz/riboviz#210). The following assumes that the bug fix, currently in the nextflow-210 branch, has been applied.
Create scratch directories:
$ mkdir -p /tmp/scratch/data
$ mkdir -p /tmp/scratch/input
Copy data over:
$ cp -r data/ /tmp/scratch/
$ cp -r vignette/input/ /tmp/scratch/
Create a symbolic link, in $HOME/riboviz/, called local, which links to the scratch directory:
$ ln -s /tmp/scratch/ local
$ ls local
data input
$ ls -l local
lrwxrwxrwx 1 ubuntu ubuntu 13 Sep 9 09:26 local -> /tmp/scratch/
Note that the symbolic link is shown.
Copy vignette/vignette_config.yaml to local_config.yaml and edit it to refer to files in the relative directory local:
asite_disp_length_file: local/data/yeast_standard_asite_disp_length.txt
codon_positions_file: local/data/yeast_codon_pos_i200.RData
dir_index: local/index
dir_in: local/input
dir_logs: local/logs
dir_out: local/output
dir_tmp: local/tmp
features_file: local/data/yeast_features.tsv
orf_fasta_file: local/input/yeast_YAL_CDS_w_250utrs.fa
orf_gff_file: local/input/yeast_YAL_CDS_w_250utrs.gff3
rrna_fasta_file: local/input/yeast_rRNA_R64-1-1.fa
t_rna_file: local/data/yeast_tRNAs.tsv
Run Nextflow to validate the configuration. Note that by default Nextflow writes its intermediate results to a work/ directory, relative to the directory in which Nextflow is invoked. Here, Nextflow's -work-dir flag is used to tell Nextflow where to put these results, in this case within /tmp/scratch/work (one could use local/work here instead too).
$ nextflow run prep_riboviz.nf -params-file local_config.yaml -work-dir /tmp/scratch/work -ansi-log false --validate_only
...
Validated configuration
Run Nextflow workflow:
$ nextflow run prep_riboviz.nf -params-file local_config.yaml -work-dir /tmp/scratch/work -ansi-log false
...
Workflow finished! (OK)
$ ls local
data index input output tmp work
Check results:
$ pytest riboviz/test/regression/test_regression.py --expected=$HOME/regression-test-data-2.0/ --skip-workflow --nextflow --config-file=local_config.yaml
...
================== 40 passed, 48 skipped, 1 warning in 1.41s ===================
Python workflow example:
$ rm -rf local/work local/index/ local/output local/tmp
$ python -m riboviz.tools.prep_riboviz -c local_config.yaml
$ pytest riboviz/test/regression/test_regression.py --expected=$HOME/regression-test-data-2.0/ --skip-workflow --config-file=local_config.yaml
...
================== 40 passed, 48 skipped, 1 warning in 1.36s ===================
This is great! Wouldn't it be easier to keep the annotation files in data wherever it is though (vignette or example-datasets/...), and just symlink to those with ln -s? I'm thinking about the "information should only be in one place" principle, but that might be overkill.
Yes. Or one could leave the input files (data/ and vignette/input/ etc.) where they are and leave their configuration as-is. One can then use symbolic links only for the index, tmp and output directories.
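A minimal sketch of that second option follows. Both temporary directories are hypothetical stand-ins (one for scratch space, one for the riboviz checkout), and it assumes the link names match whatever relative paths dir_index, dir_tmp and dir_out have in the configuration file being used.

```shell
set -eu

# Hypothetical stand-ins: "$scratch" for scratch space, "$repo" for the riboviz checkout.
scratch="$(mktemp -d)"
repo="$(mktemp -d)"

for d in index tmp output; do
    mkdir -p "$scratch/$d"          # real directory on scratch
    ln -s "$scratch/$d" "$repo/$d"  # link with the same name inside the checkout
done

# Each entry is shown as a link pointing into the scratch directory.
ls -l "$repo"
```

With this arrangement the input files and their configuration entries stay untouched; only the directories the workflow writes to live on scratch.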
Followed your instructions locally @mikej888 and got the same results. Super handy primer on how symlinks work, how to make them, and what they're handy for.
Currently working through how best to set things up to run on Eddie using this approach for #3 and #4.
Ran the symlink'd vignette example on my home directory on Eddie (in an interactive node currently since the vignette is a short run even if it works!), with input and data files copied across to my scratch space on there, then symlinked back into $HOME/riboviz/local folder as described.
Success! Regression test says so: 40 passed, 48 skipped, 1 warning in 3.41s :cake:
Next steps: try with larger data (Cryptococcus #3 and #4 here I come...) and potentially see if I can do separate symlinks for the example-dataset files (annotation and contaminants) and the sample reads files (.fastq.gz) to see if this is feasible.
And, naturally, document everything that has worked!
After much struggling, I've realised that I was trying to do something unnecessary and probably getting it wrong.
My Nextflow attempts, even just at Nextflow validation, were failing because I'd been trying to set up symbolic links between the files in the example-datasets repo and a folder I'd created which symlinked to a place on scratch (which held input .fastq files). This didn't work, in a number of exotic and exciting ways, including infinite folders like some horrendous hall of mirrors.
Turns out that there was no reason for this madness, and my proof-of-concept Nextflow vignette with symbolic links works if I don't attempt that.
I've created a test job submission for the vignette (nextflow method), using an edited vignette. The resource allocation settings haven't yet been finessed, so I think these are still a bit much, but it works without aborting.
run_nextflow-vignette.txt local_vignette_config.txt
Completed successfully:
Queue = eddie@node3g14.ecdf.ed.ac.uk
Host = node3g14.ecdf.ed.ac.uk
Start Time = 09/26/2020 01:21:51.501
End Time = 09/26/2020 01:24:19.233
User Time = 00:04:03
System Time = 00:00:17
Wallclock Time = 00:02:27
CPU = 00:04:21
Max vmem = 42.301G
Max rss = NA
Exit Status = 0
This should be closed after PR for riboviz/#59 since that's the best place to put the information into the documentation I think.
Added information from this issue to the documentation in run-on-eddie.md and submitted PR for riboviz/#59
In order to practically run any example dataset, everything needs to be in the right place relative to the command prompt. That means:
prep_riboviz.nf also needs to be called from the correct relative/absolute path
All this needs documentation.
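One way to check this before launching a run might be a small script like the sketch below, run from the directory where the workflow will be invoked. Everything in it is hypothetical: the two-line configuration, the file name example.fa and the temporary run directory are made up for illustration, and it only handles simple "key: value" lines. Output directories are left out because they may legitimately not exist before a run.

```shell
set -eu

# Hypothetical run directory with one referenced path present and one absent.
rundir="$(mktemp -d)"
mkdir -p "$rundir/local/input"
printf 'dir_in: local/input\norf_fasta_file: local/input/example.fa\n' > "$rundir/local_config.yaml"

# Paths are resolved relative to the directory the command is run from.
cd "$rundir"
missing=""
while read -r key value; do
    if [ -e "$value" ]; then
        echo "found:   ${key%:} -> $value"
    else
        echo "missing: ${key%:} -> $value"
        missing="$missing ${key%:}"
    fi
done < local_config.yaml

echo "keys with missing paths:$missing"
```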