riboviz / example-datasets

Example datasets to run with RiboViz
Apache License 2.0

Explain paths and fastq-dump #13

Closed ewallace closed 3 years ago

ewallace commented 4 years ago

In order to practically run any example dataset, everything needs to be in the right place relative to the command prompt. That means:

All this needs documentation.

ewallace commented 4 years ago

Update: I did not find a fastq-dump module on Eddie. I asked on the Edinburgh bioinformatics Slack what others do to get data from SRA or ENA onto Eddie.

XUEXUEXUE0 commented 4 years ago

Actually, Eddie has this tool. Run $ module load igmm/apps/sratoolkit/2.8.2-1 and then you can use fastq-dump. But I found fastq-dump too slow for downloading datasets like H99 and JEC21, which are around 50GB before compression. Even with the --gzip option, which gives the .gz file directly, it is still too slow.

I think the fastest way to download datasets is to use fasterq-dump and the Aspera client.

The version of the SRA Toolkit on Eddie is old and does not include fasterq-dump.

So I first installed the latest version of the SRA Toolkit following the documentation at https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit, then used the following commands to get the dataset. It is really fast and takes only around 7 min to get a 50GB file.

#prefetch with Aspera client
$ prefetch SRR9620588
$ fasterq-dump SRR9620588

But fasterq-dump does not have the --gzip option, so you have to compress the files after downloading them. I used pigz to compress these datasets.
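The download-then-compress steps above could be wrapped in one function. This is only a sketch, not what was actually run on Eddie: the function name fetch_and_compress, the output-directory argument and the default thread count are illustrative assumptions, and it assumes a recent SRA Toolkit and pigz are already on the PATH.

```shell
# Hypothetical helper: download one SRA run and gzip the resulting FASTQ files.
# Usage: fetch_and_compress ACCESSION [OUTDIR] [THREADS]
fetch_and_compress() {
    accession="$1"
    outdir="${2:-.}"
    threads="${3:-4}"

    # Fail early if any of the required tools is missing.
    for tool in prefetch fasterq-dump pigz; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "error: $tool not found on PATH" >&2
            return 1
        fi
    done

    mkdir -p "$outdir" || return 1
    # prefetch downloads the .sra file first; fasterq-dump then converts it.
    prefetch "$accession" || return 1
    fasterq-dump "$accession" --outdir "$outdir" --threads "$threads" || return 1

    # fasterq-dump has no --gzip option, so compress afterwards with pigz.
    for f in "$outdir/$accession"*.fastq; do
        [ -e "$f" ] || continue
        pigz -p "$threads" "$f" || return 1
    done
}
```

For example, fetch_and_compress SRR9620588 /path/to/scratch 8 would leave the .fastq.gz files in /path/to/scratch (a placeholder path).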

ewallace commented 4 years ago

So when you ran the Cryptococcus example datasets on Eddie for #3, what exactly did you do? Did you change the input path in the config.yaml? Did you make symlinks between a Wallace_2020_H99/input directory and the fastq files on scratchspace?

This is an important practical consideration, related to riboviz#59.

XUEXUEXUE0 commented 4 years ago

The config.yaml I used when I ran the datasets is exactly the same as the one I committed to GitHub.

In this case

Finally, run python -m riboviz.tools.prep_riboviz -c Wallace_2020_H99/< Your config.yaml> in the riboviz directory.

kavousan commented 4 years ago

Moved to To do pending @ewallace review of #59; @FlicAnderson to also test this.

FlicAnderson commented 4 years ago

Just pinging this to myself @FlicAnderson re: example-datasets/#3 work still to do, to remind me what @XUEXUEXUE0 did previously.

ewallace commented 4 years ago

At the 9 Sept dev meeting, @mikej888 suggested using relative paths in the config.yaml file and creating symlinks that point to the absolute paths. I don't know how to do this, but it sounds like a good idea, so we should test and document it. To discuss with @FlicAnderson next week.

mikej888 commented 4 years ago

I tried a simple example on my local machine. When I originally tried this, it revealed a bug in how generate_stats_figs.R imports other R scripts (see riboviz/riboviz#210, "generate_stats_figs.R cannot find R scripts to import if used with Nextflow and non-local work directory"). The following assumes that the bug fix, currently in the nextflow-210 branch, has been applied.

Create scratch directories:

$ mkdir -p /tmp/scratch/data
$ mkdir -p /tmp/scratch/input

Copy data over:

$ cp -r data/ /tmp/scratch/
$ cp -r vignette/input/ /tmp/scratch/

Create a symbolic link, in $HOME/riboviz/, called local, which links to the scratch directory:

$ ln -s /tmp/scratch/ local
$ ls local
data  input
$ ls -l local
lrwxrwxrwx 1 ubuntu ubuntu 13 Sep  9 09:26 local -> /tmp/scratch/

Note that the symbolic link is shown.
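The same check can be done from a script rather than by reading ls -l output. A minimal sketch, using throwaway directories in place of /tmp/scratch and $HOME/riboviz (the variable names are illustrative only):

```shell
# Sketch: confirm that a path is a symlink and see where it points.
scratch=$(mktemp -d)   # stands in for /tmp/scratch
workdir=$(mktemp -d)   # stands in for $HOME/riboviz
mkdir -p "$scratch/data" "$scratch/input"

ln -s "$scratch" "$workdir/local"

# [ -L ] is true only for symbolic links; [ -d ] follows the link to its target.
if [ -L "$workdir/local" ] && [ -d "$workdir/local" ]; then
    echo "local is a symlink to a directory"
fi

# readlink prints the target stored in the link itself.
link_target=$(readlink "$workdir/local")
echo "local -> $link_target"
```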

Copy vignette/vignette_config.yaml to local_config.yaml and edit to refer to files in the relative, local, directory:

asite_disp_length_file: local/data/yeast_standard_asite_disp_length.txt

codon_positions_file: local/data/yeast_codon_pos_i200.RData

dir_index: local/index
dir_in: local/input
dir_logs: local/logs
dir_out: local/output
dir_tmp: local/tmp

features_file: local/data/yeast_features.tsv

orf_fasta_file: local/input/yeast_YAL_CDS_w_250utrs.fa
orf_gff_file: local/input/yeast_YAL_CDS_w_250utrs.gff3

rrna_fasta_file: local/input/yeast_rRNA_R64-1-1.fa

t_rna_file: local/data/yeast_tRNAs.tsv
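Before running anything, it may help to confirm that the input paths in the edited config actually resolve through the link. The following is a rough sketch, not part of riboviz: check_config_paths is a hypothetical helper, it only understands flat "key: value" lines like those above (it is not a YAML parser), and it assumes it is run from the directory the paths are relative to.

```shell
# Hypothetical helper: list input paths named in a config that do not exist.
# Checks keys ending in _file, plus dir_in. The other dir_* entries are
# output locations rather than inputs, so they are not checked.
check_config_paths() {
    config="$1"
    rc=0
    paths=$(awk -F': *' '/^[a-z_]+_file:/ || /^dir_in:/ { print $2 }' "$config")
    while IFS= read -r p; do
        [ -n "$p" ] || continue
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            rc=1
        fi
    done <<EOF
$paths
EOF
    return "$rc"
}
```

Running check_config_paths local_config.yaml from $HOME/riboviz prints one "missing: ..." line per unresolved path and returns non-zero if any were found.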

Run Nextflow to validate configuration. Note that by default Nextflow uses a local work/ directory to write its intermediate results to, which is relative to the current directory in which Nextflow is invoked. Here, Nextflow's -work-dir flag is used to instruct Nextflow where to put these results, in this case within /tmp/scratch/work (one could use local/work here instead too).

$ nextflow run prep_riboviz.nf -params-file local_config.yaml -work-dir /tmp/scratch/work -ansi-log false --validate_only
...
Validated configuration

Run Nextflow workflow:

$ nextflow run prep_riboviz.nf -params-file local_config.yaml -work-dir /tmp/scratch/work -ansi-log false
...
Workflow finished! (OK)
$ ls local
data  index  input  output  tmp  work

Check results:

$ pytest riboviz/test/regression/test_regression.py --expected=$HOME/regression-test-data-2.0/  --skip-workflow --nextflow --config-file=local_config.yaml
...
================== 40 passed, 48 skipped, 1 warning in 1.41s ===================

Python workflow example:

$ rm -rf local/work local/index/ local/output local/tmp 
$ python -m riboviz.tools.prep_riboviz -c local_config.yaml
$ pytest riboviz/test/regression/test_regression.py --expected=$HOME/regression-test-data-2.0/  --skip-workflow --config-file=local_config.yaml
...
================== 40 passed, 48 skipped, 1 warning in 1.36s ===================

ewallace commented 4 years ago

This is great! Wouldn't it be easier to keep the annotation files in data wherever it is though (vignette or example-datasets/...), just symlink to those with ln -s? I'm thinking about the "information should only be in one place" principle, but that might be overkill.

mikej888 commented 4 years ago

Yes. Or one could leave the input files (data/ and vignette/input/ etc.) where they are and leave their configuration as-is.

One can use symbolic links only for the index, tmp and output directories.
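A minimal sketch of that second option, with throwaway directories standing in for the scratch space and the run directory (SCRATCH and RUN_DIR are placeholder names, not riboviz settings):

```shell
# Sketch: inputs stay where they are; only the output directories live on scratch.
SCRATCH=$(mktemp -d)   # stands in for a directory on scratch space
RUN_DIR=$(mktemp -d)   # stands in for the directory the workflow is run from

for d in index tmp output; do
    mkdir -p "$SCRATCH/$d"
    # The link sits in RUN_DIR; anything written through it lands on SCRATCH.
    ln -s "$SCRATCH/$d" "$RUN_DIR/$d"
done

ls -l "$RUN_DIR"
```

The dir_index, dir_tmp and dir_out entries in the config would then name these links, while the input and data paths stay unchanged.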

FlicAnderson commented 4 years ago

Followed your instructions locally @mikej888 and got the same results. Super handy primer on how symlinks work, how to make them, and what they're handy for.

Currently working through how best to set things up to run on Eddie using this approach for #3 and #4.

FlicAnderson commented 4 years ago

Ran the symlinked vignette example from my home directory on Eddie (on an interactive node for now, since the vignette is a short run), with the input and data files copied across to my scratch space there, then symlinked back into the $HOME/riboviz/local folder as described.

Success! Regression test says so: 40 passed, 48 skipped, 1 warning in 3.41s :cake:

Next steps: try with larger data (Cryptococcus #3 and #4 here I come...) and potentially see if I can do separate symlinks for the example-dataset files (annotation and contaminants) and the sample reads files (.fastq.gz) to see if this is feasible.

And, naturally, document everything that has worked!

FlicAnderson commented 3 years ago

After much struggling, I've realised that I was trying to do something unnecessary and probably getting it wrong.

My nextflow attempts, even just nextflow validation, were failing because I'd been trying to set up symbolic links between the files in the example-datasets repo and a folder I'd created which itself symlinked to a place on scratch (which held the input .fastq files). This didn't work, in a number of exotic and exciting ways, including infinite folders like some horrendous hall of mirrors.

Turns out that there was no reason for this madness, and my proof-of-concept nextflow vignette with symbolic links works if I don't attempt it.

I've created a test job submission for the vignette (nextflow method), using an edited vignette. The resource allocation settings haven't yet been finessed, so I think these are still a bit much, but it works without aborting.

run_nextflow-vignette.txt local_vignette_config.txt

Completed successfully:

Queue = eddie@node3g14.ecdf.ed.ac.uk
Host = node3g14.ecdf.ed.ac.uk
Start Time = 09/26/2020 01:21:51.501
End Time = 09/26/2020 01:24:19.233
User Time = 00:04:03
System Time = 00:00:17
Wallclock Time = 00:02:27
CPU = 00:04:21
Max vmem = 42.301G
Max rss = NA
Exit Status = 0

FlicAnderson commented 3 years ago

This should be closed after PR for riboviz/#59 since that's the best place to put the information into the documentation I think.

FlicAnderson commented 3 years ago

Added information from this issue to the documentation in run-on-eddie.md and submitted PR for riboviz/#59