davemcg opened 2 years ago
thanks @davemcg for the thorough notes on your instantiation of monorail and the input into recount3! glad to know it's useful for someone even with the incomplete documentation, though we'll attempt to continually improve that.
How is it possible that I only get read counts in the F006 gene sums files and not in any of the rest? Can you help me, please?
@davemcg Just need to say thank you for these notes! I was having tremendous trouble with an error like this
MissingInputException in line 164 of /recount-unify/Snakefile:
Missing input files for rule collect_pasted_sums:
staging/all.exon_bw_count.dy._7.pasted
staging/all.exon_bw_count.dy._9.pasted
and your clue about:
Do not have fastq files (local) that have a non-letter / non-number within the last two characters of the "core name"
- e.g. `A1_1_R1.fastq.gz` and `A1_2_R2.fastq.gz` seem to cause issues with unify, as pump uses the last two characters of the name (in this case `_1`) as a subfolder, which seems to cause issues with some concatenation steps later
also seems to apply to study and sample names - a sample id like `CNC12_1` causes errors, but `CNC121` is fine.
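The naming rule above can be checked up front. A minimal sketch (not from the thread; the `check_core` helper and the `_R1`/`_R2` suffix convention are assumptions) that flags fastq core names whose last two characters include a non-alphanumeric:

```shell
# Hypothetical helper: strip the read suffix to get the "core name",
# then flag cores whose last two characters contain a non-alphanumeric
# (e.g. A1_1_R1.fastq.gz -> core A1_1, which ends in _1 and trips pump/unify).
check_core() {
  core="${1%_R[12].fastq.gz}"   # drop the _R1/_R2 read suffix
  tail2="${core: -2}"           # last two characters of the core
  case "$tail2" in
    *[!A-Za-z0-9]*) echo "BAD  $1 (core: $core)";;
    *)              echo "OK   $1 (core: $core)";;
  esac
}
check_core A1_1_R1.fastq.gz     # flagged BAD: core A1_1 ends in _1
check_core CNC121_R1.fastq.gz   # OK: core CNC121 ends in 21
```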
Thanks @lonsbio
If it helps, I have a standalone Snakefile which I use now. For my own purposes, so it is very much not "plug and play". But it can be adapted for personal use without a ton of trouble (if you are semi-familiar with Snakemake).
@davemcg I went ahead and linked these thorough notes and your Snakerail repo in the main README. Please let me know if that's an issue for you.
Current link (the dir also contains the shell scripts I used): https://github.com/davemcg/metaRPE/blob/main/recount3/README.md
Reposted with a few tweaks to make it a bit more general:
How to run monorail pump and unify
Observations, some unfounded and poorly understood
unify
- Certain fastq file names cause issues with `unify`, as `pump` uses the last two characters of the name (in this case `_1`) as a subfolder, which seems to cause issues with some concatenation steps later.
- `unify` will use ALL THE FOLDERS/SAMPLES in the `pump` output as input for `unify`. So you cannot "mix and match" custom unify output by just messing with the sample metadata file.

Workflow
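Given that constraint, one hypothetical workaround (my own sketch, not something the monorail docs describe; the sample names and the `pump_subset` directory are placeholders) is to stage only the samples you want into a fresh directory of symlinks and point unify at that instead:

```shell
# Placeholder pump output folders, so the sketch is self-contained.
mkdir -p pump_output/sampleA pump_output/sampleB pump_output/sampleC

# Stage just the subset you want unify to see.
mkdir -p pump_subset
for s in sampleA sampleB; do                  # the subset you actually want
  ln -s "$PWD/pump_output/$s" "pump_subset/$s"
done
ls pump_subset                                # only sampleA and sampleB
```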
cd /data/mcgaugheyd/projects/nei/bharti/metaRPE
cp /home/mcgaugheyd/metaRPE/recount3/get_image.sh . ; bash get_image.sh # get both pump and unify singularity images
bash ~/git/monorail-external/get_unify_refs.sh hg38 # unify refs
bash ~/git/monorail-external/get_human_ref_indexes.sh
mkdir references; mv hg38 references/; mv hg38_unify references/ # mv pump and unify ref folders to subfolder
cp /home/mcgaugheyd/metaRPE/recount3/run_pump.sh . ;cp /home/mcgaugheyd/metaRPE/recount3/pump_commands.sh .
bash pump_commands.sh # runs all pump jobs (invoking the run_pump.sh script)
mkdir pump_output; rsync --progress -rav pump/*/output/ pump_output # consolidate the pump outputs to one directory
mkdir unify_output; mv recount-unify_1.0.9.sif unify_output; cd unify_output; cp /home/mcgaugheyd/git/metaRPE/data/recount_sample_metadata.tsv . ; cp /home/mcgaugheyd/metaRPE/recount3/run_unify.sh .
sbatch --cpus-per-task 6 --mem=32G run_unify.sh
How to munge monorail-unify output into recount3 format
Why bother? Well, the unify output is not straight counts, but rather the sum of the base-pair-level coverage. recount3 has tooling to do the conversion. I briefly looked into whether I could "directly" do the transform, but after a few minutes of traversing their code, it seemed easier to just build the RangedSummarizedExperiment (RSE) data structure by moving the unify outputs around. Plus the RSE is a more portable and useful format should someone else want to use the data.
OK, so super briefly: you have to move the unify outputs around a bit (to my local computer in my example) so recount3 can import the data and make the RSE, and then you can run (in R) `recount3::transform_counts` to get the straight counts for downstream use.

Their example directory is here: http://snaptron.cs.jhu.edu/data/temp/recount3test/
Fairly complete instructions: https://github.com/langmead-lab/monorail-external#loading-custom-unifier-runs-into-recount3
My directory structure (gaps are where I've deleted lines to make a touch shorter to view):
Notes:
- The annotation folder is filled by `wget`ing the files from Langmead and co: wget http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz
- metaRPE is my project name. monorail/recount tend to use sra as theirs.
- The data_sources folder is filled by `rsync`ing various unify output folders (/data/mcgaugheyd/projects/nei/bharti/metaRPE/unify_output)
- base_sums is currently empty, as recount3 doesn't need the bigWig files (which are in the pump outputs)
- exon_sums_per_study renamed to exon_sums
- gene_sums_per_study renamed to gene_sums
- junction_counts_per_study renamed to junctions
- metadata doesn't need to be renamed
- metaRPE.recount_project.MD.gz is built from the per-sample metadata files:
zcat metadata/*/*/*recount_project.* | head -n 1 | gzip > metadata/metaRPE.recount_project.MD.gz # this grabs just the header
zcat metadata/*/*/*recount_project.* | grep -v rail_id | gzip >> metadata/metaRPE.recount_project.MD.gz # copy the rest of the meta sans the headers
- The home_index file is just a text file with data_sources/metaRPE in it; replace metaRPE with whatever your "project" name is (again, sra is commonly used by the monorail/recount team)
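The renames described in the notes can be scripted. A minimal sketch in a hypothetical scratch directory (the metaRPE project name and the top-level home_index location are assumptions from the layout above; adjust to your project):

```shell
# Build the recount3-style layout in a scratch dir, starting from the
# unify-style folder names and renaming them as described above.
mkdir -p scratch/data_sources/metaRPE
cd scratch/data_sources/metaRPE
mkdir -p exon_sums_per_study gene_sums_per_study junction_counts_per_study metadata
mv exon_sums_per_study exon_sums
mv gene_sums_per_study gene_sums
mv junction_counts_per_study junctions    # metadata keeps its name
mkdir -p base_sums                        # left empty; recount3 doesn't need the bigWigs here
cd ../..                                  # back up to the scratch root
echo "data_sources/metaRPE" > home_index  # project name; sra is the common default
cd ..                                     # back to where we started
```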