langmead-lab / monorail-external

examples to run monorail externally
MIT License

Notes for future users of monorail #10

Open davemcg opened 2 years ago

davemcg commented 2 years ago

Current link (the dir also contains the shell scripts I used): https://github.com/davemcg/metaRPE/blob/main/recount3/README.md

Reposted with a touch of tweaks to make a bit more general:

How to run monorail pump and unify

Observations, some unfounded and poorly understood

  1. Do not use local fastq files whose "core name" has a non-letter / non-number character within the last two positions
    • e.g. A1_1_R1.fastq.gz and A1_2_R2.fastq.gz seem to cause issues with unify, as pump uses the last two characters of the core name (in this case _1) as a subfolder, which then seems to break some of the later concatenation steps (see the rename sketch after this list)
  2. Just make the project names one case, maybe lower? I had some weird issues when I had project names like "Bob" which were perhaps resolved when I changed them to "bob"?
  3. VERY IMPORTANT: unify will use ALL THE FOLDERS/SAMPLES in the pump output as input for unify. So you cannot "mix and match" custom unify output by just messing with the sample metadata file.
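
If your files already have an awkward suffix, one workaround is to rename them up front. The loop below is only a sketch of that idea; the fastq/ directory and the trailing _<digit> pattern are assumptions based on the example above, not anything monorail itself requires.

# Sketch: drop an underscore sitting in the last two characters of the "core name"
# (e.g. A1_1_R1.fastq.gz -> A11_R1.fastq.gz). Paths and patterns are assumptions.
for f in fastq/*_R[12].fastq.gz; do
    [ -e "$f" ] || continue
    base=$(basename "$f")
    core=${base%_R[12].fastq.gz}                 # e.g. A1_1
    read_suffix=${base#"$core"}                  # e.g. _R1.fastq.gz
    clean_core=$(echo "$core" | sed 's/_\([0-9A-Za-z]\)$/\1/')
    if [ "$clean_core" != "$core" ]; then
        mv -v "$f" "fastq/${clean_core}${read_suffix}"
    fi
done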

Workflow

  1. cd /data/mcgaugheyd/projects/nei/bharti/metaRPE
  2. cp /home/mcgaugheyd/metaRPE/recount3/get_image.sh . ; bash get_image.sh # get both pump and unify singularity images
  3. bash ~/git/monorail-external/get_unify_refs.sh hg38 # unify refs
  4. bash ~/git/monorail-external/get_human_ref_indexes.sh
  5. mkdir references; mv hg38 references/; mv hg38_unify references/ # mv pump and unify ref folders to subfolder
  6. cp /home/mcgaugheyd/metaRPE/recount3/run_pump.sh . ;cp /home/mcgaugheyd/metaRPE/recount3/pump_commands.sh .
  7. bash pump_commands.sh # runs all pump jobs (invoking the run_pump.sh script; a sketch of such a driver follows this list)
    • outputs in pump/[lane_file] so you can simultaneously run the pump commands and avoid overlapping snakemake job rage
    • wait until the job finishes (the slower ones take 2-3 hours)
  8. mkdir pump_output; rsync --progress -rav pump/*/output/ pump_output # consolidate the pump outputs to one directory
  9. mkdir unify_output; mv recount-unify_1.0.9.sif unify_output; cd unify_output; cp /home/mcgaugheyd/git/metaRPE/data/recount_sample_metadata.tsv . ; cp /home/mcgaugheyd/metaRPE/recount3/run_unify.sh .
  10. sbatch --cpus-per-task 6 --mem=32G run_unify.sh
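
For context, pump_commands.sh and run_pump.sh are my own wrapper scripts and aren't reproduced in this thread. The sketch below only illustrates the general shape of such a driver; the samples.tsv layout, the sbatch resources, and the arguments passed to run_pump.sh are assumptions for illustration, not the real interface.

#!/usr/bin/env bash
# Hypothetical pump driver: one job per sample so each run gets its own
# pump/<sample> working directory and the Snakemake instances don't collide.
# Assumes a tab-separated samples.tsv with columns sample_id and fastq_prefix,
# and a run_pump.sh that accepts those two arguments (NOT the real script).
set -euo pipefail

while IFS=$'\t' read -r sample fastq_prefix; do
    mkdir -p "pump/${sample}"
    sbatch --cpus-per-task 8 --mem=32G \
        --job-name "pump_${sample}" \
        run_pump.sh "${sample}" "${fastq_prefix}"
done < samples.tsv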

How to munge monorail-unify output into recount3 format

Why bother? Well, the unify output is not straight counts, but rather the sum of the base-pair-level coverage. recount3 has tooling to do the conversion. I briefly looked into whether I could "directly" do the transform but after a few minutes of traversing through their code, it seemed easier to just make the RangedSummarizedExperiment (RSE) data structure by moving the unify outputs around. Plus the RSE is a more portable and useful format should someone else want to use the data.

OK, so super briefly: you have to move the unify outputs around a bit (to my local computer in my example) so recount3 can import the data and make the RSE, and then you can run (in R) recount3::transform_counts to get the straight counts for downstream use.

Their example directory is here: http://snaptron.cs.jhu.edu/data/temp/recount3test/

Fairly complete instructions: https://github.com/langmead-lab/monorail-external#loading-custom-unifier-runs-into-recount3

My directory structure (gaps are where I've deleted lines to make a touch shorter to view):

└── human
    ├── annotations
    │   ├── exon_sums
    │   │   ├── human.exon_sums.G026.gtf.gz
    │   │   └── human.exon_sums.G029.gtf.gz
    │   └── gene_sums
    │       ├── human.gene_sums.G026.gtf.gz
    │       └── human.gene_sums.G029.gtf.gz
    ├── data_sources
    │   └── metaRPE
    │       ├── base_sums
    │       ├── exon_sums
    │       │   ├── de
    │       │   │   └── davide
    │       │   │       ├── metaRPE.exon_sums.davide.ERCC.gz
    │       │   │       ├── metaRPE.exon_sums.davide.F006.gz
    │       │   │       ├── metaRPE.exon_sums.davide.G026.gz
    │       │   │       ├── metaRPE.exon_sums.davide.G029.gz
    │       │   │       ├── metaRPE.exon_sums.davide.R109.gz
    │       │   │       └── metaRPE.exon_sums.davide.SIRV.gz
    │       │   ├── hi
    │       │   │   └── ruchi
    │       │   │       ├── metaRPE.exon_sums.ruchi.ERCC.gz
    │       │   │       ├── metaRPE.exon_sums.ruchi.F006.gz

    │       ├── gene_sums
    │       │   ├── de
    │       │   │   └── davide
    │       │   │       ├── metaRPE.gene_sums.davide.ERCC.gz
    │       │   │       ├── metaRPE.gene_sums.davide.F006.gz
    │       │   │       ├── metaRPE.gene_sums.davide.G026.gz
    │       │   │       ├── metaRPE.gene_sums.davide.G029.gz

    │       ├── junctions
    │       │   ├── de
    │       │   │   └── davide
    │       │   │       ├── metaRPE.junctions.davide.ALL.ID.gz
    │       │   │       ├── metaRPE.junctions.davide.ALL.MM.gz
    │       │   │       ├── metaRPE.junctions.davide.ALL.RR.gz

    │       └── metadata
    │           ├── de
    │           │   └── davide
    │           │       ├── metaRPE.metaRPE.davide.MD.gz
    │           │       ├── metaRPE.recount_project.davide.MD.gz
    │           │       └── metaRPE.recount_qc.davide.MD.gz

    │           ├── in
    │           │   └── qin
    │           │       ├── metaRPE.metaRPE.qin.MD.gz
    │           │       ├── metaRPE.recount_project.qin.MD.gz
    │           │       └── metaRPE.recount_qc.qin.MD.gz
    │           ├── la
    │           │   └── karla
    │           │       ├── metaRPE.metaRPE.karla.MD.gz
    │           │       ├── metaRPE.recount_project.karla.MD.gz
    │           │       └── metaRPE.recount_qc.karla.MD.gz
    │           └── metaRPE.recount_project.MD.gz
    └── home_index

Notes:

  1. Top level folder is organism (human or mouse)
  2. Fill the annotations folder by wget-ing the files from Langmead and co
    • Example for G026 gene sums: wget http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz
  3. metaRPE is my project name. monorail/recount tend to use sra as theirs.
  4. The data_sources folder is filled by rsyncing various unify output folders (see the consolidated sketch after these notes)
    • in my case it is /data/mcgaugheyd/projects/nei/bharti/metaRPE/unify_output
    • base_sums is currently empty as recount3 doesn't need the bigWig files (which are in the pump outputs)
    • exon_sums_per_study renamed to exon_sums
    • gene_sums_per_study renamed to gene_sums
    • junction_counts_per_study renamed to junctions
    • metadata doesn't need to be renamed
      • HOWEVER you need to "hand make" metaRPE.recount_project.MD.gz
        • My janky ass code that I ran in the unify output folder (see path above):
        • zcat metadata/*/*/*recount_project.* | head -n 1 | gzip > metadata/metaRPE.recount_project.MD.gz # this grabs just the header
        • zcat metadata/*/*/*recount_project.* | grep -v rail_id | gzip >> metadata/metaRPE.recount_project.MD.gz # copy the rest of the meta sans the headers
  5. The home_index file is just a text file with data_sources/metaRPE in it
    • replace metaRPE with whatever your "project" name is (again, sra is commonly used by the monorail/recount team)
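
Pulling notes 2, 4, and 5 together, here is a rough sketch of the shuffling in shell. The recount3_local root, the metaRPE project name, and the unify output path are placeholders from my setup; swap in your own, and treat this as a sketch rather than a drop-in script.

# Sketch of building the recount3-style tree from the unify output.
PROJECT=metaRPE
UNIFY=/data/mcgaugheyd/projects/nei/bharti/metaRPE/unify_output
ROOT=recount3_local/human        # hypothetical local destination

mkdir -p "$ROOT/annotations/gene_sums" "$ROOT/annotations/exon_sums" \
         "$ROOT/data_sources/$PROJECT/base_sums"   # base_sums can stay empty

# annotations come from Langmead and co (G026 gene sums shown; repeat for the others)
wget -P "$ROOT/annotations/gene_sums" \
    http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz

# copy the unify outputs in, renaming the *_per_study folders as in note 4
rsync -rav "$UNIFY/gene_sums_per_study/"       "$ROOT/data_sources/$PROJECT/gene_sums/"
rsync -rav "$UNIFY/exon_sums_per_study/"       "$ROOT/data_sources/$PROJECT/exon_sums/"
rsync -rav "$UNIFY/junction_counts_per_study/" "$ROOT/data_sources/$PROJECT/junctions/"
rsync -rav "$UNIFY/metadata/"                  "$ROOT/data_sources/$PROJECT/metadata/"

# home_index is just a text file pointing at the project folder (note 5)
echo "data_sources/${PROJECT}" > "$ROOT/home_index"

# hand-make the project-level metadata file (header once, then all rows)
cd "$ROOT/data_sources/$PROJECT"
zcat metadata/*/*/*recount_project.* | head -n 1 | gzip > metadata/${PROJECT}.recount_project.MD.gz
zcat metadata/*/*/*recount_project.* | grep -v rail_id | gzip >> metadata/${PROJECT}.recount_project.MD.gz
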
ChristopherWilks commented 2 years ago

thanks @davemcg for the thorough notes on your instantiation of monorail and the input into recount3! glad to know it's useful for someone even with the incomplete documentation, though we'll attempt to continually improve that.

SKarr07 commented 1 year ago

How is it possible that I only get read counts in the F006 gene sums files and not in all the rest of them? Can you help me, please?

lonsbio commented 10 months ago

@davemcg Just need to say thank you for these notes! I was having tremendous trouble with an error like this

MissingInputException in line 164 of /recount-unify/Snakefile:
Missing input files for rule collect_pasted_sums:
staging/all.exon_bw_count.dy._7.pasted
staging/all.exon_bw_count.dy._9.pasted

and your clue about:

Do not use local fastq files whose "core name" has a non-letter / non-number character within the last two positions
- e.g. A1_1_R1.fastq.gz and A1_2_R2.fastq.gz seem to cause issues with unify, as pump uses the last two characters of the core name (in this case _1) as a subfolder, which then seems to break some of the later concatenation steps

also seems to apply to study and sample names - a sample id like CNC12_1 causes errors, but CNC121 is fine.

davemcg commented 10 months ago

Thanks @lonsbio

If it helps, I have a standalone Snakefile which I use now. It's written for my own purposes, so it is very much not "plug and play", but it can be adapted for personal use without a ton of trouble (if you are semi-familiar with Snakemake).

https://github.com/davemcg/Snakerail

ChristopherWilks commented 10 months ago

@davemcg I went ahead and linked these thorough notes and your Snakerail repo in the main README. Please let me know if that's an issue for you.