h3abionet / HPCBio-Refgraph_pipeline

Filtering & Annotation workflow #36

Open cjfields opened 3 years ago

cjfields commented 3 years ago

The first step in the workflow (assembly) is performed per sample and is in assembly.nf. @NeginValizadegan will work on the annotation steps for each sample assembly, with the basic steps:

  1. Filter reads to a given minimum length (default = 500 bp, same as used for HUPAN per Jess Bourne).
  2. BLAST runs against references, using (1) GRCh38 + alts + decoys (same as the alignment), (2) Sherman-Salzberg data, (3) others (CHM13?).
  3. RepeatMasker (see Kim's notes in the repo)
  4. QUAST
  5. Contamination detection. We have used Kraken2 for this (code is on Gloria's branch), but we may want to check what HUPAN is doing here, which I believe is BLASTN.

Any others?
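As an illustration of step 1 only, here is a minimal sketch of a minimum-length filter. It uses plain awk so that it has no dependencies; it is not the seqkit step mentioned later in this thread. The function name `filter_min_length` is hypothetical, and the 500 bp default is the one quoted above.

```shell
# Hypothetical helper: keep only FASTA records whose sequence is at least
# min_len bases long. Reads FASTA on stdin, writes FASTA on stdout.
# Multi-line sequences are joined, so each kept record is written as a
# header line followed by a single sequence line.
filter_min_length() {
    local min_len="${1:-500}"   # default of 500 bp is taken from the issue text
    awk -v min="$min_len" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # accumulate sequence lines
        END  { emit() }                                # last record in the file
    '
}
```

Usage would look like `filter_min_length 500 < sample.fasta > sample.min500.fasta`, where both file names are placeholders.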

cjfields commented 3 years ago

@NeginValizadegan maybe start with simple bash scripts first for testing the steps, then add to a nextflow script.

NeginValizadegan commented 3 years ago

Linking commits 7374f498e0060c999e96224d99315ef181b79ed3 and 696cbcffb846ef9de93850958792984ee995fbd2 here.

NeginValizadegan commented 3 years ago

There are some memory-related issues with the blastn step. The job was killed at 10 GB of memory and hit a bus error at 40, 100, and even 150 GB. Still troubleshooting.

cjfields commented 3 years ago

@NeginValizadegan re: the BLASTN work (and the annotation steps in general), I'm guessing you are trying to run all the annotation steps in one bash script? I'd recommend keeping it simple and testing out each step in an independent bash script; these can be independently moved into nextflow process blocks when they are working.

So for example you have the seqkit step in the annotation.sh bash script. You can try running BLASTN in a separate blastn.sh bash script, RepeatMasker in rm.sh, etc. The inputs (FASTA files) will largely be the same for all of these.
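A rough sketch of what a standalone blastn.sh could contain, assuming one FASTA file per sample and an already-built BLAST database. The function name `run_blastn`, the `DRY_RUN` switch, the output naming, and the choice of `-outfmt 6` are illustrative assumptions, not what is in the repo.

```shell
# blastn.sh (hypothetical layout): one annotation step, one script.
# Arguments: FASTA file, BLAST database name, output directory, threads.
run_blastn() {
    local fasta="$1" db="$2" outdir="$3" threads="${4:-4}"
    local sample
    # Derive a sample name from the file name, e.g. sample1.fasta -> sample1
    sample="$(basename "${fasta%.*}")"

    # Build the command as an array so paths with spaces stay intact.
    local cmd=(blastn -query "$fasta" -db "$db" -outfmt 6
               -num_threads "$threads" -out "$outdir/$sample.blastn.tsv")

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run; handy while testing the script.
        printf '%s\n' "${cmd[*]}"
    else
        mkdir -p "$outdir"
        "${cmd[@]}"
    fi
}
```

An rm.sh for RepeatMasker could follow the same shape with the same FASTA input, which is what should make each one easy to move into its own nextflow process block later.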

NeginValizadegan commented 3 years ago

@ChrisFields Yes, but I set it up so that I can deactivate specific steps, so I am not running everything at once even though it is all in one script. At the end of the script, there is a main section where I can easily comment out the steps I don't want to run.
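For readers following along, a minimal sketch of that kind of layout, assuming one function per step and a main section at the end. The function names and placeholder bodies are hypothetical and are not the contents of the actual script.

```shell
# Hypothetical single-script layout: one function per step, one main section.
# Each body is a placeholder that only reports what it would do.
step_filter()       { echo "step_filter $1"; }
step_blastn()       { echo "step_blastn $1"; }
step_repeatmasker() { echo "step_repeatmasker $1"; }
step_quast()        { echo "step_quast $1"; }

# Main section: comment out a line to skip that step.
main() {
    local fasta="$1"
    step_filter "$fasta"
    # step_blastn "$fasta"        # skipped in this run
    step_repeatmasker "$fasta"
    # step_quast "$fasta"         # skipped in this run
}
```

A real script would end with a call such as `main "$@"`; it is left out here so the functions can be sourced and tried one at a time.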

NeginValizadegan commented 3 years ago

Linking commit 627872f8d600c4af493243942096cfebb906187e here. Sorry, I forgot to add #36.

NeginValizadegan commented 3 years ago

Linking a3476ca061a8c095e5039c484734c9c98b6fe884 here

NeginValizadegan commented 3 years ago

Linking 321f764cb3eafcbdca4a204c5fb15ded692f5063

NeginValizadegan commented 2 years ago

Linking 1983b415c92a9adc8af57a86fe91252776ed88b5 here

cjfields commented 2 years ago

For example, you can do this to see the last commit: fc187b9

NeginValizadegan commented 2 years ago

Linking 0f296541908ed1a4b0b21024aed285f38ad3017b here.

NeginValizadegan commented 2 years ago

Linking 3c7310a2fb548852c926bb2a12ae7fbf33631f56 here.

NeginValizadegan commented 2 years ago

Linking cc16dc6af2356fc98955785328e86e3d85fe9d57 here.

NeginValizadegan commented 2 years ago

Linking db039c224bec35c5af38e955d2dc309acc320629 and 4eecbe5aed58ba726a330dd0b3b15b3a0f394605 here.

NeginValizadegan commented 2 years ago

Linking 4a9b3f244474177fafa79086f9acfc9ca45bda6e here (pipeline testing).