NMDC long read assembly workflow

aclum commented 1 year ago

NMDC has a FY24 Q2 milestone of 'Assembly support for long-read sequence data'. JGI has an existing workflow in WDL which NMDC will adopt.

https://code.jgi.doe.gov/BFoster/jgi_meta_wdl/-/blob/master/metagenome_improved/metaflye.wdl

Robert Riley is the point of contact for this workflow.

aclum commented 1 year ago

Example filtering command (low input library example): JGI pipeline name PB Filter (pb_filter-15.py 1.5.5)

Command: [pipeline_bin_path]/filter/pb_filter-15.py -f /clusterfs/jgi/scratch/dsi/aa/dm_archive/sdm/pacbio/00/28/61/pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.fastq.gz -o [output_path] -ccs --dedup

Actual commands that get executed

#!/bin/bash -l

# need PacBio smrtlink in the path
pbmarkdup --log-level INFO -f -r pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.bam pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.dedup.bam

# need bbtools in the path

# icecream filter - removes reads that are missing smrtlink adapters
icecreamfinder.sh jni=t json=t ow=t cq=f keepshortreads=f trim=f ccs=t in=triangle.trim2.tmp.bam stats=triangle.json out=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.unsorted.filter.bam outb=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.bad.bam outa=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.ambig.bam

# bbduk - trim out adapter from read ends
bbduk.sh k=20 mink=12 edist=1 mm=f ktrimtips=60 ref=/bbmap/resources/PacBioAdapter.fa in=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.dedup.bam out=triangle.trim.tmp.bam

# bbduk - removes reads that still contain adapter sequence
bbduk.sh k=24 edist=1 mm=f ref=/bbmap/resources/PacBioAdapter.fa in=triangle.trim.tmp.bam out=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.unsorted.filter.bam

I will try to get an updated version of the filtering container. Until then, this version can be used: bryce911/rqc-pipeline:20230410 (this is pb_filter-15.py version 1.5.3).
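
The four commands above all derive their file names from one library prefix, so they can be generated from a single variable. Below is a minimal dry-run sketch of that idea. It only prints the commands, in the order and with the options quoted above, and runs nothing. The function name `print_filter_cmds` and its argument are illustrative assumptions and are not part of pb_filter-15.py. The intermediate name `triangle.trim2.tmp.bam` is kept exactly as quoted, even though none of the listed commands writes it.

```shell
#!/bin/bash
# Dry-run sketch: print the four filtering commands for one library prefix.
# print_filter_cmds is a hypothetical helper, not part of the JGI wrapper.

print_filter_cmds() {
    local prefix="$1"                                   # file name up to and including ".ccs"
    local adapters="/bbmap/resources/PacBioAdapter.fa"  # adapter path quoted in the thread

    # duplicate marking (needs PacBio smrtlink in the path)
    echo "pbmarkdup --log-level INFO -f -r ${prefix}.bam ${prefix}.dedup.bam"

    # icecream filter: removes reads that are missing smrtlink adapters
    echo "icecreamfinder.sh jni=t json=t ow=t cq=f keepshortreads=f trim=f ccs=t in=triangle.trim2.tmp.bam stats=triangle.json out=${prefix}.unsorted.filter.bam outb=${prefix}.bad.bam outa=${prefix}.ambig.bam"

    # bbduk: trim adapter from read ends
    echo "bbduk.sh k=20 mink=12 edist=1 mm=f ktrimtips=60 ref=${adapters} in=${prefix}.dedup.bam out=triangle.trim.tmp.bam"

    # bbduk: remove reads that still contain adapter sequence
    echo "bbduk.sh k=24 edist=1 mm=f ref=${adapters} in=triangle.trim.tmp.bam out=${prefix}.unsorted.filter.bam"
}

# Example: print the commands for the library used in this thread.
print_filter_cmds "pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs"
```

Printing the commands first makes it easy to compare them against the wrapper's log before running anything inside the container.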

aclum commented 1 year ago

Emailed Stephan about an updated container version for rqc filter.

aclum commented 1 year ago

The database dependency is provided within the rqc-pipeline container at /bbmap/resources/PacBioAdapter.fa

chienchi commented 1 year ago

The microbiomedata/bbtools:38.96 image has been updated to microbiomedata/bbtools:39.01, which includes PacBioAdapter.fa under /bbmap/resources/
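
A quick way to confirm that the adapter reference is where the commands expect it is to test for a non-empty file at that path. This is a small sketch under assumptions: `check_adapter_ref` is a hypothetical helper, and its optional argument is a root directory prefix added only so the check can be tried outside a container.

```shell
#!/bin/bash
# Sketch: verify that the PacBio adapter reference exists and is non-empty.
# check_adapter_ref is a hypothetical helper, not part of bbtools.

check_adapter_ref() {
    local root="${1:-}"                                  # optional root prefix; empty inside the container
    local ref="${root}/bbmap/resources/PacBioAdapter.fa"

    if [ -s "$ref" ]; then
        echo "found: $ref"
    else
        echo "missing: $ref" >&2
        return 1
    fi
}

# Inside the container the call would simply be:
#   check_adapter_ref
# From the host, listing the file through Docker is an equivalent check:
#   docker run --rm microbiomedata/bbtools:39.01 ls -l /bbmap/resources/PacBioAdapter.fa
```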

aclum commented 1 year ago

I think you'll need to use bryce911/rqc-pipeline, which has both the filtering wrapper script and the adapters.

chienchi commented 1 year ago

I checked pb_filter-15.py, which appears to look for a BAM file or use JAMO to query the JGI server to restore the BAM file. If the user provides only a fastq.gz CCS file, it shows "Error: cannot find BAM xxxx". To keep this generic, we could implement the four commands you provided as tasks in WDL.

aclum commented 1 year ago

That should be okay since the assembly wdl requires a fastq as input.

aclum commented 1 year ago

There is a new container with the latest RQC code bryce911/rqc-pipeline:20230914

ssarrafan commented 11 months ago

@hubin-keio @mshakya @kaijli any update on this issue? Today is the last day of the sprint. Please either close, move to next sprint or add the backlog label to it if you're not going to work it in the next few weeks.

kaijli commented 11 months ago

Not all of it is tested, but this branch of the ReadsQC repo now has a WDL for long reads, and the short reads related WDL files have been changed to work with Version 1.0. I've also combined the interleave related files with the original rqcfilter.wdl into shortReadsqc.wdl and included a simple WDL to choose between short and long reads.

ssarrafan commented 11 months ago

Fantastic! I will move this to the new sprint to finish testing. Thanks for the update @kaijli

chienchi commented 11 months ago

For the long-read assembly, we intend to add this repo as a submodule. During our testing, we identified some minor issues with the samtools and flye tasks. We have created an issue on the repo and hope the patch can be applied so that we can update the submodule pointer to the fixed version.

ssarrafan commented 11 months ago

Appears to be in progress from GitLab issue. Will roll over to new sprint.

aclum commented 10 months ago

@kaijli @chienchi is there an update on this?

chienchi commented 10 months ago

I saw that Robert has applied the patch and Brian merged it into master in the jgi_meta_wdl GitLab repo. We updated the submodule pointer to the new merge commit. We will start working on a WDL that enables a switch between Illumina short reads and PacBio long reads.

ssarrafan commented 10 months ago

Appears to be active. Moving to new sprint.

aclum commented 10 months ago

@kaijli please provide an update here. Will you have time to continue working on this next sprint?

kaijli commented 10 months ago

@aclum Yes, apologies. The wrapper is done and runs to completion on short reads and long reads (tested using the long reads test file provided at top of the thread). It's ready for the next steps.

ssarrafan commented 10 months ago

@kaijli can this issue be closed and a new issue be created for the next step?

kaijli commented 10 months ago

@ssarrafan Not entirely sure if I have the authority for this, but sure? Not familiar with how git issues and processes work 😅

aclum commented 10 months ago

Let's review this next week. I think we may need to add a task to generate an info file with assembly methods, like we do for the other workflows. @kaijli do you have an example output folder at NERSC?

ssarrafan commented 9 months ago

@aclum @kaijli this issue has been open since July and no updates for the last 2 weeks so I'm removing this from the sprint and adding the backlog label to it.

If this will be actively addressed next sprint please add it to the new board.

kaijli commented 6 months ago

Posted update on ticket 454

kaijli commented 5 months ago

Submitted merge request for JGI repo

aclum commented 1 month ago

@kaijli can you make a separate ticket for implementing this in EDGE? This ticket was for the WDL development, so I'd like to close it.

microbiomedata / issues

NMDC long read assembly workflow #376