Open aclum opened 1 year ago
Example filtering command (low input library example): JGI pipeline name PB Filter (pb_filter-15.py 1.5.5)
Command: [pipeline_bin_path]/filter/pb_filter-15.py -f /clusterfs/jgi/scratch/dsi/aa/dm_archive/sdm/pacbio/00/28/61/pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.fastq.gz -o [output_path] -ccs --dedup
Actual commands that get executed
#!/bin/bash -l
# need PacBio smrtlink in the path
pbmarkdup --log-level INFO -f -r pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.bam pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.dedup.bam
# need bbtools in the path
# icecream filter - removes reads that are missing smrtlink adapters
icecreamfinder.sh jni=t json=t ow=t cq=f keepshortreads=f trim=f ccs=t in=triangle.trim2.tmp.bam stats=triangle.json out=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.unsorted.filter.bam outb=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.bad.bam outa=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.ambig.bam
# bbduk - trim out adapter from read ends
bbduk.sh k=20 mink=12 edist=1 mm=f ktrimtips=60 ref=/bbmap/resources/PacBioAdapter.fa in=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.dedup.bam out=triangle.trim.tmp.bam
# bbduk - removes reads that still contain adapter sequence
bbduk.sh k=24 edist=1 mm=f ref=/bbmap/resources/PacBioAdapter.fa in=triangle.trim.tmp.bam out=pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs.unsorted.filter.bam
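The four steps above share one file prefix, so the commands can be generated from it rather than typed by hand. The following is a minimal sketch, not the code inside pb_filter-15.py. The function name `pb_filter_commands` and its variables are hypothetical. It also makes two assumptions that the listing does not confirm: that the steps run in data-flow order (dedup, trim, adapter removal, icecream), and that the second bbduk.sh step writes `triangle.trim2.tmp.bam`, the file icecreamfinder.sh reads. It only prints the commands and never runs them.

```shell
# pb_filter_commands PREFIX [ADAPTERS]
# Prints the four filtering commands for one CCS BAM, one per line.
# Nothing is executed. All flags are copied from the commands listed above.
# ASSUMPTION: steps are ordered by data flow, and the second bbduk.sh step
# writes triangle.trim2.tmp.bam, which icecreamfinder.sh then reads.
pb_filter_commands() {
    pbf_prefix="$1"
    pbf_adapters="${2:-/bbmap/resources/PacBioAdapter.fa}"

    # 1. deduplicate the CCS reads (needs PacBio smrtlink in the path)
    printf '%s\n' "pbmarkdup --log-level INFO -f -r ${pbf_prefix}.bam ${pbf_prefix}.dedup.bam"

    # 2. trim adapter sequence from read ends (needs bbtools in the path)
    printf '%s\n' "bbduk.sh k=20 mink=12 edist=1 mm=f ktrimtips=60 ref=${pbf_adapters} in=${pbf_prefix}.dedup.bam out=triangle.trim.tmp.bam"

    # 3. drop reads that still contain adapter sequence
    printf '%s\n' "bbduk.sh k=24 edist=1 mm=f ref=${pbf_adapters} in=triangle.trim.tmp.bam out=triangle.trim2.tmp.bam"

    # 4. icecream filter: drop reads that are missing smrtlink adapters
    printf '%s\n' "icecreamfinder.sh jni=t json=t ow=t cq=f keepshortreads=f trim=f ccs=t in=triangle.trim2.tmp.bam stats=triangle.json out=${pbf_prefix}.unsorted.filter.bam outb=${pbf_prefix}.bad.bam outa=${pbf_prefix}.ambig.bam"
}
```

Calling `pb_filter_commands pbio-2861.29528.bc2002_OA--bc2002_OA.bc2002_OA--bc2002_OA.ccs` prints the commands for the example library, so they can be reviewed before being run by hand or copied into WDL tasks.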
I will try to get an updated version of the filtering container. Until then, this version can be used: bryce911/rqc-pipeline:20230410 (this is pb_filter-15.py version 1.5.3).
Emailed Stephan about an updated container version for rqc filter.
The database dependency is provided within the rqc-pipeline container at /bbmap/resources/PacBioAdapter.fa
microbiomedata/bbtools:38.96 has been updated to microbiomedata/bbtools:39.01, which has PacBioAdapter.fa in the path /bbmap/resources/
I think you'll need to use bryce911/rqc-pipeline, which has both the filtering wrapper script and the adapters.
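To compare the two containers, a small pre-flight check run inside each one can confirm that the adapter file and the required tools are present. This is only a sketch: the function name `preflight` is hypothetical, and the tool list in the usage note is taken from the commands earlier in this thread.

```shell
# preflight ADAPTER_FASTA TOOL...
# Prints one line per missing item and returns non-zero if anything is missing.
# Hypothetical helper for illustration; not part of the rqc-pipeline container.
preflight() {
    pf_adapters="$1"
    shift
    pf_missing=0

    # the adapter FASTA must exist and be non-empty
    if [ ! -s "$pf_adapters" ]; then
        echo "missing adapter file: $pf_adapters"
        pf_missing=1
    fi

    # every remaining argument must resolve to a command on PATH
    for pf_tool in "$@"; do
        if ! command -v "$pf_tool" >/dev/null 2>&1; then
            echo "missing tool: $pf_tool"
            pf_missing=1
        fi
    done

    return "$pf_missing"
}
```

Inside a container this could be called as `preflight /bbmap/resources/PacBioAdapter.fa pbmarkdup bbduk.sh icecreamfinder.sh pb_filter-15.py`. A container without the wrapper script would then report it as a missing tool.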
I checked pb_filter-15.py, which tries to find the BAM file or uses JAMO to query the JGI server to restore it. If the user provides only a fastq.gz CCS file, it fails with "Error: cannot find BAM xxxx". More generally, we could implement the four commands you provided as tasks in WDL.
That should be okay since the assembly wdl requires a fastq as input.
There is a new container with the latest RQC code: bryce911/rqc-pipeline:20230914
@hubin-keio @mshakya @kaijli any update on this issue? Today is the last day of the sprint. Please either close it, move it to the next sprint, or add the backlog label to it if you're not going to work on it in the next few weeks.
Not all of it is tested, but this branch of the ReadsQC repo now has a WDL for long reads, and the short reads related WDL files have been changed to work with Version 1.0. I've also combined the interleave related files with the original rqcfilter.wdl into shortReadsqc.wdl and included a simple WDL to choose between short and long reads.
Fantastic! I will move this to the new sprint to finish testing. Thanks for the update @kaijli
Appears to be in progress from GitLab issue. Will roll over to new sprint.
@kaijli @chienchi is there an update on this?
I saw that Robert has applied the patch and Brian merged it into master in the jgi_meta_wdl GitLab repo. We updated the submodule pointer to the new merge commit. We will start working on a WDL that enables a switch between Illumina short reads and PacBio long reads.
Appears to be active. Moving to new sprint.
@kaijli please provide an update here. Will you have time to continue working on this next sprint?
@aclum Yes, apologies. The wrapper is done and runs to completion on short reads and long reads (tested using the long reads test file provided at top of the thread). It's ready for the next steps.
@kaijli can this issue be closed and a new issue be created for the next step?
@ssarrafan Not entirely sure if I have the authority for this, but sure? Not familiar with how git issues and processes work 😅
Let's review this next week. I think we may need to add a task to generate an info file with assembly methods like we do for the other workflows. @kaijli do you have an example output folder at NERSC?
@aclum @kaijli this issue has been open since July and no updates for the last 2 weeks so I'm removing this from the sprint and adding the backlog label to it.
If this will be actively addressed next sprint please add it to the new board.
Posted update on ticket 454
Submitted merge request for JGI repo
@kaijli can you make a separate ticket for implementing this in EDGE? This ticket was for the wdl development so I'd like to close.
NMDC has a FY24 Q2 milestone of 'Assembly support for long-read sequence data'. JGI has an existing workflow in WDL which NMDC will adopt.
https://code.jgi.doe.gov/BFoster/jgi_meta_wdl/-/blob/master/metagenome_improved/metaflye.wdl
Robert Riley is the point of contact for this workflow.