h3abionet / HPCBio-Refgraph_pipeline


Test assemblies from bowtie2 alignment #31

Closed cjfields closed 3 years ago

cjfields commented 3 years ago

We are seeing very few contigs coming through the current workflow, which I believe can be attributed to two key factors:

  1. The reference genome used is GRCh38 + alts and decoys, whereas the Salzberg paper uses only GRCh38 without alts and without most decoys (it looks like HSV is included). This means that sequences that align to alts in our reference may not align at all under the paper's criteria.
  2. bowtie2 was used in paired-end mode with all defaults, whereas the CRAM files indicate that bwa mem was used (as expected for variant calling). By default bowtie2 requires end-to-end alignments, so many more reads would fail to align to the reference genome, for example soft-clipped reads (potential split reads, but also reads around deletion breakpoints).
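To get a rough sense of how many reads the second point could affect, one option is to tally the soft-clipped alignments in the existing bwa mem output. The following is only a sketch: it works on CIGAR strings as defined in the SAM specification, and the helper names and the `min_clip` threshold of 20 bases are hypothetical choices, not something the workflow currently uses.

```python
import re

# CIGAR operations as defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    if cigar == "*":  # unmapped reads carry no CIGAR
        return (0, 0)
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside the soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (lead, trail)


def count_soft_clipped(cigars, min_clip=20):
    """Count CIGARs with at least `min_clip` soft-clipped bases at either end.

    `min_clip` is an assumed cutoff for illustration only.
    """
    return sum(1 for c in cigars if max(soft_clipped_bases(c)) >= min_clip)
```

Feeding this the CIGAR column (field 6) of the bwa mem alignments gives an upper-bound feel for how many reads rely on clipping and so might behave differently under end-to-end alignment.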

I am performing a test realignment of two data sets against the GRCh38_noalts reference that is available (this is one of the prebuilt assemblies available from the bowtie2 site), then running a side-by-side comparison. As a note: both versions seemed to ignore trimming, but here we will include it to make sure there are no residual adapters in the assembly.

cjfields commented 3 years ago

The newest version includes non-chimeric reads that are soft-clipped and also discordant (key features of reads around large insertion sites). We likely won't include split reads, since these are not typically hallmarks of unique insertions but instead represent large-scale rearrangements.
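As a rough illustration of that selection (not the pipeline's actual code), the criteria can be expressed against plain SAM records. The sketch below makes several assumptions: "discordant" is taken to mean paired but not flagged as a proper pair, and "non-chimeric" is taken to mean a primary alignment with no supplementary flag and no `SA:Z:` tag (the convention bwa mem uses for split alignments). The function name `is_candidate` is hypothetical.

```python
import re

# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

SOFT_CLIP = re.compile(r"\d+S")


def is_candidate(sam_line):
    """Return True for a soft-clipped, discordant, non-chimeric SAM record.

    The definitions of "discordant" and "non-chimeric" used here are
    assumptions for illustration, not the pipeline's confirmed criteria.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    tags = fields[11:]

    # Skip unmapped reads and anything that is not a primary alignment.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # An SA tag marks a read that has a split (chimeric) alignment elsewhere.
    if any(tag.startswith("SA:Z:") for tag in tags):
        return False

    discordant = bool(flag & PAIRED) and not (flag & PROPER_PAIR)
    soft_clipped = SOFT_CLIP.search(cigar) is not None
    return discordant and soft_clipped
```

The same test could be applied to records streamed from `samtools view`, one line at a time, skipping header lines that start with `@`.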