stefinfection / RUFUS

RUFUS k-mer based genomic variant detection

Address Illumina chemistry problem which interferes with tandem duplication calling #44

Open stefinfection opened 2 weeks ago

stefinfection commented 2 weeks ago

This issue arose because Illumina no longer does size selection with their new chemistry. As a result, many overlapping reads that start at exactly the same base can pile up and look like a breakpoint. AF's fastp fix addresses this, but it does so by throwing away any reads that align to themselves, which has the side effect of also throwing away reads that represent real tandem duplications. To address this, we can allow a small amount of overlap between the reads; this should get rid of true chemistry artifacts and keep true tandem dups.

Yingqi from HU's lab has already written a Python script to do this: merged_select.py. A simple fix is to include it as shown in the screenshot below. A longer-term fix is to look for reads that align on top of themselves at the same starting point during the assembly step of RUFUS, which would eliminate the need for the patch script.

stefinfection commented 2 weeks ago

Image

stefinfection commented 2 weeks ago

```python
import sys
from itertools import islice

fin = sys.argv[1]   # input FASTQ of merged reads
fout = sys.argv[2]  # output FASTQ of selected, trimmed reads

with open(fout, 'w') as fileout, open(fin, 'r') as filein:
    while True:
        # Read one 4-line FASTQ record: header, sequence, '+', quality
        record = [line.rstrip() for line in islice(filein, 4)]
        if not record:
            break
        # The second header field is split on '_'; its second and third
        # parts are summed to give the merged read length
        parts = record[0].split()[1].split('_')
        rlength = int(parts[1]) + int(parts[2])
        if rlength >= 150:
            for i, line in enumerate(record):
                if i in (1, 3):
                    # Trim 10 characters from each end of the sequence and
                    # quality lines (chosen by position, since a quality
                    # string may itself start with '@' or '+')
                    fileout.write(line[10:-10] + '\n')
                else:
                    fileout.write(line + '\n')
```
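
To make the effect of the filter concrete, here is a minimal sketch that applies the same rule to a single in-memory record. The function name `select_merged`, its parameters, and the sample reads are hypothetical. The `merged_N_M` header comment is an assumption about what the upstream merge step writes.

```python
def select_merged(record, min_length=150, trim=10):
    """Return the trimmed record, or None if the merged read is too short.

    `record` is a list of the four FASTQ lines without trailing newlines.
    """
    header, seq, plus, qual = record
    # Assumed header comment of the form merged_N_M
    parts = header.split()[1].split('_')
    if int(parts[1]) + int(parts[2]) < min_length:
        return None
    # Trim the same number of characters from sequence and quality
    return [header, seq[trim:-trim], plus, qual[trim:-trim]]


# Hypothetical merged read: 150 bases from read 1 plus 20 from read 2
kept = select_merged(["@read1 merged_150_20", "A" * 170, "+", "I" * 170])
# Hypothetical merged read shorter than the 150-base threshold
dropped = select_merged(["@read2 merged_100_0", "C" * 100, "+", "I" * 100])
print(len(kept[1]), dropped)
```

The first record passes the length check and comes back with 10 characters removed from each end of both the sequence and the quality string. The second is dropped.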

stefinfection commented 2 weeks ago

The first experiment is to run the existing version of RUFUS, without Yingqi's script, on the COLO829 cell line and look at the SVs. Then incorporate the script and see whether tandem dup calling improves.
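
One way to make that comparison concrete is to tally SV types in the two call sets. This is a minimal sketch, assuming both runs produce plain-text VCF files whose INFO column carries an `SVTYPE` tag, which this thread does not confirm for RUFUS. The function name `count_sv_types` and the example records are made up for illustration.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Tally SVTYPE values from the INFO column of VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(';'):
            if entry.startswith('SVTYPE='):
                counts[entry[len('SVTYPE='):]] += 1
    return counts


# Tiny made-up records, only to show the call
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\t.\tA\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=1500\n",
    "chr2\t2000\t.\tT\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2600\n",
]
print(count_sv_types(example))
```

Passing an open file handle for each run's VCF to the same function would give two tallies to compare. The count of interest is whichever label the caller uses for tandem duplications, for example `DUP`.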