open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License

optical dedup stats #59

Closed sergpolly closed 2 years ago

sergpolly commented 6 years ago

Enhancement:

The readid contains information about the spatial position of the physical read (short DNA molecule) on the sequencing flowcell. This information can be used to find out how many of the observed duplicates originate from the PCR step preceding sequencing, and how many were formed on the flowcell due to non-ideal conditions, read migration, etc.
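
For illustration, a rough sketch of pulling those coordinates out of a readid, assuming the standard Illumina instrument:run:flowcell:lane:tile:x:y layout (other sequencers may not follow it):

def parse_flowcell_position(readid):
    """Return (lane, tile, x, y) from a standard Illumina read name, or None.

    Assumes the instrument:run:flowcell:lane:tile:x:y layout; read names from
    other sequencers may not carry this information.
    """
    fields = readid.split(":")
    if len(fields) < 7:
        return None
    # the y field may carry a trailing comment separated by whitespace - strip it
    return (int(fields[3]), int(fields[4]),
            int(fields[5]), int(fields[6].split()[0]))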

Proposition:

we can add an onlineDeduplicator-style buffer in the stats module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD pair, until the next pair in the buffer is a non-DD one:

rid  c1   c2   p1   p2   s1   s2  pt  index
.    .    .    .    .    .    .   LL    1
.    .    .    .    .    .    .   DD    2
.    .    .    .    .    .    .   DD    3
.    .    .    .    .    .    .   DD    4
.    .    .    .    .    .    .   DD    5
.    .    .    .    .    .    .   LL    6

so with the input as above, the content of the buffer would include the pairs with indices 1-5 (inclusive).
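
A minimal sketch of such a buffer (class and callback names below are hypothetical, not existing pairtools code):

class DupRunBuffer:
    """Accumulate a run of pairs that contains at least one DD pair.

    Pairs are fed in one by one; once the run is closed by a non-DD pair,
    the buffered run (the original pair plus its duplicates) is handed to a
    callback for, e.g., clustering by flowcell position.
    """

    def __init__(self, on_run_complete):
        self._buffer = []
        self._seen_dd = False
        self._on_run_complete = on_run_complete

    def add_pair(self, rid, pair_type):
        if pair_type == "DD":
            self._seen_dd = True
            self._buffer.append(rid)
        else:
            if self._seen_dd:
                # a non-DD pair closes the current run of duplicates
                self._on_run_complete(list(self._buffer))
            # the new non-DD pair may be the original of the next run
            self._buffer = [rid]
            self._seen_dd = False

    def finalize(self):
        # flush a trailing run if the stream ends with DD pairs
        if self._seen_dd:
            self._on_run_complete(list(self._buffer))
        self._buffer = []
        self._seen_dd = False

# usage: runs = []; buf = DupRunBuffer(runs.append)
# then buf.add_pair(rid, pt) for every incoming pair, and buf.finalize() at the end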

The read_id (rid) column of that buffer can then be used to cluster pairs by their X,Y position on the flowcell and to generate the corresponding statistics. We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, but it would have to be optional, because it is not known whether the readid of "any" sequencer carries such information, or whether it follows the same exact format.

Software:

from a coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt) accept a read-id as well, i.e. stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt) (either permanently or optionally - how, exactly, is still open), and implement that onlineDedup-style buffer inside the stats, so that everything stays nicely hidden from the dedup side.
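
One possible shape for the optional variant, building on the buffer sketch above (class and attribute names are illustrative assumptions, not the existing pairtools API):

class PairCounter:
    """Illustrative only: how an optional readid could thread into stats."""

    def __init__(self, track_optical_dups=False):
        self._track_optical_dups = track_optical_dups
        self._dup_run_buffer = (
            DupRunBuffer(self._process_dup_run) if track_optical_dups else None
        )

    def add_pair(self, c1, p1, s1, c2, p2, s2, pt, readid=None):
        # ... the existing per-pair bookkeeping would stay here ...
        # readid is optional, so existing callers keep working unchanged
        if readid is not None and self._track_optical_dups:
            self._dup_run_buffer.add_pair(readid, pt)

    def _process_dup_run(self, rids):
        # placeholder: cluster the buffered read ids by their flowcell position
        pass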

stats is becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats into _stats.pyx with the class and methods, and pairsamtools_stats.py - a stand-alone script capable of computing stats).

Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam, @nvictus, @golobor - what do you think?

sergpolly commented 4 years ago

@betulakgol could you please dump all of your revelations on the topic of optical deduplication here - we'll see how to make use of them

betulakgol commented 4 years ago

To find optical duplicates, I look for identical sequences and check their positions on the flowcell. If they are close to each other, I call them optical duplicates; if not, I call them PCR duplicates. These scripts do not allow for any mismatches.

HiSeq 4000 flowcell:

[screenshot: HiSeq 4000 flowcell]

The probability of two PCR duplicates being in the same tile is low, 1/112 * 1/112. So if two identical sequences are located in the same tile, I consider them optical duplicates. Here is what I use:

# Step 1
# Modify the fastq files

read1_all=$1
read2_all=$2

read1="$(cut -d'/' -f10 <<<$read1_all)"_R1
read2="$(cut -d'/' -f10 <<<$read1_all)"_R2

 echo $read1
 echo $read2
 echo " modify fastq files, R1 and R2"
 zcat $read1_all | awk '{ if (NR %4 ==1 || NR %4 ==2 ) {print $0} }' > "$read1.mod.fastq"
 zcat $read2_all | awk '{ if (NR %4 ==1 || NR %4 ==2 ) {print $0} }' > "$read2.mod.fastq"

 # create table of R1 and R2
 # ORS: output record separator
 # NR: the current line (record) number
 # FS: input field separator
 # RS: input record separator (defines a new line)

 cat "$read1.mod.fastq" | awk '{ORS=(NR%2?FS:RS)}1' > "$read1.mod.txt" 
 cat "$read2.mod.fastq" | awk '{ORS=(NR%2?FS:RS)}1' > "$read2.mod.txt"
 rm "$read1.mod.fastq" 
 rm "$read2.mod.fastq"

   echo "combine R1 and R2"
   paste <(awk '{print $1}' "$read1.mod.txt" ) \
   <(awk '{print $2}' "$read1.mod.txt" ) \
   <(awk '{print $2}' "$read2.mod.txt" ) \
   <(awk '{print $3}' "$read1.mod.txt" ) \
   <(awk '{print $3}' "$read2.mod.txt" ) \
   | gzip > "$read1.mod.R12.txt.gz"

  echo "Step 4 subset R12 file"
   zcat "$read1.mod.R12.txt.gz" | awk '{print $1"\t"$4$5}' | awk -F ':' '{print $5"\t"$6"\t"$7"\t"$8$9}' | gzip > "$read1.mod.subset.R12.txt.gz"

  # Step 2
  # Sort the modified file by the sequence (parallel sorting)

  # echo "Step 5 sort the R12 file by the sequence"
  data_name="$read1.mod.subset.R12.txt"
  zcat $data_name | $HOME/bin/bin/sort --parallel=5 -k4,4 -o "$data_name.sorted.txt"

 # Step 3
 # Remove unique reads from the file.

  echo "the total number of reads "
  cat $data_name".sorted.txt" | wc -l

  cat $data_name".sorted.txt" | uniq -f3 -D > $data_name".sorted_dup.txt" 
  echo "the number of duplicated reads "
  cat $data_name".sorted_dup.txt" | wc -l

 # Step 4
 # Extract the duplicated reads that are in the same tile. 
 module load python3/3.5.0
 module load python3/3.5.0_packages/scipy/0.16.1
 echo " sort non unique reads "
 cat "$data_name.sorted_dup.txt" | $HOME/bin/bin/sort --parallel=5 -k1,1 > "$data_name.sorted_dup_sorted.txt"

# split data into tiles
 python3 ./split_file.py "$data_name.sorted_dup_sorted.txt"

echo " the number unique reads per tile "
for i in $(ls | grep "tile") ; do \
echo $i $(cat $i | uniq -f3 | wc -l ) ; done

echo "Duplicated reads per tile "
for i in $(ls | grep "tile") ; do \
python3 ./find_the_number_of_dup_pertile.py $i ; done

# Step 5
# calculate the % of duplication
module load R/3.2.2 
echo "Step 9 calculate the % of duplication "
Rscript --vanilla ./redundancy_last.R find_redundancy_houda_R1.log

Scripts:

split_file.py

from itertools import groupby
import csv
import sys

# The input is sorted by tile (column 1), so groupby yields one group per tile;
# each group is written to its own <tile>.tile.txt file.
for key, rows in groupby(csv.reader(open(sys.argv[1], 'rt'), delimiter='\t'),
                         lambda row: row[0]):
    with open("%s.tile.txt" % key, "w") as output:
        for row in rows:
            output.write("\t".join(row) + "\n")

find_the_number_of_dup_pertile.py

# This script reads the duplicated reads per tile and reports whether each duplication comes from the same tile or from a different tile.

from collections import defaultdict
import sys

line_count = 0
tile_dict = defaultdict(list)

# Collect, for every duplicated sequence (column 4), the tiles (column 1) it
# was observed in; within one per-tile file the tile is the same for all rows.
with open(sys.argv[1], 'rt') as f:
    for line1 in f:
        line1_ = line1.split()
        if len(line1_) > 3:
            tile_name = line1_[0]
            line_count = line_count + 1
            tile_dict[line1_[3]].append(line1_[0])

# A sequence seen N times within this tile contributes N-1 same-tile duplicates;
# the remaining duplicated reads have their copies in other tiles.
len_value_total = 0
for key, value in tile_dict.items():
    len_value = len(value)
    if len_value > 1:
        len_value_total = len_value_total + len_value - 1
different_tile = line_count - len_value_total
print(str(tile_name) + " total_seq=" + str(line_count) +
      " same_tile=" + str(len_value_total) +
      " different_tile=" + str(different_tile))

redundancy_last.R

args = commandArgs(trailingOnly=TRUE)
filename=args[1]
# filename="find_redundancy.log"
data <- read.delim(filename,header=T,sep="\t",fill=TRUE)
total_numofreads=data[1,1]
print(total_numofreads)
total_dup_seq=0
same_tile=0
diff_tile=0
duplication_stats=c()

# the per-tile summary lines in the log follow the "Step 8" marker; 112 tiles in total
start_tile = grep("Step 8",data[,1],fixed=TRUE)+1
end_tile = start_tile+111

print(start_tile)
print(end_tile)
for (i in start_tile:end_tile)
{
  info=data[i,1]
  get_info = gsub(" ","=",info)
  get_info = strsplit(get_info,split="=")
  total_dup_seq = total_dup_seq+as.numeric(as.character(get_info[[1]][3]))
  same_tile=same_tile+as.numeric(as.character(get_info[[1]][5]))
  diff_tile=diff_tile+as.numeric(as.character(get_info[[1]][7]))
}
percent_of_dup_reads = total_dup_seq*100/as.numeric(as.character(total_numofreads))
duplication_stats=rbind(duplication_stats,percent_of_dup_reads)
same_tile_res=same_tile*100/as.numeric(as.character(total_numofreads))
duplication_stats=rbind(duplication_stats,same_tile_res)
diff_tile_res=diff_tile*100/as.numeric(as.character(total_numofreads))
duplication_stats=rbind(duplication_stats,diff_tile_res)
write.table(duplication_stats,"duplication_stats.txt")
sergpolly commented 4 years ago

Thank you @betulakgol !

betulakgol commented 4 years ago

This analysis is designed specifically for a patterned flowcell (HiSeq 4000); beware of non-patterned flowcells.

https://docs.google.com/presentation/d/1-VHdLBglclF2JehN7nymFriDPx-Lh5c69x3RS-Btgm8/edit?usp=sharing

sergpolly commented 4 years ago

relevant: https://www.biostars.org/p/12538/ http://core-genomics.blogspot.com/2016/05/increased-read-duplication-on-patterned.html

golobor commented 4 years ago

@sergpolly @betulakgol so, you're thinking to implement optical dedup?

mimakaev commented 4 years ago

I thought we were in a unique position to actually get rid of all/most of the duplicates with Hi-C dedup - because the two sides of the read are far apart, we can use just the start/end positions to dedup. 1D genomic tracks do not have this luxury, so optical dedup may be more relevant there.

sergpolly commented 4 years ago

I guess we do get rid of them just fine; this was more of a debugging aid for experimentalists - i.e. for trying to understand why there are so many duplicates: poor library prep (too many PCR cycles etc.) or something with the sequencer

at least this was the original thinking, and AFAIK there were HiSeq 4000-specific duplicate-related issues in the dekkerlab

sergpolly commented 4 years ago

@golobor this is far from a priority right now - I just wanted @betulakgol to share her findings on the topic before we completely forget about this

golobor commented 4 years ago

ah, gotcha! I guess one simple thing we could do is the following: (a) as you suggested, pass the readid to the dedupper; (b) for each read tagged as a duplicate, calculate its Euclidean distance to the read that it is a duplicate of; (c) output this Euclidean distance as an extra column in the pairs file with duplicates
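
A rough sketch of (b), assuming the standard Illumina read-name layout (an illustration, not existing pairtools code):

import math

def flowcell_distance(readid1, readid2):
    """Euclidean distance (in flowcell pixel units) between two reads.

    Assumes standard Illumina read names (instrument:run:flowcell:lane:tile:x:y);
    returns None if the reads sit on different lanes/tiles or the names cannot
    be parsed, since tile-local x/y are not comparable across tiles.
    """
    try:
        lane1, tile1, x1, y1 = (int(v) for v in readid1.split(":")[3:7])
        lane2, tile2, x2, y2 = (int(v) for v in readid2.split(":")[3:7])
    except ValueError:
        return None
    if (lane1, tile1) != (lane2, tile2):
        return None
    return math.hypot(x1 - x2, y1 - y2)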

As an alternative to (a-c), we could instead do this (a', per a previous discussion with @Phlya): for every duplicate, just store the ID of the read that it is a duplicate of. This should provide enough information for a separate CLI tool that would run through the duplicate pairs file and detect all optical duplicates.
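
A rough sketch of the core loop of such a tool, reusing flowcell_distance() from above (the parent-readID input and the distance threshold are assumptions for illustration):

OPTICAL_DIST_THRESHOLD = 100  # flowcell pixel units; would need tuning per instrument

def classify_duplicates(dup_pairs, threshold=OPTICAL_DIST_THRESHOLD):
    """Split duplicates into optical vs (presumed) PCR duplicates.

    dup_pairs yields (readid, parent_readid) tuples, where parent_readid is
    the read that the pair was called a duplicate of.
    """
    n_optical, n_pcr = 0, 0
    for readid, parent_readid in dup_pairs:
        d = flowcell_distance(readid, parent_readid)
        if d is not None and d <= threshold:
            n_optical += 1
        else:
            n_pcr += 1
    return n_optical, n_pcr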


Phlya commented 2 years ago

@sergpolly

AFAIK there were HiSeq 4000-specific duplicate-related issues in the dekkerlab

Do you remember what these issues were? Or does someone else there? I think all patterned flowcells have a weird distribution of duplicates, at least for the micro-C libraries that I have been testing...

betulakgol commented 2 years ago

@Phlya @sergpolly The HiSeq 4000 with the patterned flowcell increased the technical duplicates of an experiment. Basically, the amplification (before base detection) bridges over to the neighboring pores, and the same reads get over-amplified and sequenced. We observed this in Hi-C experiments. One thing that helped us was to load less material onto the flowcell before sequencing.

Phlya commented 2 years ago

@betulakgol thank you! So you think it was worth lowering the number of clusters to get fewer reads, but with fewer duplicates? I have been struggling with these weird duplicates for a while - I have libraries where 80% of all duplicates are some kind of artefactual duplicate from the sequencer, and that's like 15% of all mapped reads! Which is a bit crazy, and doesn't really make it possible to tell from shallow sequencing whether the libraries are good or not...

betulakgol commented 2 years ago

@Phlya Oh, your duplication rate sounds bad. We did not have as much; I think we had around 30% technical duplicates. And yes, lowering the number of clusters helps and may not give you fewer reads. I don't know by how much the sequencing core lowered the loading amount, but we had a similar number of reads in both libraries, one with regular loading and one with less loading. One other thing the sequencing core tried was to add an extra washing step for the libraries before sequencing. Hope that is helpful.

Phlya commented 2 years ago

@betulakgol thank you! I am meeting the head of the facility tomorrow and will discuss our options. They only just got a NovaSeq and didn't have anything with patterned flowcells recently, so they are also new to this, I am afraid... He mentioned they generally try to load on the higher end, in part to try to prevent this duplication actually, I assume to avoid having empty nanowells that could allow clustering duplicates. Did the extra washing help?

betulakgol commented 2 years ago

@Phlya The extra washing step definitely helped us. I just checked: in our case, loading less material improved mapped reads by 14%, and the washing improved them by 13%. Our 51% duplication rate decreased to 24% with these two steps. Good luck with the meeting; hopefully you solve the issue.

Phlya commented 2 years ago

@betulakgol Thanks a lot, that's very helpful! I might ask you for specifics later, if the facility is interested in trying these things.

betulakgol commented 2 years ago

@Phlya I am happy to help.

Phlya commented 2 years ago

@betulakgol Talking to our facility head: he says that the chemistry in the NovaSeq is different from the HiSeq 4000, and loading less should only lead to more clustering duplication! Just FYI.

agalitsyna commented 2 years ago

Added in https://github.com/open2c/pairtools/pull/105 to the pre-1.0.0 pairtools version.