vpc-ccg / sedef

Identification of segmental duplications in the genome
MIT License

duplicated SD results #21

Closed mrvollger closed 2 years ago

mrvollger commented 3 years ago

Hi,

I am finding that some SDs are reported multiple times but with slightly different alignments, so they get past the simple duplicate filter. For example:

wc -l final.bed; sort final.bed | uniq | wc -l; cat final.bed | cut -f 1,2,3,4,5,6 | sort | uniq | wc -l
124580 final.bed # the number of lines in the file
124580 # number of unique lines in the file
123909 # number of unique pairs in the file

As you can see, the number of lines exceeds the number of unique pairs by 671 (124580 vs. 123909).

Is this intended (I expect not)? If not, is there a good way to filter these?

You can find the results of my run here if that is helpful: https://eichlerlab.gs.washington.edu/help/mvollger/share/sedef/

Thanks! Mitchell

inumanag commented 2 years ago

Hi @mrvollger

SEDEF is now deprecated in favour of BISER. Please see if you have the same issue with BISER.
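
For existing SEDEF output, one possible stopgap is to keep a single line per unique combination of the first six columns, the same columns used in the `cut -f 1,2,3,4,5,6` count above. This is only a minimal sketch and not something SEDEF or BISER provides: `dedup_pairs` is a hypothetical helper name, the input is assumed to be tab-separated, and the sketch simply keeps whichever alignment appears first for each pair, which may not be the best one. It also does not merge mirrored pairs where the two intervals are listed in swapped order.

```shell
# Hypothetical helper (not part of SEDEF or BISER).
# Prints the first line seen for each unique combination of the
# first six tab-separated columns of the file given as $1.
dedup_pairs() {
    awk -F'\t' '!seen[$1, $2, $3, $4, $5, $6]++' "$1"
}
```

It could be used as `dedup_pairs final.bed > final.dedup.bed`, where `final.dedup.bed` is an assumed output name. If a particular alignment should win for each pair, the file could be sorted by the preferred criterion before filtering.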