vpc-ccg / sedef

Identification of segmental duplications in the genome
MIT License
26 stars 8 forks source link

Has sedef been run for hg38? #13

Closed mrvollger closed 4 years ago

mrvollger commented 4 years ago

I am wondering if results of sedef already exist for hg38, and if so if you have compared them with the WGAC results for hg38?

Thanks! Mitchell

mrvollger commented 4 years ago

I want to bump/follow up on this.

In my test I when I run sedef on the repeat masked hg38 I only identify about 70% of the SDs in annotated for hg38. I was very surprised about this given the great results on hg19, and my thought is that I am doing something wrong.

If it would help for me to provide my inputs and outputs let me know.

Best, Mitchell

inumanag commented 4 years ago

Hi @mrvollger Thanks for the report--- that is indeed unexpected. I have not run it on hg38 yet, and I won't be able to do it anytime soon. However, my team is currently doing some large-scale code updates and refactoring and I will ask them to check this out.

mrvollger commented 4 years ago

In case it is helpful below I have copied my commands/analysis that shows about 75% overlap with hg38 segdups. I don't think there is an obvious error in my commands, but I figured I should share it in case. I have also attached the final.bed output. final.bed.gz

I look forward to the sedef updates. Please keep me updated.

sedef commands

# checked and this copy is repeat masked and TRF masked
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
samtools faidx hg38.fa
sedef.sh -o hg38_sedef_out -j 160 -t /tmp/mvollger/sedef/second_hg38_translation.fa hg38.fa

WGAC comparison

sedef=hg38_sedef_out/final.bed
bedtools sort -i $sedef | bedtools merge -i - > sedef.merged.bed
sedef_bases=$(awk '{sum+=$3-$2}END{print sum}' sedef.merged.bed)

wget http://humanparalogy.gs.washington.edu/build38/data/GRCh38GenomicSuperDup.tab
bedtools sort -i GRCh38GenomicSuperDup.tab | bedtools merge -i - > GRCh38GenomicSuperDup.merged.bed
sd_bases=$(awk '{sum+=$3-$2}END{print sum}' GRCh38GenomicSuperDup.merged.bed)

bedtools intersect -a GRCh38GenomicSuperDup.merged.bed -b sedef.merged.bed | \
        bedtools sort -i - |  bedtools merge -i - > intersect.bed
inter_bases=$(awk '{sum+=$3-$2}END{print sum}' intersect.bed)

printf "WGAC bases:\t$sd_bases\nsedef bases:\t$sedef_bases\nintersection:\t$inter_bases\n"

Results:

WGAC bases:     175457472
sedef bases:    271868967
intersection:   132604819
yenyilin commented 4 years ago

Currently you only use source of each SD, but some cases such as

chr11 63657736 63658666 chr10 100142023 100143016 S 59.6 + + 993 1309 m=6.5;g=53.1 1309 379 316 614 529 85 50 35 0.861564 0.404125 0.153041 0.154409 9 633 608 516 529 85 9 695 22M1D94M307D61M2I33M2D106M1I15M375I194M1D21M5D30M1I38M 0.849117

will have their source and sink swapped in some of my rerun. If you use both source and sink in the evaluation, you should get the results as follows.

WGAC bases:     175457472
sedef bases:    232109697
intersection:   164862725
mrvollger commented 4 years ago

Thanks, I checked one entry and it was symmetric and then assumed it would be symmetric like WGAC. May I suggest commenting in the README that the results are not symmetric, just for people who are used to WGAC like me.

I now get much better results:

WGAC bases:     175457472
sedef bases:    384622372
intersection:   173664586

However, I do notice that I have far more total bases than you. Is there any filtering you are doing?

Thanks! Mitchell

yenyilin commented 4 years ago

I see. I did not include patched chromosome (ALT and random) in my sedef run and therefore have much less reports than yours.

mrvollger commented 4 years ago

Makes perfect sense. I forgot to download no alts for this script.

Thanks again!

inumanag commented 4 years ago

Hi @mrvollger

One more thing I just remembered: by default, we filter out all chromosomes with "weird" names (patches, random and unplaced sequences). That can also affect the final numbers (WGAC includes them).

mrvollger commented 4 years ago

Got it, for my purposes I will be including the "weird" ones but excluding alts to try to recreate WGAC results. My results on this are now very good (99% overlap). Thanks again!

WGAC bases:     175457472
sedef bases:    241748573
intersection:   173641785
inumanag commented 4 years ago

Awesome--- that matches our experience with hg19!

If you are including alts and "weird" stuff, I recommend to run sedef with translate (-t) option for faster execution.

mrvollger commented 4 years ago

I have been using -t thanks!