schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

Support for promer #259

Open bricoletc opened 6 days ago

bricoletc commented 6 days ago

Hello!

Thank you for this brilliant tool. I've been using it for an application in which the synteny inferred using MUMmer's nucmer (i.e. at the DNA level) was only partial compared with that inferred using MUMmer's promer (as assessed using mummerplot). This is unsurprising, as promer aligns at the protein level and so can detect more highly diverged synteny.

I wanted to make a plotsr plot using promer coordinates rather than nucmer coordinates, and I have found a simple, though slightly hacky, way of doing it. I'm happy to share how I did it, or to PR instructions to your documentation page if you tell me where to do that.

It involves reformatting the output of show-coords run on promer .delta files so that it matches the output for nucmer .delta files, since show-coords produces slightly different .coords files for the two (see the MUMmer show-coords docs).

Maybe in the long run you'd want to build support into syri directly; it might not be too difficult.
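A minimal sketch of that kind of reformatting, and not the script referred to above. It assumes `show-coords -THrd` was run on the promer .delta file, that the nucmer-style layout has 11 tab-separated columns (S1, E1, S2, E2, LEN1, LEN2, %IDY, two direction columns, two sequence names), and that the promer layout has 13 columns, with %SIM and %STP after %IDY and reading frames (-3 to 3) where nucmer has directions. The names `promer_to_nucmer_line` and `convert_coords` are hypothetical; please check the column layout against your own files first.

```python
def promer_to_nucmer_line(line):
    """Convert one promer-style coords row to the assumed nucmer-style layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        # Assumption: promer rows from `show-coords -THrd` have 13 columns.
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    s1, e1, s2, e2, len1, len2, idy, _sim, _stp, frm_r, frm_q, tag_r, tag_q = fields
    # Collapse a reading frame (-3..3) to a strand direction (1 or -1).
    dir_r = "1" if int(frm_r) > 0 else "-1"
    dir_q = "1" if int(frm_q) > 0 else "-1"
    # Drop %SIM and %STP, which the nucmer-style layout does not have.
    return "\t".join([s1, e1, s2, e2, len1, len2, idy, dir_r, dir_q, tag_r, tag_q])


def convert_coords(in_path, out_path):
    """Rewrite a whole promer .coords file, skipping blank lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(promer_to_nucmer_line(line) + "\n")
```

Something like `convert_coords("promer.coords", "promer.as_nucmer.coords")` (both file names are placeholders) would then produce a file in the assumed nucmer-style layout.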

Best, Brice

mnshgl0110 commented 5 days ago

Hi Brice, That's an interesting idea. If I understand correctly, in the following example, where three proteins (P1, P2, and P3) are in the same order on two chromosomes but one of the chromosomes has a large TE insertion, promer + mummerplot would find these three proteins as syntenic, right? [image]

This would be the antithesis of syri's design, as syri is meant to find such rearrangements as well. As such, I would prefer not to add "official" support for protein comparisons.

However, as a fan of hacky ways, I think it would be possible to include the script that pre-processes promer's show-coords file in the repository. That would allow experienced users to do the manipulations themselves while keeping things simple for less experienced users.

You are welcome to open a pull request (I think the normal fork-and-pull workflow should work) and share your script.

Best, Manish

bricoletc commented 4 days ago

Hello Manish,

For a PR, where are the documentation files that get published at https://schneebergerlab.github.io/syri?

As for your specific example, I will look at my own example in detail and get back to you!

bricoletc commented 4 days ago

Hi again Manish,

So in your example, no: nucmer and promer will give the same results, provided P1, P2 and P3 are sufficiently similar to be aligned at the DNA level. That is, both nucmer and promer + mummerplot would show hits for P1, P2 and P3, and no alignment for the TE. (Btw, promer aligns all six-frame translations of the reference and query DNA, so P1/P2/P3 probably don't even have to be true proteins.)

However, using promer will increase the sensitivity of alignments for highly diverged sequences. This can affect synteny, but in a good way IMO. Here is a concrete example: I aligned the mitochondrial sequences of two highly diverged species using nucmer and using promer. Here are the mummerplots side by side (nucmer left, promer right):

image

For the left-hand (nucmer) plot, syri won't infer any changes in synteny, but that's only because the sequences in the middle are too diverged at the DNA level to be aligned. Here's the syri VCF for the nucmer-based alignment:

##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample
contig_1    1   NOTAL1  N   <NOTAL> .   PASS    END=1057;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.    GT  1
contig_1    1058    SYNAL1  N   <SYNAL> .   PASS    END=1629;ChrB=contig_1;StartB=1078;EndB=1647;Parent=SYN1;VarType=.;DupType=.    GT  1
contig_1    1058    SYN1    N   <SYN>   .   PASS    END=14596;ChrB=contig_1;StartB=1078;EndB=14858;Parent=.;VarType=SR;DupType=-    GT  1
contig_1    1629    HDR1    N   <HDR>   .   PASS    END=5383;ChrB=contig_1;StartB=1647;EndB=9170;Parent=SYN1;VarType=ShV;DupType=.  GT  1
contig_1    5384    SYNAL2  N   <SYNAL> .   PASS    END=6252;ChrB=contig_1;StartB=9171;EndB=10041;Parent=SYN1;VarType=.;DupType=.   GT  1
contig_1    6252    HDR2    N   <HDR>   .   PASS    END=12778;ChrB=contig_1;StartB=10041;EndB=13051;Parent=SYN1;VarType=ShV;DupType=.   GT  1
contig_1    12779   SYNAL3  N   <SYNAL> .   PASS    END=14596;ChrB=contig_1;StartB=13052;EndB=14858;Parent=SYN1;VarType=.;DupType=. GT  1
contig_1    14597   NOTAL2  N   <NOTAL> .   PASS    END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.   GT  1
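To put a number on how much of the reference falls into each annotation type, here is a rough sketch that sums reference bases per ALT type from records like the ones above. The function names are hypothetical, and it assumes whitespace-separated columns with POS in column 2, ALT in column 5, and an `END=` key in INFO, as in the listing above.

```python
def parse_info(info):
    """Turn 'END=5383;ChrB=contig_1;...' into a dict of strings."""
    return dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)


def ref_bases_by_type(vcf_lines):
    """Sum reference span (END - POS + 1) per ALT type, skipping header lines."""
    totals = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()          # works for tabs or runs of spaces
        pos = int(fields[1])
        alt = fields[4].strip("<>")    # "<HDR>" -> "HDR"
        end = int(parse_info(fields[7])["END"])
        totals[alt] = totals.get(alt, 0) + (end - pos + 1)
    return totals
```

Applied to the eight records above, this gives 10282 bp of HDR against 3259 bp of SYNAL on a 14620 bp contig, which is the "too diverged to be aligned" middle section in numbers. Adjacent records share boundary positions, so the per-type totals are not meant to add up exactly to the contig length.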

Because promer does align the two sequences almost entirely, we can then see a translocation. Here's the corresponding syri VCF:

##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample
contig_1    1   SYNAL1  N   <SYNAL> .   PASS    END=768;ChrB=contig_1;StartB=25;EndB=792;Parent=SYN1;VarType=.;DupType=.    GT  1
contig_1    1   SYN1    N   <SYN>   .   PASS    END=3555;ChrB=contig_1;StartB=25;EndB=3574;Parent=.;VarType=SR;DupType=-    GT  1
contig_1    768 HDR1    N   <HDR>   .   PASS    END=919;ChrB=contig_1;StartB=792;EndB=972;Parent=SYN1;VarType=ShV;DupType=. GT  1
contig_1    920 SYNAL2  N   <SYNAL> .   PASS    END=1474;ChrB=contig_1;StartB=973;EndB=1491;Parent=SYN1;VarType=.;DupType=. GT  1
contig_1    1474    HDR2    N   <HDR>   .   PASS    END=1656;ChrB=contig_1;StartB=1491;EndB=1681;Parent=SYN1;VarType=ShV;DupType=.  GT  1
contig_1    1657    SYNAL3  N   <SYNAL> .   PASS    END=3555;ChrB=contig_1;StartB=1682;EndB=3574;Parent=SYN1;VarType=.;DupType=.    GT  1
contig_1    3556    NOTAL1  N   <NOTAL> .   PASS    END=5311;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.    GT  1
contig_1    5312    SYNAL4  N   <SYNAL> .   PASS    END=5782;ChrB=contig_1;StartB=9096;EndB=9569;Parent=SYN2;VarType=.;DupType=.    GT  1
contig_1    5312    SYN2    N   <SYN>   .   PASS    END=9097;ChrB=contig_1;StartB=9096;EndB=12876;Parent=.;VarType=SR;DupType=- GT  1
contig_1    5743    SYNAL5  N   <SYNAL> .   PASS    END=5973;ChrB=contig_1;StartB=9531;EndB=9761;Parent=SYN2;VarType=.;DupType=.    GT  1
contig_1    5938    SYNAL6  N   <SYNAL> .   PASS    END=6279;ChrB=contig_1;StartB=9724;EndB=10068;Parent=SYN2;VarType=.;DupType=.   GT  1
contig_1    6279    HDR3    N   <HDR>   .   PASS    END=6313;ChrB=contig_1;StartB=10068;EndB=10103;Parent=SYN2;VarType=ShV;DupType=.    GT  1
contig_1    6314    SYNAL7  N   <SYNAL> .   PASS    END=8374;ChrB=contig_1;StartB=10104;EndB=12161;Parent=SYN2;VarType=.;DupType=.  GT  1
contig_1    8021    SYNAL8  N   <SYNAL> .   PASS    END=9097;ChrB=contig_1;StartB=11812;EndB=12876;Parent=SYN2;VarType=.;DupType=.  GT  1
contig_1    9098    NOTAL2  N   <NOTAL> .   PASS    END=9163;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.    GT  1
contig_1    9164    TRANSAL10   N   <TRANSAL>   .   PASS    END=11455;ChrB=contig_1;StartB=3720;EndB=5987;Parent=TRANS4;VarType=.;DupType=. GT  1
contig_1    9164    TRANS4  N   <TRANS> .   PASS    END=12522;ChrB=contig_1;StartB=3720;EndB=7077;Parent=.;VarType=SR;DupType=- GT  1
contig_1    11251   TDM4    N   <TDM>   .   PASS    END=11455;ChrB=contig_1;StartB=5784;EndB=6011;Parent=TRANS4;VarType=ShV;DupType=.   GT  1
contig_1    11251   TRANSAL11   N   <TRANSAL>   .   PASS    END=12522;ChrB=contig_1;StartB=5806;EndB=7077;Parent=TRANS4;VarType=.;DupType=. GT  1
contig_1    12523   NOTAL3  N   <NOTAL> .   PASS    END=12779;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.   GT  1
contig_1    12780   SYN3    N   <SYN>   .   PASS    END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=.;VarType=SR;DupType=-   GT  1
contig_1    12780   SYNAL9  N   <SYNAL> .   PASS    END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=SYN3;VarType=.;DupType=. GT  1
contig_1    14382   NOTAL4  N   <NOTAL> .   PASS    END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.   GT  1
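To compare the two VCFs without reading every row, a small sketch like the following pulls out only the structural rearrangement records (those with `VarType=SR`) together with their reference and query intervals. The function name is hypothetical, and the column positions and INFO keys are assumed to match the listings above.

```python
def structural_records(vcf_lines):
    """Return (id, type, ref_start, ref_end, qry_start, qry_end) for VarType=SR rows."""
    records = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()          # works for tabs or runs of spaces
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if info.get("VarType") != "SR":
            continue                   # skip alignments, short variants, NOTAL
        records.append((
            fields[2],                 # record ID, e.g. "TRANS4"
            fields[4].strip("<>"),     # ALT type, e.g. "TRANS"
            int(fields[1]),            # start on the reference
            int(info["END"]),          # end on the reference
            int(info["StartB"]),       # start on the non-reference genome
            int(info["EndB"]),         # end on the non-reference genome
        ))
    return records
```

On the promer-based VCF above this returns SYN1, SYN2, TRANS4 and SYN3, with TRANS4 mapping reference 9164-12522 to query 3720-7077; on the nucmer-based VCF it returns only SYN1.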