Open bricoletc opened 6 days ago
Hi Brice,
That's an interesting idea. If I understand correctly, in the following example, where three proteins (P1, P2, and P3) are in the same order on two chromosomes but one chromosome has a large TE insertion, promer + mummerplot would find these three proteins as syntenic, right?
This would be antithetical to the syri design, as syri is meant to find such rearrangements as well. As such, I would prefer not to add "official" support for protein comparisons.
However, as a fan of hacky ways, I think it would be possible to include the script that pre-processes promer's show-coords file in the repository. That would allow experienced users to do the manipulations themselves while keeping things simple for less experienced users.
You are welcome to open a pull request (I think a normal fork and pull should work) and share your script.
Best, Manish
Hello Manish,
For a PR, where are the documentation files that get published at https://schneebergerlab.github.io/syri?
As for your specific example, I will look in detail at my own example and get back to you!
Hi again Manish,
So in your example, no: nucmer and promer will give the same results, provided P1, P2 and P3 are sufficiently similar to be aligned at the DNA level. I.e. both nucmer and promer + mummerplot would show hits for P1, P2 and P3, plus no alignment for the TE. (Btw, promer aligns all six-frame DNA translations of the reference and query, so P1/P2/P3 probably don't even have to be true proteins.)
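To make "six-frame DNA translations" concrete, here is a minimal Python sketch of the idea. It is only an illustration, not promer's actual implementation: it assumes the standard genetic code (mitochondrial genomes often use a different table), and the function names and the +1..+3 / -1..-3 frame labels are hypothetical choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; trailing bases are dropped, unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Return {frame: peptide} for the three forward and three reverse-complement frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, peptide)
```

Because every frame is translated regardless of annotation, any stretch of DNA yields six candidate peptide strings, which is why the aligned regions need not correspond to real proteins.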
However, using promer will increase the sensitivity of alignments for highly-diverged sequences. This can affect synteny, but in a good way IMO. Here is a concrete example: I aligned the mitochondrial sequences of two highly-diverged species using nucmer and promer. Here are the mummerplots side by side (nucmer left, promer right):
For the left-hand plot, syri won't infer any changes in synteny, but that's only because the sequences in the middle are too diverged at the DNA level to be aligned.
Here's the syri VCF for the nucmer-based alignment:
##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
contig_1 1 NOTAL1 N <NOTAL> . PASS END=1057;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
contig_1 1058 SYNAL1 N <SYNAL> . PASS END=1629;ChrB=contig_1;StartB=1078;EndB=1647;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 1058 SYN1 N <SYN> . PASS END=14596;ChrB=contig_1;StartB=1078;EndB=14858;Parent=.;VarType=SR;DupType=- GT 1
contig_1 1629 HDR1 N <HDR> . PASS END=5383;ChrB=contig_1;StartB=1647;EndB=9170;Parent=SYN1;VarType=ShV;DupType=. GT 1
contig_1 5384 SYNAL2 N <SYNAL> . PASS END=6252;ChrB=contig_1;StartB=9171;EndB=10041;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 6252 HDR2 N <HDR> . PASS END=12778;ChrB=contig_1;StartB=10041;EndB=13051;Parent=SYN1;VarType=ShV;DupType=. GT 1
contig_1 12779 SYNAL3 N <SYNAL> . PASS END=14596;ChrB=contig_1;StartB=13052;EndB=14858;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 14597 NOTAL2 N <NOTAL> . PASS END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
Because promer does align the two sequences almost entirely, we can then see a translocation.
Here's the corresponding syri VCF:
##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
contig_1 1 SYNAL1 N <SYNAL> . PASS END=768;ChrB=contig_1;StartB=25;EndB=792;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 1 SYN1 N <SYN> . PASS END=3555;ChrB=contig_1;StartB=25;EndB=3574;Parent=.;VarType=SR;DupType=- GT 1
contig_1 768 HDR1 N <HDR> . PASS END=919;ChrB=contig_1;StartB=792;EndB=972;Parent=SYN1;VarType=ShV;DupType=. GT 1
contig_1 920 SYNAL2 N <SYNAL> . PASS END=1474;ChrB=contig_1;StartB=973;EndB=1491;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 1474 HDR2 N <HDR> . PASS END=1656;ChrB=contig_1;StartB=1491;EndB=1681;Parent=SYN1;VarType=ShV;DupType=. GT 1
contig_1 1657 SYNAL3 N <SYNAL> . PASS END=3555;ChrB=contig_1;StartB=1682;EndB=3574;Parent=SYN1;VarType=.;DupType=. GT 1
contig_1 3556 NOTAL1 N <NOTAL> . PASS END=5311;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
contig_1 5312 SYNAL4 N <SYNAL> . PASS END=5782;ChrB=contig_1;StartB=9096;EndB=9569;Parent=SYN2;VarType=.;DupType=. GT 1
contig_1 5312 SYN2 N <SYN> . PASS END=9097;ChrB=contig_1;StartB=9096;EndB=12876;Parent=.;VarType=SR;DupType=- GT 1
contig_1 5743 SYNAL5 N <SYNAL> . PASS END=5973;ChrB=contig_1;StartB=9531;EndB=9761;Parent=SYN2;VarType=.;DupType=. GT 1
contig_1 5938 SYNAL6 N <SYNAL> . PASS END=6279;ChrB=contig_1;StartB=9724;EndB=10068;Parent=SYN2;VarType=.;DupType=. GT 1
contig_1 6279 HDR3 N <HDR> . PASS END=6313;ChrB=contig_1;StartB=10068;EndB=10103;Parent=SYN2;VarType=ShV;DupType=. GT 1
contig_1 6314 SYNAL7 N <SYNAL> . PASS END=8374;ChrB=contig_1;StartB=10104;EndB=12161;Parent=SYN2;VarType=.;DupType=. GT 1
contig_1 8021 SYNAL8 N <SYNAL> . PASS END=9097;ChrB=contig_1;StartB=11812;EndB=12876;Parent=SYN2;VarType=.;DupType=. GT 1
contig_1 9098 NOTAL2 N <NOTAL> . PASS END=9163;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
contig_1 9164 TRANSAL10 N <TRANSAL> . PASS END=11455;ChrB=contig_1;StartB=3720;EndB=5987;Parent=TRANS4;VarType=.;DupType=. GT 1
contig_1 9164 TRANS4 N <TRANS> . PASS END=12522;ChrB=contig_1;StartB=3720;EndB=7077;Parent=.;VarType=SR;DupType=- GT 1
contig_1 11251 TDM4 N <TDM> . PASS END=11455;ChrB=contig_1;StartB=5784;EndB=6011;Parent=TRANS4;VarType=ShV;DupType=. GT 1
contig_1 11251 TRANSAL11 N <TRANSAL> . PASS END=12522;ChrB=contig_1;StartB=5806;EndB=7077;Parent=TRANS4;VarType=.;DupType=. GT 1
contig_1 12523 NOTAL3 N <NOTAL> . PASS END=12779;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
contig_1 12780 SYN3 N <SYN> . PASS END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=.;VarType=SR;DupType=- GT 1
contig_1 12780 SYNAL9 N <SYNAL> . PASS END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=SYN3;VarType=.;DupType=. GT 1
contig_1 14382 NOTAL4 N <NOTAL> . PASS END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1
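To compare the two VCFs without reading them record by record, the annotations can be tallied per type. The following is a rough Python sketch under stated assumptions: the helper names are hypothetical, fields are split on any whitespace (the records as pasted here are space-separated, whereas a VCF file on disk is tab-delimited), and the reference span is taken as END - POS + 1. Overlapping records (e.g. SYNAL5 and SYNAL6 above) are counted in full, so the summed span can exceed the number of distinct bases covered.

```python
from collections import defaultdict


def parse_info(info):
    """Split a VCF INFO string such as 'END=10;ChrB=x' into a dict."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)


def summarise_syri_vcf(lines):
    """Return {annotation type: (record count, summed reference span in bp)}."""
    counts = defaultdict(int)
    spans = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        pos = int(fields[1])
        annotation = fields[4].strip("<>")  # '<SYNAL>' -> 'SYNAL'
        end = int(parse_info(fields[7])["END"])
        counts[annotation] += 1
        spans[annotation] += end - pos + 1  # overlaps are not merged
    return {name: (counts[name], spans[name]) for name in counts}


if __name__ == "__main__":
    # Three records copied from the promer-based VCF above.
    records = [
        "contig_1 9098 NOTAL2 N <NOTAL> . PASS END=9163;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=. GT 1",
        "contig_1 9164 TRANS4 N <TRANS> . PASS END=12522;ChrB=contig_1;StartB=3720;EndB=7077;Parent=.;VarType=SR;DupType=- GT 1",
        "contig_1 12780 SYN3 N <SYN> . PASS END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=.;VarType=SR;DupType=- GT 1",
    ]
    for name, (count, span) in summarise_syri_vcf(records).items():
        print(name, count, span)
```

Running the same function on both files would show, for instance, that a TRANS entry appears only in the promer-based output.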
Hello!
Thank you for this brilliant tool. I've been using it for an application in which syntenies inferred using mummer's nucmer (so, at the DNA level) were partial when compared with mummer's promer (as assessed using mummerplot). This is unsurprising, as promer works at the protein level and so captures more highly-diverged synteny.
I wanted to make a plotsr using promer coordinates and not nucmer coordinates, and I have found a simple, though slightly hacky, way of doing it. I'm happy to share how I did it, or to PR instructions to your documentation page, if you tell me where to do that.
It involves formatting the output of show-coords on promer .delta files in the same way as on nucmer .delta files, as show-coords produces slightly different .coords files (docs here).
Maybe in the long run you'd want to build in support inside syri directly; it might not be too difficult.
Best, Brice
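For readers wondering what such a reformatting step could look like, here is one possible Python sketch. It is not the script from this thread, and the column layouts are assumptions to be checked against the show-coords documentation: it supposes tab-delimited, header-free promer rows with 13 columns (two extra percentage columns, plus reading frames where nucmer output has directions), and rewrites them into an 11-column nucmer-like row. The function names are hypothetical.

```python
def promer_to_nucmer_coords(line):
    """Convert one tab-delimited promer show-coords row to a nucmer-like row.

    ASSUMED promer layout (13 columns):
        S1 E1 S2 E2 LEN1 LEN2 %IDY %SIM %STP FRM_REF FRM_QRY TAG_REF TAG_QRY
    ASSUMED nucmer-like layout (11 columns):
        S1 E1 S2 E2 LEN1 LEN2 %IDY DIR_REF DIR_QRY TAG_REF TAG_QRY
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 13:
        raise ValueError(f"expected 13 tab-separated columns, got {len(cols)}")
    (s1, e1, s2, e2, len1, len2, idy,
     _sim, _stp, frm_ref, frm_qry, tag_ref, tag_qry) = cols
    # Collapse reading frames (-3..-1, 1..3) into strand directions (-1 or 1).
    dir_ref = "1" if int(frm_ref) > 0 else "-1"
    dir_qry = "1" if int(frm_qry) > 0 else "-1"
    return "\t".join(
        [s1, e1, s2, e2, len1, len2, idy, dir_ref, dir_qry, tag_ref, tag_qry]
    )


def convert_lines(lines):
    """Yield converted rows, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield promer_to_nucmer_coords(line)
```

Whether the identity column should carry promer's percent identity or its percent similarity is a design choice this sketch does not settle; it keeps percent identity.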