schneebergerlab / msyd

MIT License

very few coresynteny region #14

Open baozg opened 1 week ago

baozg commented 1 week ago

Hi, @lrauschning

I tried msyd with 279 A. thaliana genomes, but the core synteny region is small. For example, Chr1 only has a few regions, located only in Chr1:1-200000 and Chr1:2700000-32000000. Would it be possible to lower the threshold? For example, could any region with SNPs and InDels shorter than 50 bp still count as coresynteny?

By the way, why does msyd generate lots of duplicated regions with different synIDs?

Chr5    5314409 5360725 CORESYN423      Chr5:5314934-5361133,ref,48=1X77=1X36=1I2=1X45=1X11=1X1099=1X143=1X91=35D144=12D3=1X29=1X83=2I49=4I48=1X13=1I345=1X457=1X478=3D1477=2I421=1X466=1X340=>
Chr5    5360720 5360880 CORESYN424      Chr5:5361128-5361288,ref,161=   Chr5:5384127-5384287,ref,161=   Chr5:5405124-5405284,ref,161=   Chr5:5346493-5346653,ref,161=   Chr5:5336475-5336635,r>
Chr5    5360813 5360880 CORESYN425      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN426      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN427      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN428      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN429      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN430      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN431      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN432      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN433      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN434      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN435      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN436      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN437      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN438      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN439      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN440      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN441      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN442      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN443      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN444      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN445      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN446      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN447      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN448      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN449      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN450      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN451      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN452      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN453      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360880 CORESYN454      Chr5:5361221-5361288,ref,68=    Chr5:5384220-5384287,ref,68=    Chr5:5405217-5405284,ref,68=    Chr5:5346586-5346653,ref,68=    Chr5:5336568-5336635,r>
Chr5    5360813 5360923 CORESYN455      Chr5:5361221-5361331,ref,111=   Chr5:5384220-5384330,ref,111=   Chr5:5405217-5405327,ref,111=   Chr5:5346586-5346696,ref,111=   Chr5:5336568-5336678,r>
Chr5    5473204 5605559 CORESYN456      Chr5:5472870-5602260,ref,453=1X988=1X2531=46D183=1X466=1D2117=1X67=1X1304=1X5037=1X185=2I580=2I3=1I3775=1D2051=1X15=1I19=1X186=1X332=1D197=1X61=1X756=>
lrauschning commented 1 week ago

Hi Zhigui,

nice to see msyd didn't crash on so many genomes :D How long did it run, BTW? The duplication of CORESYN425-454 seems like a bug. I ran into similar issues in the realignment algorithm when minimap2 wasn't finding any alignments, which caused duplicate regions to be inserted, but I never had that occur with coresynteny. Do you have the logs or a sample so I can replicate it & try to find where the bug is happening?
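
To gauge how widespread the duplication is before sharing logs, something like the following rough sketch could help. It is not part of msyd; it assumes the whitespace-separated layout shown in the output above (chromosome, start, end, region ID, then per-sample fields), and `find_duplicate_regions` is a hypothetical helper name.

```python
from collections import defaultdict


def find_duplicate_regions(lines):
    """Group region IDs by (chrom, start, end); keep only coordinates with >1 ID.

    Assumes columns 1-4 are chromosome, start, end and region ID, as in the
    output pasted in this issue. All other columns are ignored.
    """
    by_coords = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or line.startswith("#"):
            continue  # skip blank, short or comment lines
        chrom, start, end, region_id = fields[:4]
        try:
            key = (chrom, int(start), int(end))
        except ValueError:
            continue  # skip rows whose coordinates are not integers
        by_coords[key].append(region_id)
    return {k: v for k, v in by_coords.items() if len(v) > 1}


# A few rows taken from the output above (per-sample columns shortened).
rows = [
    "Chr5\t5360720\t5360880\tCORESYN424\tChr5:5361128-5361288,ref,161=",
    "Chr5\t5360813\t5360880\tCORESYN425\tChr5:5361221-5361288,ref,68=",
    "Chr5\t5360813\t5360880\tCORESYN426\tChr5:5361221-5361288,ref,68=",
    "Chr5\t5360813\t5360923\tCORESYN455\tChr5:5361221-5361331,ref,111=",
]
duplicates = find_duplicate_regions(rows)
for (chrom, start, end), ids in duplicates.items():
    print(f"{chrom}:{start}-{end}\t{len(ids)} IDs\t{','.join(ids)}")
```

To run it on a real file, pass the open file handle instead of `rows`; the function only needs an iterable of lines.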

Weird that msyd seems to find quite a lot of coresynteny in Chr5 but little in Chr1. How does it look in the other chromosomes? Absence of coresyn in a region can be caused by individual large structural variants/misassemblies (we saw this e.g. with the Sha inversion on Chr3 in the AMPRIL population). It might be worth plotting Chr1 using plotsr to see if any samples are highly divergent/misassembled.
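
For comparing chromosomes, a minimal sketch like the one below sums the reference bases covered by coresyn regions per chromosome. It merges overlapping intervals first, so duplicated or nested regions are not counted twice. It assumes 1-based inclusive coordinates (consistent with the rows above, where 5360813-5360880 carries a `68=` alignment); `coresyn_length_per_chrom` is a hypothetical helper, not an msyd function.

```python
from collections import defaultdict


def coresyn_length_per_chrom(regions):
    """Return {chrom: covered bp} for an iterable of (chrom, start, end).

    Coordinates are treated as 1-based and inclusive (an assumption based on
    the output in this issue). Overlapping intervals are merged before summing.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    totals = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        covered = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the current merged interval: extend it if needed.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        totals[chrom] = covered
    return totals


# Coordinates of CORESYN424, CORESYN425 and CORESYN455 from the output above.
example = [
    ("Chr5", 5360720, 5360880),
    ("Chr5", 5360813, 5360880),
    ("Chr5", 5360813, 5360923),
]
print(coresyn_length_per_chrom(example))
```

Dividing each total by the chromosome length would give a coverage fraction that is easier to compare between Chr1 and Chr5.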

More generally, SyRI/msyd does find synteny even in regions containing indels/SNPs (e.g. CORESYN456 in the output above). In our experience, playing around with minimap2 parameters can help to find somewhat more consistent synteny. You could also try changing the INDEL threshold in SyRI here, though the default threshold has been working fine for us so far. The threshold msyd uses during the synteny intersection step is defined here as 30 bp (not exposed in the CLI right now); you can also try playing around with that.
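
To see how many SNPs and indels an existing coresyn region already tolerates, the alignment strings in the output can be summarised with a small sketch like this. It only reads the extended CIGAR operations visible above (`=`, `X`, `I`, `D`) and does not reproduce how SyRI or msyd apply their thresholds; `summarize_cigar` is a hypothetical helper.

```python
import re

# Extended CIGAR operations seen in the output: '=' match, 'X' mismatch,
# 'I' insertion, 'D' deletion. Other operations are ignored by this pattern.
CIGAR_OP = re.compile(r"(\d+)([=XID])")


def summarize_cigar(cigar):
    """Count mismatched bases and indels, and report the largest indel."""
    mismatches = 0
    indel_count = 0
    largest_indel = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "X":
            mismatches += n
        elif op in ("I", "D"):
            indel_count += 1
            largest_indel = max(largest_indel, n)
    return {
        "mismatches": mismatches,
        "indels": indel_count,
        "largest_indel": largest_indel,
    }


# Only the beginning of the CORESYN456 alignment string; the full string is
# cut off in the pasted output.
prefix = "453=1X988=1X2531=46D183=1X466=1D2117="
print(summarize_cigar(prefix))
```

Running this over every region would show the distribution of largest indels inside current coresyn calls, which may help when deciding whether a different threshold is worth trying.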

LMK if that helps!

baozg commented 1 week ago

I confirmed with miniprot alignment that at least Chr1 has many shared core genes, so I think something weird happened. msyd only finds 11 Mb of sequence across the 5 chromosomes. Public genomes from NCBI may be enough to reproduce this. I could share a link with you.