schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

SyRI identifying wrong chromosomes to reverse complement #111

Open kaede0e opened 2 years ago

kaede0e commented 2 years ago

Hi, we have been using SyRI for two-genome comparisons across multiple species pairs, and issue #48 has been helpful in fixing problems whenever an "Index out of range" error shows up. The chromosomes to reverse complement (flip) usually get identified by SyRI in a warning that looks something like this: "Reading Coords - WARNING - Reference chromosome CM018900.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014298.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors." So in this case, I would flip CM014298.1 and try again, and for the most part this has been working.
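For readers unfamiliar with the flipping step, here is a minimal sketch of reverse complementing selected records in a FASTA text. The function names are hypothetical and not part of SyRI, and only A, C, G, T and N are complemented; any other characters are left unchanged, which is an assumption that may not suit assemblies with IUPAC ambiguity codes.

```python
# Sketch only: reverse complement selected records in FASTA text.
# Helper names are hypothetical; this is not SyRI code.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement A/C/G/T/N (either case) and reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_fasta(text):
    """Return a list of (header_line, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def flip_records(text, ids_to_flip):
    """Reverse complement the records whose ID (first word of header) is listed."""
    lines = []
    for header, seq in parse_fasta(text):
        seq_id = header[1:].split()[0]
        if seq_id in ids_to_flip:
            seq = reverse_complement(seq)
        lines.extend([header, seq])
    return "\n".join(lines) + "\n"
```

Calling `flip_records(fasta_text, {"CM014298.1"})` would return the same records with only that chromosome flipped; the genomes then need to be re-aligned before running SyRI again.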

However, we encountered a strange situation with one particular pair we were working on (table_original.txt.gz is the coords file for the minimap2 alignment with the unreversed genome). The usual troubleshooting step above did not work, and SyRI couldn't run with the flipped genome. We were curious about what went wrong, so we visualized this coords file (coords_file_for_visualization_original.v0.pdf) and found that the chromosomes SyRI told us to flip were incorrect.

We then manually selected the chromosomes to flip based on the visualized plot, where they were easy to identify, and re-ran SyRI. This time it worked. But weirdly enough, this manual flipping still produced the warning:


Reading Coords - WARNING - Chromosomes IDs do not match.
Reading Coords - WARNING - Matching them automatically. For each reference genome, most similar query genome will be selected. Check mapids.txt for mapping used.
Reading Coords - WARNING - Reference chromosome CM018888.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014286.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018891.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014289.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018893.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014291.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018894.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014292.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018895.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014293.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018896.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014294.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018897.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014295.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018900.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014298.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018901.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014299.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018903.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014302.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018905.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014304.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018908.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014306.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018910.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014308.1/rc). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.


This shows that some of the chromosomes SyRI automatically recognizes and suggests to reverse complement are not correct. We are wondering why this happens, and whether it comes up frequently. How does SyRI recognize the "high fraction of inverted alignments"? We are currently trying to automate the whole process and are looking for a way around this so that we don't have to manually select chromosomes to flip from the visualization tools.

Thank you for your feedback, Kaede

mnshgl0110 commented 2 years ago

I am curious about the initial flipping and errors. I can imagine cases where the warning is unhelpful, but did you find chromosomes that required flipping for which SyRI did not give any warning?

Nevertheless, this is a very nice example to illustrate a current limitation of SyRI.

SyRI expects that the homologous chromosomes being compared are closely related, i.e. largely syntenic. It performs an initial check to see whether the number of inversely aligned bases exceeds the number of directly aligned bases for a reference chromosome. If so, it warns you to ensure the same strandedness, but this warning does not mean that the chromosome must be reverse complemented, as SyRI currently cannot determine that.

Too many inverted alignments could mean that different strands are being compared, resulting in no identified synteny and subsequent crashes. However, it is possible (as here) that homologous chromosomes on the same strand have many inversions, in which case the chromosomes should not be flipped, but SyRI could still give the warning.
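The check described above can be approximated outside SyRI. The following is a rough sketch, not SyRI's actual implementation: it assumes each alignment has already been reduced to a tuple of reference chromosome, query chromosome, reference start, reference end, and query direction (1 for forward, -1 for inverted), and the function names and the 0.5 threshold are illustrative assumptions. Overlapping alignments are counted more than once.

```python
from collections import defaultdict


def inverted_fraction(alignments):
    """Fraction of aligned reference bases that are inverted, per chromosome pair.

    alignments: iterable of (ref_chr, qry_chr, ref_start, ref_end, qry_dir),
    with 1-based inclusive coordinates and qry_dir equal to 1 or -1.
    This tuple layout is an assumption, not SyRI's internal format.
    """
    forward = defaultdict(int)
    inverted = defaultdict(int)
    for ref_chr, qry_chr, ref_start, ref_end, qry_dir in alignments:
        length = abs(ref_end - ref_start) + 1
        if qry_dir == -1:
            inverted[(ref_chr, qry_chr)] += length
        else:
            forward[(ref_chr, qry_chr)] += length
    pairs = set(forward) | set(inverted)
    return {p: inverted[p] / (forward[p] + inverted[p]) for p in pairs}


def pairs_to_review(alignments, threshold=0.5):
    """Chromosome pairs where inverted bases outnumber forward bases."""
    fractions = inverted_fraction(alignments)
    return sorted(p for p, frac in fractions.items() if frac > threshold)
```

As noted above, a pair returned by `pairs_to_review` is only a candidate for inspection: a same-strand pair with many real inversions would also exceed the threshold, so the result cannot decide on its own whether a chromosome should be flipped.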

It would be great to have automatic identification of which chromosomes to flip, and I intend to do that, but I am lacking time to work on it. If you can set it up and would like to share the code, then I can add it to SyRI so that it does this directly from the BAM files.

kaede0e commented 2 years ago

Yes, I did manually find chromosomes which required flipping but SyRI didn't give error for. Just for your reference, these are the warnings SyRI gave for the initial flipping. Interestingly, it gives a warning even for some chromosomes that are already in the right strandedness:


Reading Coords - WARNING - Reference chromosome CM018889.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014287.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018890.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014288.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018891.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014289.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018899.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014297.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018900.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014298.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018901.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014299.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018902.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014301.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018903.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014302.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018906.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014300.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018908.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014306.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018911.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014309.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018912.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014311.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018913.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014312.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.
Reading Coords - WARNING - Reference chromosome CM018915.1 has high fraction of inverted alignments with its homologous chromosome in the query genome (CM014314.1). Ensure that same chromosome-strands are being compared in the two genomes, as different strand can result in unexpected errors.


It makes sense that SyRI recognizes the strand with too many inversions and warns that the chromosome pair is not syntenic when it actually is syntenic overall. We are working on this to see whether the automation can be done in our pipeline. Thank you for your feedback!

mnshgl0110 commented 2 years ago

> Yes, I did manually find chromosomes which required flipping but SyRI didn't give error for.

So, if I understand correctly, these chromosomes required flipping because they were causing SyRI to crash even though there was no warning, right? Could you please share an example chromosome so that I can check what is happening there?

kaede0e commented 2 years ago

Yes, that is right. My apologies for the poor explanation. One example is CM014286.1, in the top left corner of the visualization. It requires flipping, so SyRI crashed, but you can see from my second discussion post that this chromosome was not listed in the warning. (Just a clarification: the warnings in the second discussion post were printed when I ran with the unflipped original genome.)

mnshgl0110 commented 2 years ago

Is it possible to share the sequences of CM014286.1 and its homolog? I would like to test them to see why this leads to a crash even when the majority of the genome has forward alignments.

kaede0e commented 2 years ago

For sure, here are the genome sequences: qryCM286.fasta.gz, refCM888.fasta.gz

mnshgl0110 commented 2 years ago

Thanks.

mnshgl0110 commented 2 years ago

Hi @kaede0e,

I finally got some time to check this. SyRI was crashing because it would annotate the entire chromosome as an inversion. With https://github.com/schneebergerlab/syri/commit/f70300bc5913f18ac77b8da88df3e4d2518a7f5d this issue should be alleviated and it should not crash. However, it is still not possible to accurately predict whether the sequence needs to be inverted or not, but as you mentioned, this can be easily checked from the visualization.

For example, the genomes you provided result in the following plotsr visualisation, suggesting that the query genome might need to be inverted: [image: ref_qry]

After inversion, the results become better: [image: ref_qry_rev]

However, this does not help much in automating the selection of chromosomes that need to be inverted.

tiramisutes commented 2 years ago

I hope it becomes possible to invert a chromosome the way jcvi does, by appending a - to the chromosome ID, with no need to reverse complement and re-align.
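The coordinate arithmetic such a feature would need can be sketched as follows. This is a hypothetical helper, not an existing SyRI or jcvi function, and it assumes 1-based inclusive query coordinates with direction encoded as 1 (forward) or -1 (inverted). A position p on a query of length L maps to L - p + 1 on its reverse complement.

```python
def flip_query_alignment(qry_start, qry_end, qry_dir, qry_len):
    """Re-express one alignment as if the query had been reverse complemented.

    Hypothetical helper: coordinates are assumed 1-based and inclusive, and
    qry_dir is assumed to be 1 (forward) or -1 (inverted).
    """
    new_start = qry_len - qry_start + 1
    new_end = qry_len - qry_end + 1
    return new_start, new_end, -qry_dir
```

Applying the function twice returns the original coordinates, which is a quick sanity check. Whether alignments transformed this way give the same result as re-aligning the flipped sequence is not something this sketch establishes.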

aaannaw commented 1 year ago

@kaede0e @mnshgl0110 Hello, I ran into the same problem, but I cannot manually find the chromosomes that need to be reverse complemented because the strands appear to be the same. Could you give me any suggestions or recommend any software? Looking forward to your reply!

mnshgl0110 commented 1 year ago

You can try https://github.com/schneebergerlab/fixchr