schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

Cannot assign chromosomes automatically. #30

Closed qiuyixmm closed 4 years ago

qiuyixmm commented 4 years ago

Hello, when I run syri, an error message is reported:

   syri - WARNING - starting
   Reading Coords - WARNING - Chromosomes IDs do not match.
   Reading Coords - WARNING - Matching them automatically. For each reference genome, most 
   similar query genome will be selected. Check mapids.txt for mapping used.
   Reading Coords - ERROR - Chr4 in genome B is best match for two chromosomes in genome A. 
   Cannot assign chromosomes automatically.

A and B are not the same species. How can I solve this problem? Thanks!

mnshgl0110 commented 4 years ago

Hi,

If a chromosome in genome B matches two chromosomes in genome A, then the algorithm cannot decide which of them to select as the homologous chromosome. Identifying homologous chromosomes is the first step in finding syntenic regions.
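The ambiguity reported in the error can be illustrated with a minimal sketch. This is not SyRI's code: the thread does not say how SyRI scores similarity, so summed aligned length is assumed here, and the function names and chromosome IDs are hypothetical.

```python
from collections import defaultdict


def best_matches(alignments):
    """Map each reference chromosome to the query chromosome it aligns to most.

    alignments: iterable of (ref_chr, qry_chr, aligned_length) tuples.
    Assumption: "most similar" means the largest summed aligned length.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for ref_chr, qry_chr, aligned_length in alignments:
        totals[ref_chr][qry_chr] += aligned_length
    return {ref: max(per_qry, key=per_qry.get) for ref, per_qry in totals.items()}


def ambiguous_queries(best):
    """Return query chromosomes chosen as best match by more than one reference."""
    claimed = defaultdict(list)
    for ref_chr, qry_chr in best.items():
        claimed[qry_chr].append(ref_chr)
    return {q: sorted(refs) for q, refs in claimed.items() if len(refs) > 1}


# Hypothetical alignment summary: genome A carries Chr4 as two pieces.
example = [
    ("A_Chr1", "Chr1", 1000),
    ("A_Chr4a", "Chr4", 900),
    ("A_Chr4b", "Chr4", 700),
    ("A_Chr4b", "Chr1", 50),
]
print(ambiguous_queries(best_matches(example)))
```

Any query chromosome listed in the result is claimed by more than one reference chromosome, which is the situation in which automatic assignment cannot proceed.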

In your earlier comments, you mentioned that Chr4 is split into two chromosomes in the other species. If that is the case, then you can concatenate the two chromosomes to generate a pseudo Chr4, which will be homologous to Chr4 of genome B, and then do the analysis. Alternatively, you can select one of the two matches as the homologous chromosome, but then you would need to discard the other match.
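One way to build such a pseudo-chromosome is sketched below in plain Python. The record names (`Chr4a`, `Chr4b`), the optional `gap` of N bases, and the helper names are illustrative assumptions; the order and orientation in which the pieces are joined has to be decided by the user, and the genomes would presumably need to be re-aligned against the merged sequence before running the analysis.

```python
def parse_fasta(lines):
    """Parse FASTA text (any iterable of lines) into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def merge_chromosomes(records, parts, new_name, gap=0):
    """Join the sequences named in `parts`, in the given order, into one record.

    gap: number of N bases placed between pieces (0 means plain concatenation).
    The original pieces are dropped from the returned dictionary.
    """
    joined = ("N" * gap).join(records[p] for p in parts)
    merged = {n: s for n, s in records.items() if n not in parts}
    merged[new_name] = joined
    return merged


def format_fasta(records, width=60):
    """Render {name: sequence} as FASTA text with fixed-width sequence lines."""
    out = []
    for name, seq in records.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Usage with hypothetical file names:
# with open("genomeA.fa") as fh:
#     records = parse_fasta(fh)
# merged = merge_chromosomes(records, ["Chr4a", "Chr4b"], "Chr4")
# with open("genomeA.merged.fa", "w") as fh:
#     fh.write(format_fasta(merged))
```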

Best,
Manish

qiuyixmm commented 4 years ago

Hello, thanks for your suggestion! First, I used the same chromosome identifiers in both the A and B genomes. After that, the "Cannot assign chromosomes automatically" error was no longer reported. Unfortunately, a new error appeared:

syri - WARNING - starting
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
File "~/python3/envs/syri/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
File "~/python3/envs/syri/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
File "syri/pyxFiles/synsearchFunctions.pyx", line 359, in syri.pyxFiles.synsearchFunctions.syri
File "syri/pyxFiles/synsearchFunctions.pyx", line 697, in syri.pyxFiles.synsearchFunctions.getSynPath
ValueError: max() arg is an empty sequence
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "~/syri-1.1/syri/bin/syri", line 115, in <module>
    chrlink = startSyri(args)
File "syri/pyxFiles/synsearchFunctions.pyx", line 308, in syri.pyxFiles.synsearchFunctions.startSyri
File "syri/pyxFiles/synsearchFunctions.pyx", line 309, in syri.pyxFiles.synsearchFunctions.startSyri
File "~/python3/envs/syri/lib/python3.5/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
File "~/python3/envs/syri/lib/python3.5/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: max() arg is an empty sequence

Someone else reported the same error in your GitHub issues. Your suggestion there was that the error is caused by the absence of forward alignments for the corresponding chromosomes. So I reversed all the inverted chromosomes and ran syri again, but the error above still occurs. Here is my syri.log file.

2020-04-27 16:47:14,512 - syri - WARNING - <module>:115 - starting
2020-04-27 16:47:14,515 - Reading Coords - INFO - <module>:115 - Reading input from .tsv file
2020-04-27 16:47:14,679 - syri - INFO - <module>:115 - Analysing chromosomes: ['LG01', 'LG02', 'LG03', 'LG04', 'LG05', 'LG06', 'LG07', 'LG08', 'LG09', 'LG10', 'LG11', 'LG12', 'LG13', 'LG14', 'LG15', 'LG16', 'LG17', 'LG18', 'LG19', 'LG20', 'LG21', 'LG22', 'LG23', 'LG24', 'LG25', 'LG26', 'LG27', 'LG28', 'LG29', 'LG30']
2020-04-27 16:47:14,850 - syri.LG01 - INFO - mapstar:44 - LG01 (4968, 11)
2020-04-27 16:47:14,851 - syri.LG01 - INFO - mapstar:44 - Identifying Synteny for chromosome LG01
2020-04-27 16:47:19,804 - syri.LG01 - INFO - mapstar:44 - Identifying Inversions for chromosome LG01
2020-04-27 16:47:21,746 - syri.LG01 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG01
2020-04-27 16:47:29,569 - syri.LG02 - INFO - mapstar:44 - LG02 (3941, 11)
2020-04-27 16:47:29,571 - syri.LG02 - INFO - mapstar:44 - Identifying Synteny for chromosome LG02
2020-04-27 16:47:32,355 - syri.LG02 - INFO - mapstar:44 - Identifying Inversions for chromosome LG02
2020-04-27 16:47:33,170 - syri.LG02 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG02
2020-04-27 16:47:37,185 - syri.LG03 - INFO - mapstar:44 - LG03 (2579, 11)
2020-04-27 16:47:37,185 - syri.LG03 - INFO - mapstar:44 - Identifying Synteny for chromosome LG03
2020-04-27 16:47:38,628 - syri.LG03 - INFO - mapstar:44 - Identifying Inversions for chromosome LG03
2020-04-27 16:47:39,310 - syri.LG03 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG03
2020-04-27 16:47:41,613 - syri.LG04 - INFO - mapstar:44 - LG04 (1362, 11)
2020-04-27 16:47:41,613 - syri.LG04 - INFO - mapstar:44 - Identifying Synteny for chromosome LG04
2020-04-27 16:47:42,194 - syri.LG04 - INFO - mapstar:44 - Identifying Inversions for chromosome LG04
2020-04-27 16:47:42,503 - syri.LG04 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG04
2020-04-27 16:47:43,648 - syri.LG05 - INFO - mapstar:44 - LG05 (1191, 11)
2020-04-27 16:47:43,648 - syri.LG05 - INFO - mapstar:44 - Identifying Synteny for chromosome LG05
2020-04-27 16:47:44,161 - syri.LG05 - INFO - mapstar:44 - Identifying Inversions for chromosome LG05
2020-04-27 16:47:44,419 - syri.LG05 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG05
2020-04-27 16:47:46,292 - syri.LG06 - INFO - mapstar:44 - LG06 (2, 11)
2020-04-27 16:47:46,292 - syri.LG06 - INFO - mapstar:44 - Identifying Synteny for chromosome LG06
2020-04-27 16:47:46,324 - syri.LG06 - INFO - mapstar:44 - Identifying Inversions for chromosome LG06
2020-04-27 16:47:46,362 - syri.LG06 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG06
2020-04-27 16:47:46,480 - syri.LG07 - INFO - mapstar:44 - LG07 (1, 11)
2020-04-27 16:47:46,480 - syri.LG07 - INFO - mapstar:44 - Identifying Synteny for chromosome LG07
2020-04-27 16:47:46,506 - syri.LG07 - INFO - mapstar:44 - Identifying Inversions for chromosome LG07
2020-04-27 16:47:46,542 - syri.LG07 - INFO - mapstar:44 - Identifying translocation and duplication for chromosome LG07
2020-04-27 16:47:46,615 - syri.LG08 - INFO - mapstar:44 - LG08 (0, 11)
2020-04-27 16:47:46,615 - syri.LG08 - INFO - mapstar:44 - Identifying Synteny for chromosome LG08
2020-04-27 16:47:46,654 - syri.LG09 - INFO - mapstar:44 - LG09 (0, 11)
2020-04-27 16:47:46,654 - syri.LG09 - INFO - mapstar:44 - Identifying Synteny for chromosome LG09

qiuyixmm commented 4 years ago

From the output files for the chromosomes that finished, it seems that syri only works well for matched chromosomes. Undoubtedly it performs best within the same species. However, when comparing one species with another, I would also like to obtain results between any pair of chromosomes, not just matched ones, although creating pseudo-chromosomes is an alternative solution. Thanks for your kindness.

mnshgl0110 commented 4 years ago

Hi.

You are indeed correct to say that SyRI works best for matched (homologous) chromosomes. SyRI needs to identify syntenic regions between homologous chromosomes because the structural rearrangements are annotated relative to them. In the absence of syntenic regions, it is not possible to differentiate between rearrangement types (for example, whether an inverted alignment corresponds to an inversion or to an inverted translocation).

In the above log file, we can see that LG08 and LG09 have no forward alignments (LG06 and LG07 have very few). Based on this, I assume that they are not homologous, which results in SyRI crashing. As of now, SyRI is not able to identify rearrangements in the absence of syntenic regions. You can consider filtering out these non-homologous chromosomes, if possible.
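A rough pre-check along these lines could be run on the alignment table before starting SyRI. The sketch below is not part of SyRI. It assumes the 11-column tab-separated layout produced by `show-coords -THrd` (reference start, reference end, query start, query end, two alignment lengths, identity, reference direction, query direction, reference ID, query ID), that a forward alignment has both direction fields equal to 1, and that only rows with identical reference and query IDs count. The function names and the `min_forward` threshold are hypothetical.

```python
import csv
from collections import Counter


def forward_alignment_counts(rows, expected):
    """Count (forward, total) same-ID alignments per chromosome.

    rows: iterable of 11-field records in the assumed show-coords -THrd layout.
    expected: chromosome IDs to report even if they have no alignments at all.
    """
    forward = Counter({c: 0 for c in expected})
    total = Counter({c: 0 for c in expected})
    for row in rows:
        ref_chr, qry_chr = row[9], row[10]
        if ref_chr != qry_chr:
            continue  # only alignments between identically named chromosomes
        total[ref_chr] += 1
        if int(row[7]) == 1 and int(row[8]) == 1:
            forward[ref_chr] += 1
    return {c: (forward[c], total[c]) for c in total}


def chromosomes_to_drop(counts, min_forward=1):
    """List chromosomes with fewer than `min_forward` forward alignments."""
    return sorted(c for c, (fwd, _) in counts.items() if fwd < min_forward)


# Usage with a hypothetical file name:
# with open("out.coords", newline="") as fh:
#     counts = forward_alignment_counts(csv.reader(fh, delimiter="\t"),
#                                       ["LG01", "LG02", "LG08"])
# print(chromosomes_to_drop(counts))
```

Raising `min_forward` would also flag chromosomes that have only a handful of forward alignments; what threshold is sensible depends on the data.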

Also, I think comparing non-homologous chromosomes might not be the best strategy: even though an algorithm might annotate variations between them, there is a high chance that the annotations would reflect random noise rather than actual biological signal.