schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

Why is syri stuck in last step and using more than 70% memory of server? #118

Closed farhan-lab closed 2 years ago

farhan-lab commented 2 years ago

I'm running syri to analyse cichlid genomes. I followed all the steps for the nucmer alignments. On my first attempt, syri gave an error about inverted chromosomes, which I managed to fix with the available "Chrrev.py" script. On the second attempt, after fixing the inverted chromosomes, syri started, but according to the syri.log file it seems to be stuck: it ran for many hours until the server's memory usage reached 70%, and I had to kill the process to avoid a system crash. I have also attached a screenshot of the log. Could you please advise whether there is a way to fix the memory usage problem?

Many thanks, Regards

[Attached screenshot: syri_log]

mnshgl0110 commented 2 years ago

Hi. This is a known limitation of syri: in rare cases it can take a lot of time and memory when overlapping translocations and duplications consist of many small alignments. Do cichlid genomes have many large repetitive sequences? That can sometimes cause this problem.

I had not seen this happen for quite some time, so I assumed that the latest heuristics had solved it, but apparently not. How long did it run before you killed it? And how much memory did it use exactly?

You can try decreasing the values of the --tdgaplen and --tdmaxolp options and increasing --unic. This could decrease the runtime and memory usage significantly at the cost of a small decrease in prediction accuracy. Also, please run syri with --log DEBUG. If the issue persists, the more detailed log file will help in diagnosing it.
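As a rough illustration of the suggestion above, the following Python sketch assembles a syri command line with the three tuning options and DEBUG logging set explicitly. The file names and the numeric values are placeholders chosen for the example, not recommendations from this thread, and the input flags (`-c`, `-r`, `-q`, `-F`) should be checked against `syri -h` for the installed version.

```python
import shlex


def build_syri_command(alignments, ref, qry, tdgaplen, tdmaxolp, unic,
                       file_format="T", log_level="DEBUG"):
    """Return a syri argument list with the tuning options set explicitly.

    Only --tdgaplen, --tdmaxolp, --unic and --log come from the reply above;
    the remaining flags are assumed from syri's usual command-line interface.
    """
    return [
        "syri",
        "-c", alignments,        # alignment file (e.g. nucmer coords table)
        "-r", ref,               # reference genome FASTA
        "-q", qry,               # query genome FASTA
        "-F", file_format,       # input format of the alignment file
        "--tdgaplen", str(tdgaplen),
        "--tdmaxolp", str(tdmaxolp),
        "--unic", str(unic),
        "--log", log_level,
    ]


# Placeholder file names and illustrative values only.
cmd = build_syri_command(
    "out.filtered.coords", "refgenome.fa", "qrygenome.fa",
    tdgaplen=100000, tdmaxolp=0.5, unic=2000,
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.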

Best Manish

farhan-lab commented 2 years ago

Hi Manish. Many thanks for your detailed response. Yes, we expect that the cichlid genome has large repetitive sequences.

More importantly, I am comparing two genomes of the same species to identify structural rearrangements. The "qrygenome" contains an extra chromosome, the "B chromosome" (which is highly repetitive), while the "refgenome" does not. Because syri requires the two genomes to have the same number of chromosomes, I deliberately added the same "B chromosome" to my refgenome. I expect that I can later filter the B-chromosome-linked rearrangements in the refgenome out of the output file.

I reran syri with the options you recommended, lowering the values, but the runtime and memory usage are still a problem. Within 2-3 hours, syri reaches 70% of the server's memory and I have to kill the process. In the DEBUG log it seems to be stuck during the "make tree" step.

Many thanks, Best, Farhan
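The post-hoc filtering described above could look roughly like the following Python sketch. It assumes the syri output is tab-separated with the reference chromosome in the first column and the query chromosome in the sixth; that layout, the chromosome name "ChrB", and the toy rows are assumptions made for the example and should be checked against the actual output file.

```python
def drop_chromosome_rows(lines, chrom, ref_col=0, qry_col=5):
    """Yield tab-separated rows that do not involve `chrom` on either genome.

    ref_col and qry_col are assumed column positions of the reference and
    query chromosome names; adjust them if the file layout differs.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(ref_col, qry_col):
            # Keep rows that are too short to interpret rather than guess.
            yield line
            continue
        if fields[ref_col] == chrom or fields[qry_col] == chrom:
            continue
        yield line


# Made-up rows for illustration; they are not real syri results.
rows = [
    "Chr1\t100\t200\t-\t-\tChr1\t110\t210\tSYN1\t-\tSYN\t-",
    "ChrB\t5\t50\t-\t-\tChrB\t5\t50\tDUP2\t-\tDUP\tcopygain",
    "Chr2\t300\t400\t-\t-\tChrB\t10\t110\tTRANS3\t-\tTRANS\t-",
]
kept = list(drop_chromosome_rows(rows, "ChrB"))
```

With a real file, the same generator can be fed an open file handle and its output written line by line to a new file. Note that this only cleans the output; it does not reduce the memory syri needs while running.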

mnshgl0110 commented 2 years ago

Hi Farhan. Which aligner are you using? If you are using nucmer, I would suggest trying minimap2. Nucmer is generally more sensitive but may produce a lot of noise in such regions. Minimap2 alignments might be cleaner and could solve this issue. Please use minimap2 v2.17 or a version later than v2.23.

If you think that the B chromosome is the repetitive one, you may consider removing it from the analysis. However, this would not work if you are interested in the B chromosome itself.

Would it be possible to share the log file, the input genomes and the commands that you are running? That could help me pinpoint the issue and maybe improve the performance.

Best Manish
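For the option of leaving the B chromosome out of the analysis, a minimal Python sketch of dropping one record from a FASTA file before alignment is shown below. The record name "ChrB" and the toy sequences are hypothetical; the same filter would need to be applied to both the reference and the query genome so that the chromosome counts still match.

```python
def drop_fasta_records(lines, names_to_drop):
    """Yield FASTA lines, skipping records whose ID is in names_to_drop.

    The record ID is taken as the first whitespace-delimited token after '>'.
    """
    skipping = False
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            record_id = header.split()[0] if header else ""
            skipping = record_id in names_to_drop
        if not skipping:
            yield line


# Toy input for illustration; real genomes would be streamed from a file.
fasta_lines = [
    ">Chr1 toy record\n", "ACGT\n",
    ">ChrB toy record\n", "TTTT\n", "GGGG\n",
    ">Chr2\n", "CCCC\n",
]
filtered = list(drop_fasta_records(fasta_lines, {"ChrB"}))
```

Because the function works line by line, it can process a large genome by iterating over an open input file and writing each yielded line to an output file, without loading the whole sequence into memory.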