medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/
Other
137 stars 19 forks source link

2 problems of sibeliaz in my experiment (Bug or not ?) #13

Open liaoherui opened 4 years ago

liaoherui commented 4 years ago

@iminkin Hi, iminkin , thanks a lot for your amazing tool, it plays a very important role in my research !

However, I got 2 problems when I did the experiment with sibeliaz to find the colinear block of a collection of highly similar viral genomes.

Problem1: Sibeliaz fails to run with 5 highly similar viral genomes (Cluster_17_revise.fasta). These 5 viral genomes are clustered by cdhit with an identity cutoff=99%, and I also check their similarity by online blast (>=99% cov and >=99% perc iden). When I run sibeliaz to find the block , it fails, and the running log is:

Problem1

The similar problems occur in other datasets too, and maybe, the high similarity or the small number of genomes will make sibeliaz fail to return the colinea block ? I doubt that.

Problem2: Similarly, I use sibeliaz to find the colinear block of another collection of highly similar viral genomes (still by cdhit clustering with identity=99% ). There are 402 genomes in this dataset (Cluster_0_Raw.fasta) and sibeliaz finishes the job without error. However, some genomes get 0 colinear block and I check the alignment between these genomes and the longest genome (ZKV_184) in this dataset with online blast. It's obvious that there are many colinear blocks between them. These genomes with 0 colinear block is:

ZKV_26, ZKV_58, ZKV_68, ZKV_73, ZKV_74, ZKV_78, ZKV_109, ZKV_145, ZKV_176, ZKV_195, ZKV_388, ZKV_394, ZKV_501, ZKV_765, ZKV_767, ZKV_774, ZKV_778, ZKV_780, ZKV_791

Btw, the parameter of this 2 experiments I used is "sibeliaz -t 6 -k 15 [input_file]". (I also tried the default parameter, but problems still exist...)

The 2 datasets of Problem1 (Cluster_17_revise.fasta) and Problem2 (Cluster_0_Raw.fasta) are uploaded. You can use them for test. I am confused by these 2 problems for a long time ... Test_Data.zip

iminkin commented 4 years ago

Hi @liaoherui ,

Thanks for reporting the issue. I will take a look soon.

liaoherui commented 4 years ago

@iminkin Ok! Thanks a lot !

iminkin commented 4 years ago

@liaoherui ,

1) The crash you reported was caused by a software bug in SibeliaZ. I pushed the fix to the master branch. Please update your installation and check if it works. Thanks for catching it!

2) These results are probably caused by the abundance (-a) parameter being set to value that is too small. Basically, all k-mers that occur more than (-a) times in the input get filtered out and it results in missing blocks if (-a) is not big enough. The default value (150) is optimized for long genomes (e.g. mammals) to filter out transposable elements and other repeats. In your case (many short genomes), please set it to something like 6400 or even higher. In general, to find all the blocks, set it to number_of_genomes highest_multiplicity_of_the_block 2. For more information, please refer to the documentation: https://github.com/medvedevgroup/SibeliaZ#vertices-frequency-threshold. To conclude, please run sibeliaz with parameter -a 6400.

Please let me know if it resolves the issues you reported.

liaoherui commented 4 years ago

@iminkin

The problems are all resolved with your help! That's really wonderful !

Thanks a lot for your help again!

liaoherui commented 4 years ago

@iminkin Sorry to ask another small question....

You mentioned the calculation method about the parameter "-a" is: number_of_genomes*highest_multiplicity_of_the_block * 2

I am a little confused that how can I know the value of "highest_multiplicity_of_the_block".

For example, in this case, there are totally ~400 genomes and it's easy to know the "highest_multiplicity_of_the_block"=8, cause you already told me that the number of -a better be 6400 (then I can get "highest_multiplicity_of_the_block"=6400/400/2=8), right?

So, how you get "highest_multiplicity_of_the_block"=8 before you run this tool. Still a little confused about this part, even I read "Vertices frequency threshold" section....

iminkin commented 4 years ago

Hi @liaoherui ,

Sorry for the confusion. You have to make an assumption about the highest number of instances a block has based on prior knowledge about the input genomes. For example, you may use the (estimated) size of the largest gene family in your genomes, if you are mainly interested in aligning genes. Another approach is to try different values and assess the results. I just tried 6400 and eyeballed the alignments a bit.

Please let me know if I can help you with anything else.

iminkin commented 4 years ago

I will also update the documentation to clarify the logic behind setting the parameters.

liaoherui commented 4 years ago

Ok. It's good to hear that. Thanks for your explaination and debug!

Btw, in my experiment, improving "-a" can make all genomes covered with blocks, but I also find a problem that you mentioned in "Vertices frequency threshold" section. If I set "-a" too high ,then it will make the graph large and convoluted (even in these short viral genomes), and that is a disaster for the downstream analysis. So, I guess the best "-a" for me should be the one that can cover all the input genome and also has a small influence to the result of the experiment.

(Sibeliaz is a very useful tool and I think if it can be relesed to "bioconda"(or conda) then it will be easier for users (espesially who work with servers, no root permission) to install. Or like vg to relese the executable binary file. Just my personal thoughts, and I have to say it's already cool now.

iminkin commented 4 years ago

@liaoherui

I am glad that you find SibeliaZ helpful in your work. We do have plans to improve the usability of our tools.

liaoherui commented 4 years ago

@iminkin Hi, iminkin! Recently, I found another problem about the script "maf_to_gfa1.py".

When I use this script to transform the maf file to gfa file, I found, for some genomes, the final path sequence in the graph is not completely same with their original genome sequence. A simple example is like :

......CATATAGT...... (sequence from original genome) ...... | - | - | . . |...... ......C- T -TCCT...... (sequence from path in the graph)

The program runs well, so I am reading the source code and trying to find out the potential reason why this happens...

But I will be really grateful if you know the possible reason....

iminkin commented 4 years ago

@liaoherui ,

Sorry for the late reply. I will take a look and fix it soon.

liaoherui commented 4 years ago

@liaoherui ,

Sorry for the late reply. I will take a look and fix it soon.

Thanks for your reply! I forget to say that this problem is already fixed by myself. It's a result of reverse strand. I found, when the block contains "-", sequences labeled as "-" are not completely same as original sequence. I make the sequences consistent with original genome by removing "-" sequences. :)

iminkin commented 4 years ago

@liaoherui , could you please elaborate? Do you mean that the bug occurs when FASTA files contain the characters '-'?

liaoherui commented 4 years ago

@iminkin Let me show you the example. image In this screeshot, there is a negative block ("-") of HCV_527. In fact, the original sequence of HCV_527 should be the reverse strand of this negative block.

More specifically, the original sequence of HCV_527 in this region should be : HCV_527 GTGATGCT.......

However, the script ("maf_to_gfa1.py") will encode the negative strand to the gfa file directly, which makes the path sequence in the graph is not completely same as original genome sequence.

iminkin commented 4 years ago

@liaoherui , I see, thanks for reporting, I will fix it.

iminkin commented 3 years ago

@liaoherui ,

Could you please provide the sequences, at least HCV_527, and HCV_528 so that I can exactly reproduce the error and fix it?

liaoherui commented 3 years ago

Ok. These 2 sequences are uploaded. :)

HCV.zip

iminkin commented 3 years ago

@liaoherui ,

Thank you very much! Unfortunately, I could not reproduce the problem on my own. Could you please provide the whole collection if possible, along with the MAF alignment file? Also, how do you convert the GFA back to FASTA? What was the command-line you used to generate the alignment with SibeliaZ?

liaoherui commented 3 years ago

The command-line I used to generate the alignment with SibeliaZ: sibeliaz -t 6 -k 15 HCV_mutation_strain_new.fasta The command-line I used to convert MSA to GFA: python maf_to_gfa1.py sibeliaz_out/alignment.maf HCV_mutation_strain_new.fasta > raw_graph.gfa

I extract the fasta from GFA using my own code, and the basic logic is to connect all nodes from specific path. A simple example: S 1 A S 2 T S 3 G P HCV_1 1+,2+,3+ Then, the fasta of HCV will be : >HCV_1 ATG

The whole collection and MSA alignment file can not be uploaded via Github. Is it ok to send those files (~8M) to your email (ivminkin@gmail.com)?

iminkin commented 3 years ago

@liaoherui yes, feel free to send the files

iminkin commented 3 years ago

BTW, I have a utility script for gluing back GFA into the sequences: https://github.com/medvedevgroup/TwoPaCo/blob/master/src/graphdump/glueGfa1.py

liaoherui commented 3 years ago

Ok. Thanks! Finally, I used this email address: 851836818@qq.com. You can check it.