salzberg-lab / bolotie

SARS-CoV-2: detecting recombinations in viruses using large data sets with high sequence similarity

Not getting plots and paths is Empty #6

Open ShadiKhoury opened 2 years ago

ShadiKhoury commented 2 years ago

Hey, I'm having a problem where no plots are produced. I tried to fix it and noticed several things that don't work as expected. Mainly, the paths are empty, which is why nothing is plotted: [image]

Looking at the other outputs: first, paths.fa is empty; second, probmat.probs contains only NaN [image]; the paths used to plot the recombinant are empty [image]; and the paths dict is also empty: [image]

I can't figure out why this is happening. If I run example_hiv, it works.

My inputs: cluster.txt, my_fasta.txt, ref_seq.txt

alevar commented 2 years ago

Hi,

Indeed, the probability matrix should not contain NaN values, and that needs to be corrected in the software. As for why this is happening, it looks like your data does not contain enough information to compute the probabilities. In your input, only one of the sequences (namely "7035967") passes the default filtering parameters (number of ambiguous characters); all other sequences are discarded because they have too many missing positions. To work properly, bolotie requires a good probability matrix in which each clade has sufficient representation.

As for the error codes - I will try adding an assertion to the software that notifies the user (without erroring out) when an error is expected to occur due to insufficient data in the inputs.
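To see ahead of time which sequences are likely to be discarded, one can count ambiguous characters per sequence before running bolotie. The following is only a rough sketch: the function names and the `max_fraction` threshold are made up for illustration, and the threshold is not bolotie's actual default, so treat it as a sanity check rather than a reproduction of the built-in filter.

```python
from io import StringIO

UNAMBIGUOUS = set("ACGT")

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def ambiguous_fraction(seq):
    """Fraction of positions that are not A, C, G or T (N, gaps, IUPAC codes)."""
    if not seq:
        return 1.0
    bad = sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)
    return bad / len(seq)

def split_by_ambiguity(handle, max_fraction=0.05):
    """Split sequence names into (kept, dropped).

    max_fraction is a hypothetical threshold, NOT bolotie's default.
    """
    kept, dropped = [], []
    for name, seq in read_fasta(handle):
        target = kept if ambiguous_fraction(seq) <= max_fraction else dropped
        target.append(name)
    return kept, dropped

if __name__ == "__main__":
    # Tiny in-memory example; replace with open("my_fasta.txt") for real data.
    demo = StringIO(">good\nACGTACGTAC\n>bad\nACNNNNNNAC\n")
    kept, dropped = split_by_ambiguity(demo, max_fraction=0.1)
    print("kept:", kept)
    print("dropped:", dropped)
```

If almost everything ends up in the dropped list, the probability matrix will have very little to work with.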

ShadiKhoury commented 2 years ago

Is there a way to input a previously built probability matrix into the software, although I don't know how accurate that would be? As for not having enough information, can adding more data (sequences) help? And if so, does it have to be from a specific clade only, or can we add multiple clades?

alevar commented 2 years ago

Certainly - in fact that's the intended way to run the method. Even though the method was designed to construct indices fast, it is still a laborious task, and dataset preparation is just as important as the other parameters (perhaps even more so). You could either contact the authors of any recent papers that used bolotie to obtain their indices, or build one yourself and re-use it for other experiments. We also provide an index on the ftp site, which we used in our original analysis. That index is outdated, however, and is unlikely to be representative of the current phylogenetic structure. Please note that whenever you use someone else's index, you have to make sure you also use the same reference genome as they did.

As for your other question about adding more data - yes, that helps. Bolotie requires a large number of sequences for each clade in order to compute a reliable set of probabilities from the data. Each clade needs good representation to provide enough support for the variants that distinguish that clade from the rest.
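A quick way to check representation is to count how many usable sequences each clade has. The sketch below assumes a whitespace-separated cluster file with the sequence ID in the first column and the clade in the second; that layout, the function names and the `min_per_clade` value are assumptions for illustration and may not match your cluster.txt or any threshold bolotie uses.

```python
from collections import Counter

def clade_counts(lines, kept_ids=None):
    """Count sequences per clade.

    Assumes each line is "<sequence_id> <clade>" (hypothetical layout).
    If kept_ids is given, only those IDs are counted, but every clade
    still appears in the result, possibly with a count of 0.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        seq_id, clade = parts[0], parts[1]
        counts[clade] += 1 if kept_ids is None or seq_id in kept_ids else 0
    return counts

def underrepresented(counts, min_per_clade=100):
    """Return clades with fewer than min_per_clade sequences (illustrative cutoff)."""
    return sorted(clade for clade, n in counts.items() if n < min_per_clade)

if __name__ == "__main__":
    # In-memory example; replace with open("cluster.txt") for real data.
    demo = ["s1\tA", "s2\tA", "s3\tB"]
    counts = clade_counts(demo, kept_ids={"s1", "s2"})
    print(dict(counts))
    print(underrepresented(counts, min_per_clade=2))
```

Passing the IDs that survive filtering as `kept_ids` shows which clades are left with little or no support.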