Non-zero exit codes - alignment and consensus stages

xena-maria-luis commented 1 year ago

Hello! I found bolotie while looking for a program to detect recombination and am considering using it as part of my workflow for a research project on SARS-CoV-2. After cloning the repository into my virtual machine, I tested the program with the HIV example data and it worked fine. However, I've run into a couple of errors while trying it out on my own dataset and can't figure out what's going on. I have two versions of my genome dataset: non-aligned (first-stage: clean) and previously aligned (first-stage: consensus).

I'm also not sure what the 'variants' file is - if it refers to the query sequences, the cluster file, or something else. If there's any additional information that could be useful apart from that below, let me know.

The data used as input were as follows (converted to txt for upload here):

Ref.fasta.txt
bolotie_cluster_nextstrain.tsv.txt

The query sequences used were a subset of the GISAID dataset EPI_SET_230703ec.

My commands and response look like this:

(Clean start)

./run.py --input ./SARS/pass_seq_bolotie_not_aligned.fasta --reference ./SARS/Ref.fasta --outdir ./res_SARS_pass --keep_tmp --threads 2 --clusters ./SARS/bolotie_cluster_nextstrain.tsv

Running stages: clean, align, consensus, index, path, parents, plot

================== Cleaning sequence names in all inputs
Done

================== Aligning sequences
command: /home/user/bolotie/bolotie aln -x ./SARS/Ref.fasta -i ./res_SARS_pass/query.fasta -o ./res_SARS_pass/aln -t 2 -l 200
aligning sequences
Non-zero exit code in alignment

(Consensus start)

./run.py --input ./SARS/pass_seq_bolotie.fasta --reference ./SARS/Ref.fasta --outdir ./res_SARS_pass --keep_tmp --threads 2 --first-stage consensus --clusters ./SARS/bolotie_cluster_nextstrain.tsv

Running stages: consensus, index, path, parents, plot

================== Building deduplicated consensus sequences
command: /home/user/bolotie/bolotie cons -x ./SARS/Ref.fasta -i ./res_SARS_pass/aln.vars.csv -o ./res_SARS_pass/cons4tree -m 100 -d
generating consensus sequences
Can't open the variants file
Non-zero exit code in consensus 4 tree
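On the 'variants' file question: judging only from the two commands above, the cons step reads ./res_SARS_pass/aln.vars.csv, which sits under the same ./res_SARS_pass/aln prefix that the aln step writes to. Starting at the consensus stage in an output directory where alignment has not completed would therefore plausibly fail in this way. The following is a minimal Python sketch of a check to run before choosing a first stage. The helper name missing_inputs and the stage-to-file table are assumptions made for illustration and are not part of bolotie; only the two file names visible in the commands above are listed.

```python
from pathlib import Path

# Assumption: these file names are taken from the commands printed in this
# thread (query.fasta for "align", aln.vars.csv for "consensus"). Other
# stages are not covered because their inputs are not shown here.
EXPECTED_INPUTS = {
    "align": ["query.fasta"],
    "consensus": ["aln.vars.csv"],
}


def missing_inputs(outdir, first_stage):
    """Return the expected input files for first_stage that are absent from outdir."""
    out = Path(outdir)
    names = EXPECTED_INPUTS.get(first_stage, [])
    # Keep only the files that do not exist as regular files.
    return [str(out / name) for name in names if not (out / name).is_file()]
```

Calling missing_inputs("./res_SARS_pass", "consensus") returns a list containing the aln.vars.csv path when that file is absent, and an empty list when it is present.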

alevar commented 1 year ago

Hi,

I apologize for the delay in communication. I will be unavailable until August 2nd but will investigate your issue as soon as I am back!

Ales

alevar commented 1 year ago

Hi,

Could you provide the subset of EPI_SET_230703ec you used in your analysis? I would like to replicate the issue and then address it. It is possible that the data contain some edge cases which were not originally accounted for and which need a patch.

Thank you and I hope I can find a fix for your issue so you can successfully run the method.

Thank you,

Ales

xena-maria-luis commented 1 year ago

Hello. Since the query sequence files are too big to directly come through, I've put them up on the link below. Hope this helps. https://drive.google.com/drive/folders/1bg1KKvDnXZOwHhGEpKv1MHnbfrtB0KX6?usp=sharing

Xena.

alevar commented 1 year ago

Hi Xena,

Thank you for providing me with all the data. I was unfortunately unable to replicate the error you've encountered. I used the inputs you provided as follows:

run.py --input ./pass_seq_bolotie.fasta --reference ./Ref.fasta.txt --outdir ./res_SARS_pass --keep_tmp --threads 6 --clusters ./bolotie_cluster_nextstrain.tsv.txt

and

run.py --input ./pass_seq_bolotie_not_aligned.fasta --reference ./Ref.fasta.txt --outdir ./res_SARS_pass --keep_tmp --threads 6 --clusters ./bolotie_cluster_nextstrain.tsv.txt

Both commands worked well and with no errors. The first command found no recombinants while the second one did find 19. Plotting seems to be broken at the moment due to the higher-than-expected number of clades in the input. If you get the method to work on your end and need the plots - I should be able to issue a fix later this week.

May I ask what operating system you are using the software on? And could you try executing commands directly without using the python wrapper script? Namely, could you run the following command:

/home/sparrow/genomicTools/bolotie/bolotie aln -x ./Ref.fasta.txt -i ./res_SARS_pass/query.fasta -o ./res_SARS_pass/aln -t 6 -l 200

I do hope we can resolve your issue and help you run your analysis!

xena-maria-luis commented 1 year ago

Hi.

I'm using the software on Zorin OS Ver. 15.01. It turns out that part of my issue was solved by running on a single thread: with the wrapper script, that worked until the 'build' stage. Without the wrapper script, the aln, cons and build commands would still run, but I couldn't find the output unless I manually created the result folder before running the steps.
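For anyone running the steps directly, the manual folder creation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: run_in_outdir is a hypothetical helper name, and the example command in the comment is the aln command from earlier in this thread with -t set to 1.

```python
import subprocess
from pathlib import Path


def run_in_outdir(cmd, outdir):
    """Create outdir if it does not exist, run cmd, and return its exit code."""
    # Create the result folder first so the step has somewhere to write.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    # check=False: return the exit code instead of raising on failure.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


# Example call (paths copied from the commands in this thread):
# run_in_outdir(
#     ["/home/user/bolotie/bolotie", "aln", "-x", "./SARS/Ref.fasta",
#      "-i", "./res_SARS_pass/query.fasta", "-o", "./res_SARS_pass/aln",
#      "-t", "1", "-l", "200"],
#     "./res_SARS_pass",
# )
```

A return value of 0 means the step exited normally. Any other value corresponds to the "Non-zero exit code" messages printed by the wrapper.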

Currently, the 'find' mode with the command below shows errors both within the wrapper script (Non-zero exit code in path finding) and as an independent command (Killed), and the result file 'paths' is thus empty.

/home/user/bolotie/bolotie find -x ./res_SARS_pass/probmat.probs -i ./res_SARS_pass/cons4prob.fa -p 0.9999 -t 1 -o ./res_SARS_pass/paths -r

If possible, there's an additional version of the analysis that I'd like to perform with the same sequences but a different clustering with fewer cluster groups, and that's also stuck at the same stage (find). I've added the current results and the extra cluster file to the folder with the query sequences, in case that helps with troubleshooting.

alevar commented 1 year ago

Hi Xena,

Thank you for your continued testing of the software and for the examples you provided. The reason for the process being killed by the OS is insufficient memory. Indeed, my earlier changes (increasing the precision of the floating point operations) made the index unreasonably large. I have now reverted to a better solution, which should be precise, yet should not exceed expected memory requirements (4GB in your case). The code has been updated on GitHub.
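As a side note for similar reports, a bare "Killed" message can be told apart from an ordinary failure by looking at the return code. The sketch below assumes a POSIX system and a step launched through Python's subprocess module, where a negative return code -N means the process was terminated by signal N. The function name describe_exit is hypothetical, and SIGKILL is only a hint of memory pressure, not proof of it.

```python
import signal


def describe_exit(returncode):
    """Summarise a subprocess return code in one line (POSIX semantics)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # Negative values mean the child was terminated by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        if -returncode == signal.SIGKILL:
            # Assumption: SIGKILL without user action often comes from the
            # kernel out-of-memory killer; the system log can confirm it.
            return f"killed by {name} (possibly out of memory)"
        return f"terminated by {name}"
    return f"failed with exit code {returncode}"
```

When a command is run from a shell instead, the same event usually shows up as exit status 128 plus the signal number.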

After implementing the change, I tested the data you've provided and can confirm the software runs as expected, reporting multiple putative recombinant genomes for both the GISAID and the other clade assignments. It should be noted, however, that increasing the number of clades in the index further increases computational costs. I was able to analyze all 4245 sequences in your data against the 20-clade index using 6 threads in about 1 hour on my laptop. Using the index with the GISAID clade assignments (6 clades) was significantly faster at ~12 minutes using the same setup.

I added a progress bar to the path finding module to help estimate the time to completion for each run. The plotting has also been updated so that it actually works with a high number of clades, at the cost of repeating colors and removing elements (previously probabilities for all clades were plotted, whereas now only those involved in the putative recombination are plotted).

Hope you are able to run our method and get some exciting findings!

Best,

AS

xena-maria-luis commented 1 year ago

Hi Ales,

I've finally been able to confirm that the method has successfully run as intended with my dataset after the update. Thank you so much for your constant support to get this to work!

Kind regards, Xena.

salzberg-lab / bolotie

Non-zero exit codes - alignment and consensus stages #8