Multi-allelic vcf may cause error in building index

ASLeonard commented 2 years ago

Hi, I've been trying to build my own index bundle to use with kage, but keep encountering errors along the way. I believe the most recent one is due to the VCF input being multi-allelic. The VCF I was using as input was made through the pipeline suggested by PanGenie, with AF tags added in afterwards.

The error I'm getting is:

variants 2021-12-08 13:58:50,019 INFO: Read 15000000 variants from file (time: 23.229). 618168 variants added
variants 2021-12-08 13:58:50,488 INFO: Stoppinng reading file since limiting to chromosome and now on new chromosome
graph 2021-12-08 13:58:50,588 INFO: 0 variants processed
Traceback (most recent call last):
  ...
.../obgraph-0.0.7-py3.8.egg/obgraph/variants.py", line 157, in get_variant_allele_frequency
    af = float(info_field.split("AF=")[1].split(";")[0])
ValueError: could not convert string to float: '0.0625,0.125,0.125,0.125'
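
The failure can be reproduced outside obgraph. The following is a minimal sketch: the parsing expression is the one shown in the traceback, the AF value is the one from the error message, and everything else (the AN entry in the INFO string and the helper name parse_af_values) is made up for illustration and is not part of obgraph.

```python
def parse_af_values(info_field):
    """Return every AF value in a VCF INFO string as a list of floats.

    Hypothetical helper for illustration only; not part of obgraph.
    """
    for entry in info_field.split(";"):
        if entry.startswith("AF="):
            return [float(value) for value in entry[len("AF="):].split(",")]
    raise ValueError("no AF tag in INFO field")


# Illustrative INFO string: only the AF value is taken from the traceback.
info = "AF=0.0625,0.125,0.125,0.125;AN=16"

# The expression from the traceback assumes one value per record and fails
# as soon as AF holds one value per ALT allele.
try:
    float(info.split("AF=")[1].split(";")[0])
except ValueError as error:
    print("single-value parse failed:", error)

# Parsing AF as a comma-separated list works for both cases.
print(parse_af_values(info))
```

A multi-allelic record carries one AF value per ALT allele, so a parser that expects a single float only works once each record has exactly one ALT allele.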

So is the best way forward to normalise the VCF to be biallelic, or is there a way to handle multi-allelic variants in kage?

thanks, Alex

ivargr commented 2 years ago

Hi!

Thanks for asking! Yes, the index creation step only supports biallelic VCFs for now (since we want to represent each variant using two nodes in the graph), so it would be a good idea to convert the VCF to biallelic before doing anything else. This should have been mentioned in the description.
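
To make "convert the VCF to biallelic" concrete, here is a rough sketch of what the conversion does to a single record. It is an illustration under several assumptions, not the recommended tool: in practice a dedicated tool such as bcftools norm with its --multiallelics option is the usual choice. The function name split_multiallelic and the example record are hypothetical, the sketch keeps only the first eight (site-level) columns and drops genotype columns, and it does not left-align or trim alleles.

```python
def split_multiallelic(line, per_allele_tags=("AF", "AC")):
    """Split one VCF data line into one site-only line per ALT allele.

    Hypothetical helper for illustration. INFO tags listed in
    per_allele_tags are assumed to hold one value per ALT allele.
    Sample (genotype) columns are dropped, because splitting them
    correctly requires remapping allele indices.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    info_entries = fields[7].split(";")
    records = []
    for i, alt in enumerate(alts):
        new_info = []
        for entry in info_entries:
            key, _, value = entry.partition("=")
            values = value.split(",")
            # Keep only the value that belongs to this ALT allele.
            if key in per_allele_tags and len(values) == len(alts):
                entry = f"{key}={values[i]}"
            new_info.append(entry)
        # CHROM, POS, ID, REF + this ALT + QUAL, FILTER + rewritten INFO
        records.append(
            "\t".join(fields[:4] + [alt] + fields[5:7] + [";".join(new_info)])
        )
    return records


# Made-up record; only the AF value is taken from the error message.
line = "1\t1000\t.\tA\tC,G,T,AT\t.\tPASS\tAF=0.0625,0.125,0.125,0.125"
for record in split_multiallelic(line):
    print(record)
```

Each output line has a single ALT allele and a single AF value, which is the shape the two-nodes-per-variant representation expects.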

The pipeline for creating indexes is unfortunately not very well tested or documented for now (I guess you are the first to use it), so I won't be surprised if things aren't straightforward. However, I'm very happy to help you create indexes. Feel free to reach out here or by email (ivargry@ifi.uio.no) if you want some assistance or run into other problems.

ivargr commented 2 years ago

If you want to share the VCF and reference genome you want to create indexes for, I'll be happy to try to create the indexes (this will be useful for debugging and for trying out the pipeline with data other than what we have used until now).

ASLeonard commented 2 years ago

Normalising the VCF helped, but I'm now running into a new error about a recursion limit. Before that, there were many messages about deletion paths not being correct.

dummy_node_adder 2021-12-08 16:26:15,292 INFO: Ignoring deletion path [914] because ref pos at end is not correct
dummy_node_adder 2021-12-08 16:26:15,292 INFO: Ignoring deletion path [361259, 361261, 916] because ref pos at end is not correct
dummy_node_adder 2021-12-08 16:26:15,292 INFO: Ignoring deletion path [914, 915] because ref pos at end is not correct
Traceback (most recent call last):
...
.../obgraph/mutable_graph.py", line 106, in find_nodes_from_node_that_matches_sequence
    result = MutableGraph.find_nodes_from_node_that_matches_sequence(self, possible_next, new_sequence, variant_type, new_nodes_found, all_paths_found)
  [Previous line repeated 986 more times]
.../obgraph-0.0.7-py3.8.egg/obgraph/mutable_graph.py", line 86, in find_nodes_from_node_that_matches_sequence
    if sequence == "":
RecursionError: maximum recursion depth exceeded in comparison
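
The "repeated 986 more times" line is consistent with CPython's default recursion limit of 1000. The sketch below is not obgraph's code: it assumes, based only on the traceback, that the search makes one recursive call per matched node, and it uses a hypothetical toy graph (dicts named nodes and edges) to show how such a search overflows on a long path and how an explicit stack avoids that.

```python
import sys


def match_recursive(nodes, edges, node, sequence, path=()):
    """Follow edges whose node sequences spell out `sequence` (recursive)."""
    if sequence == "":
        return list(path)
    for nxt in edges.get(node, []):
        node_seq = nodes[nxt]
        if sequence.startswith(node_seq):
            result = match_recursive(
                nodes, edges, nxt, sequence[len(node_seq):], path + (nxt,)
            )
            if result is not None:
                return result
    return None


def match_iterative(nodes, edges, start, sequence):
    """Same depth-first search, using an explicit stack instead of recursion."""
    stack = [(start, 0, [])]  # (current node, characters matched, path so far)
    while stack:
        node, offset, path = stack.pop()
        if offset == len(sequence):
            return path
        # Push in reverse so candidates are explored in their original order.
        for nxt in reversed(edges.get(node, [])):
            node_seq = nodes[nxt]
            if sequence.startswith(node_seq, offset):
                stack.append((nxt, offset + len(node_seq), path + [nxt]))
    return None


# Toy linear graph, deliberately longer than the interpreter's recursion limit.
n = 2 * sys.getrecursionlimit()
nodes = {i: "A" for i in range(n + 1)}
edges = {i: [i + 1] for i in range(n)}
target = "A" * n

try:
    match_recursive(nodes, edges, 0, target)
except RecursionError as error:
    print("recursive search failed:", error)

path = match_iterative(nodes, edges, 0, target)
print("iterative search matched", len(path), "nodes")
```

Raising the limit with sys.setrecursionlimit can serve as a stopgap, but very deep recursion can still exhaust the interpreter's stack, so an iterative search is the more robust option if long paths are expected.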

I'll give it another look, but will prepare the VCF to be shared in case you are better able to debug the index creation.

kage-genotyper / kage

Multi-allelic vcf may cause error in building index #2