uio-bmi / graph_peak_caller

ChIP-seq peak caller for reads mapped to a graph-based reference genome
BSD 3-Clause "New" or "Revised" License
18 stars 8 forks source link

Graph peak caller returns negative node ids #10

Closed cgroza closed 3 years ago

cgroza commented 3 years ago

Hi, I am having a strange error with graph_peak_caller when calling create_linear_map: The command is: graph_peak_caller create_linear_map -g graphs/chr20.nobg

INFO:root:Using sequencegraph graphs/chr20.nobg.sequences
2020-10-25 19:16:51,355, INFO: Getting topologically sorted nodes
2020-10-25 19:16:51,356, INFO: 0 nodes processed
2020-10-25 19:16:51,356, INFO: Finding starts and ends
2020-10-25 19:16:51,357, INFO: 0 nodes processed
Traceback (most recent call last):
  File "/home/cgroza/.local/bin/graph_peak_caller", line 11, in <module>
    sys.exit(main())
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/command_line_interface.py", line 36, in main
    run_argument_parser(sys.argv[1:])
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/command_line_interface.py", line 673, in run_argument_parser
    args.func(args)
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/preprocess_interface.py", line 79, in create_linear_map_interface
    create_linear_map(graph, out_name)
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/util.py", line 6, in create_linear_map
    linear_map = LinearMap.from_graph(ob_graph)
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/control/linearmap.py", line 107, in from_graph
    starts = cls.find_starts(graph, node_ids)
  File "/home/cgroza/.local/lib/python3.7/site-packages/graph_peak_caller/control/linearmap.py", line 137, in find_starts
    max_dists[j] = max(cur_dist, max_dists[j])
IndexError: index -246491878 is out of bounds for axis 0 with size 2 

What could be causing this error? This is a graph generated with the latest version of vg construct. I attached the chr20.vg file to this Dropbox link: https://www.dropbox.com/s/jgxby6pkwa5xywi/chr20.vg?dl=0

Any guidance would be greatly appreciated!

ivargr commented 3 years ago

Thanks for posting, and sorry for not responding before -- I'm looking into this now!

ivargr commented 3 years ago

I see that there are some edges that has both from_start True and to_end True, which I think must mean that you have som cyclic stuff going on. Her is an example I found while just briefly skimming the json file,:

{"from": "123715999", "from_start": true, "to": "123716002", "to_end": true}

Could it be something wrong with either vg version 1.27 or the pipeline you are using that makes you end up with a cyclic graph (I assume you havn't intended to have a cyclic graph?) I'm still looking into why graph peak caller crashes, but think maybe this could be the reason.

cgroza commented 3 years ago

Hi, Thank you for the reply!

As far as I know, vg construct is not meant to build cycles in the graph. Moreover, checking for cycles with vg stats -A

cgroza@beluga4:~/.../Epi_10X/graphs $ vg stats -A chr20.vg
acyclic
cgroza@beluga4:~/.../Epi_10X/graphs $

shows that the graph is not cyclic. Could the cycles be introduced along the way during conversion to json/nobg?

Thanks, Cristian

ivargr commented 3 years ago

Hmm, it might not be cycles that are the problem, but the fact that you have edges going from the start of one node to the end of another node. Is that the intention? How did you create the graph? From a vcf and linear reference?

cgroza commented 3 years ago

Yes, I have created the graph using a VCF file, vg construct and a linear reference (hg19). I did not make any choices on the topology, but it is what is currently being done in vg construct.

ivargr commented 3 years ago

Would you be able to share the VCF you used with me so that I can test (feel free to email it if you don't want to share it here)?

cgroza commented 3 years ago

Hi,

I emailed the VCF file for chromosome 20. Did you have any success in replicating the error on your end?

ivargr commented 3 years ago

Sorry for being a bit slow responding here.

I recieved the VCF and I have tried to reproduce the error using vg version 1.27, but strangely I am not able to reproduce the error. Do you specify any parameters to vg construct or do anything else with the graph?

Here is what I am running (I am using a ref.fa containing only chr 20 on hg19):

vg construct -r ref.fa -v chr20.vcf > test.vg
vg view -Vj test.vg > test.json
graph_peak_caller create_ob_graph test.json 
graph_peak_caller create_linear_map -g test.nobg

And that seems to work fine. I see that my test.vg is a little bit different in terms of number of edges than the graph you sent me:

$ vg stats -z test.vg
nodes   2828767
edges   3209054
$ vg stats -z chr20.vg 
nodes   2827356
edges   3202936
cgroza commented 3 years ago

Hi, I think I identified the problem. It has to do with the new graph formats that vg introduced recently. Graphs in the HashGraph format or converted from HashGraph to VG-Protobuf using vg convert have the same error. On the other hand, graphs that are produced directly as VG-Protobuf are working fine.

Here is the kicker: When joining the node id-space using vg ids -j, it converts the format from VG-Protobuf to HashGraph. The test graph I submited here was constructed using vg construct then had its node space joined with the rest of the chromosomes. I converted it back to VG-Protobuf.

It seems the conversion process introduces weird things in the graph that makes GraphPeakCaller crash. Reproduce by:

vg construct -r ref.fa -v chr20.vcf > test.vg
# Protobuf
vg stats -F test.vg
# now join the node id space, as we need to do for a whole genome graph
vg ids -j test.vg
# format changed!
vg stats -F test.vg
vg view -j test.vg > test.json
graph_peak_caller create_ob_graph test.json 
graph_peak_caller create_linear_map -g test.nobg

Converting back to VG-Protobuf after vg ids -j does not help. The graph seems altered permanently.

glennhickey commented 3 years ago

I see that there are some edges that has both from_start True and to_end True, which I think must mean that you have som cyclic stuff going on. Her is an example I found while just briefly skimming the json file,:

{"from": "123715999", "from_start": true, "to": "123716002", "to_end": true}

@cgroza While this is not indicative of a cycle (or even an inversion), I'm guessing these backwards-represented edges are the issue because graph-peak-caller is not expecting them. This should be easy enough to fix as {"from": "123715999", "from_start": true, "to": "123716002", "to_end": true} is equivalent to {"from": "123716002", "to": "123715999"}.

I agree it's a bit annoying that the canonical edge orientation in the new vg formats isn't necessarily consistent with what it used to be (ie forward when possible). Will check if there's something to be done, but for us (in the vg/bidrected sequence graph world) the two above representations are equivalent.

On a somewhat related note: vg now has an official Python API which can read the binary graph formats natively.

ivargr commented 3 years ago

Ah, thanks a lot you two for figuring this out!

I've pushed a fix to pyvg to handle this format change now. @cgroza, could you try do the following (you wont need to update graph peak caller, only pyvg which is a library used by graph peak caller):

# uninstall current pyvg
pip3 uninstall pyvg
# install latest version I pushed now
pip3 install pyvg==1.1.3

After doing that, you will need to create the graph again with graph_peak_caller create_ob_graph ... and then create_linear_map should work (it worked when I tested just now).

And we should definetely change to using vg's official Python API in the future! :)

cgroza commented 3 years ago

Yes, I will update this issue as soon as I try it. Thank you both for fixing this bug!