uio-bmi / graph_peak_caller

ChIP-seq peak caller for reads mapped to a graph-based reference genome
BSD 3-Clause "New" or "Revised" License
19 stars 8 forks source link

create_ob_graph #12

Closed sdws1983 closed 1 year ago

sdws1983 commented 1 year ago

Hi!

I'm following the pipeline to create the ob-graphs for each chromosome separately. When I was running graph_peak_caller create_ob_graph index/1.json, the following error occurred:

:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject 2022-12-26 18:48:04,351, INFO: Creating obgraph (graph and sequencegraph) 2022-12-26 18:48:04,351, INFO: Creating ob graph from json file 2022-12-26 19:12:28,226, WARNING: Node id 1 is not int. Converting to int when creating graph. 2022-12-26 19:12:28,226, WARNING: Node id 1 is not int. Converting to int when creating graph. 2022-12-26 19:12:49,112, INFO: Min node: 1, Max node: 64099746 2022-12-26 19:12:49,112, INFO: Reading from json 2022-12-26 19:45:28,785, INFO: Creating numpy adj lists 2022-12-26 19:50:26,056, INFO: Writing ob graph to file 2022-12-26 19:50:26,057, INFO: Saving to numpy format 2022-12-26 19:52:54,918, INFO: Graph saved to index/1.nobg 2022-12-26 19:52:54,918, INFO: Creating sequence graph Traceback (most recent call last): File "/share/home/miniconda3/bin/graph_peak_caller", line 33, in sys.exit(load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')()) File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main run_argument_parser(sys.argv[1:]) File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser args.func(args) File "/share/home/software/graph_peak_caller/graph_peak_caller/preprocess_interface.py", line 61, in create_ob_graph sequence_graph_v2 = obg.SequenceGraphv2.create_empty_from_ob_graph(ob_graph) File "/share/home/miniconda3/lib/python3.9/site-packages/offsetbasedgraph/sequencegraphv2.py", line 46, in create_empty_from_ob_graph sequence_array = np.array(["X" * size for size in ob_graph.blocks._array], dtype=str) numpy.core._exceptions._ArrayMemoryError: Unable to allocate 15.3 TiB for an array with shape (64099747,) and data type

This is just a pan-genome graph of one of the chromosomes, created according to the cactus-minigraph pipeline. If my understanding is correct, it seems that my graph is too complicated for numpy to create such a large data array?

If this is indeed the reason, how can I run this step successfully?

Looking forward to your reply.

Yumin

ivargr commented 1 year ago

Hi!

I'm afraid the code for creating the graphs is a bit outdated now, but might still work as long as the graphs are somewhat "nice". I think the problem here is that you have one (or several) very big node (size 65507), and the graph creation code tries to create a NumPy array with enough space assuming every node can be this big (this is a bit naive). So one solution would be to make sure that your graph has smaller nodes. Theres an option in vg construct for this, and maybe possible to do with a vg command after the graph has been created.

That said, I've been wanting to do some fixes/updates to use another graph library that would handle this much better. If splitting the nodes doesn't solve this issue for you, I'll try to see if I can update the code. So let me know how it goes :)

ivargr commented 1 year ago

It might be that you are missing paths. Would you be able to share the Chr1.prune.vg file with me? Then I could look into it.

tir. 3. jan. 2023 kl. 07:33 skrev Yumin Huang @.***>:

Hi!

I'm afraid the code for creating the graphs is a bit outdated now, but might still work as long as the graphs are somewhat "nice". I think the problem here is that you have one (or several) very big node (size 65507), and the graph creation code tries to create a NumPy array with enough space assuming every node can be this big (this is a bit naive). So one solution would be to make sure that your graph has smaller nodes. Theres an option in vg construct for this, and maybe possible to do with a vg command after the graph has been created.

That said, I've been wanting to do some fixes/updates to use another graph library that would handle this much better. If splitting the nodes doesn't solve this issue for you, I'll try to see if I can update the code. So let me know how it goes :)

Yes, I made some modifications for my vg files like this:

~/software/vg mod -t 80 -X 256 Chr1.vg > Chr1.mod.vg ~/software/vg prune -t 80 -k 45 Chr1.mod.vg > Chr1.prune.vg ~/software/vg view --threads 80 -j Chr1.prune.vg > index/1.json graph_peak_caller create_ob_graph index/1.json ~/software/vg stats --threads 80 -r Chr1.prune.vg | awk '{print $2}' > index/node_range_1.txt

And it seems okay now.

But now some new problems have emerged, I'd like to find the linear path with the command: graph_peak_caller find_linear_path -g index/1.nobg index/1.json Chr1 index/1_linear_pathv2.interval

The following errors were raised:

  • graph_peak_caller find_linear_path -g index/1.nobg index/1.json Chr1 index/1_linear_pathv2.interval :228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject DEBUG:root:Reading Offset Based Graph from file: index/1.nobg INFO:root:Using sequencegraph index/1.nobg.sequences Traceback (most recent call last): File "/share/home/miniconda3/bin/graph_peak_caller", line 33, in sys.exit(load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')()) File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main run_argument_parser(sys.argv[1:]) File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser args.func(args) File "/share/home/software/graph_peak_caller/graph_peak_caller/analysis/analysis_interface.py", line 333, in find_linear_path linear_path = create_linear_path(graph, vg_graph, File "/share/home/software/graph_peak_caller/graph_peak_caller/analysis/util.py", line 71, in create_linear_path ref_path = obg.NumpyIndexedInterval.from_interval(linear_paths[path_name]) KeyError: 'Chr1'

It seems my path name is mismatched with the graph file. So I wonder what the path names that need to be matched. I added two lines of commands (indicated by arrows) for the /share/home/software/graph_peak_caller/graph_peak_caller/analysis/util.py python script like this, so it can output the path name:

[image: @.*** https://user-images.githubusercontent.com/18738943/210308083-be0046f7-4277-4d83-addf-6c6291bd5e78.png

Then I rerun the graph_peak_caller find_linear_path, but I found it output nothing but [] {}.

So, it seems to be due to the lack of paths in the json file? Could it be something wrong with my vg file?

— Reply to this email directly, view it on GitHub https://github.com/uio-bmi/graph_peak_caller/issues/12#issuecomment-1369433929, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADHBQWYR6CJOOZSKJN3KQALWQPB4JANCNFSM6AAAAAATJUI36Y . You are receiving this because you commented.Message ID: @.***>

sdws1983 commented 1 year ago

Sorry, I deleted the previous comment, because I seem to have solved that problem, it is indeed a problem with vg prune that the path is not included in the output vg file, so I deleted this step. After that, I ran the following command without reporting an error like previous one:

~/software/vg mod -t 80 -X 256 Chr8.vg > Chr8.mod.vg
~/software/vg view --threads 80 -j Chr8.mod.vg > index/8.json
graph_peak_caller create_ob_graph index/8.json
graph_peak_caller find_linear_path -g index/8.nobg index/8.json "A#1#Chr8" index/8_linear_pathv2.interval

But another error occurred:

<frozen importlib._bootstrap>:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
DEBUG:root:Reading Offset Based Graph from file: index/8.nobg
INFO:root:Using sequencegraph index/8.nobg.sequences
2023-01-03 17:44:19,016, INFO: Reading vg graph from json file index/8.json
2023-01-03 17:51:11,730, INFO: Done reading vg graph
Path name: A#1#Chr8
Path name: B#1#Chr8
Path name: C#1#Chr8
Path name: D#1#Chr8
Path name: E#1#Chr8
Path name: F#1#Chr8
2023-01-03 17:51:55,645, INFO: Creating indexes for indexed interval
2023-01-03 17:51:56,038, WARNING: Region paths in linear path is not sorted asc. Linear path can be buggy.
2023-01-03 17:51:56,038, WARNING: Region paths in linear path is not sorted asc. Linear path can be buggy.
2023-01-03 17:51:56,044, WARNING: Max node in path is not the last region path (last is 5806110 and max is 5625595). Path is not sorted.
2023-01-03 17:51:56,044, WARNING: Max node in path is not the last region path (last is 5806110 and max is 5625595). Path is not sorted.
2023-01-03 17:51:56,045, WARNING: Min node is not the first node. First node is 346 and min is -5806046
2023-01-03 17:51:56,045, WARNING: Min node is not the first node. First node is 346 and min is -5806046
2023-01-03 17:51:56,128, INFO: Interval length: 74566169
2023-01-03 17:51:56,128, INFO: Interval end offset: 64
2023-01-03 17:51:56,128, INFO: Interval end nnode size: 64
2023-01-03 17:51:56,128, INFO: Distance to last node end: 102689836
Traceback (most recent call last):
  File "/share/home/miniconda3/bin/graph_peak_caller", line 33, in <module>
    sys.exit(load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')())
  File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main
    run_argument_parser(sys.argv[1:])
  File "/share/home/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser
    args.func(args)
  File "/share/home/software/graph_peak_caller/graph_peak_caller/analysis/analysis_interface.py", line 333, in find_linear_path
    linear_path = create_linear_path(graph, vg_graph,
  File "/share/home/software/graph_peak_caller/graph_peak_caller/analysis/util.py", line 73, in create_linear_path
    ref_path = obg.NumpyIndexedInterval.from_interval(linear_paths[path_name])
  File "/share/home/miniconda3/lib/python3.9/site-packages/offsetbasedgraph/indexedinterval.py", line 259, in from_interval
    assert index_positions[-1] < interval.length(), \
AssertionError: Interval is length 74566169, but distance to last node end is 102689836

Is this caused by vg mod changing the number of nodes? How to solve it?

Yumin

ivargr commented 1 year ago

I think the issue here is that the node-space of your graph/paths is not sorted, which unfortunately graph peak caller requires.

Could you try to add -c to the vg mod command to tell vg to sort the id-space in the first command you pasted above, i.e.:

vg mod -t 80 -X 256 -c Chr8.vg > Chr8.mod.vg

(I have little experience with using graphs built with cactus, and I am a bit unsure whether they will work in the end, but hopefully we will be able to get this to work :))

sdws1983 commented 1 year ago

I think the issue here is that the node-space of your graph/paths is not sorted, which unfortunately graph peak caller requires.

Could you try to add -c to the vg mod command to tell vg to sort the id-space in the first command you pasted above, i.e.:

vg mod -t 80 -X 256 -c Chr8.vg > Chr8.mod.vg

(I have little experience with using graphs built with cactus, and I am a bit unsure whether they will work in the end, but hopefully we will be able to get this to work :))

I added -c, but the errors like the previous ones still happened, including the warnings:

<frozen importlib._bootstrap>:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
2023-01-03 20:17:43,599, WARNING: Node id 1833455 is not int. Converting to int when creating graph.
<frozen importlib._bootstrap>:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
DEBUG:root:Reading Offset Based Graph from file: index/8.nobg
INFO:root:Using sequencegraph index/8.nobg.sequences
2023-01-03 20:37:25,199, WARNING: Region paths in linear path is not sorted asc. Linear path can be buggy.
2023-01-03 20:37:25,207, WARNING: Max node in path is not the last region path (last is 5805800 and max is 5618087). Path is not sorted.
2023-01-03 20:37:25,208, WARNING: Min node is not the first node. First node is 346 and min is -5805736
Traceback (most recent call last):
  File "/share/home/off_huangyumin/miniconda3/bin/graph_peak_caller", line 33, in <module>
    sys.exit(load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')())
  File "/share/home/off_huangyumin/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main
    run_argument_parser(sys.argv[1:])
  File "/share/home/off_huangyumin/software/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser
    args.func(args)
  File "/share/home/off_huangyumin/software/graph_peak_caller/graph_peak_caller/analysis/analysis_interface.py", line 333, in find_linear_path
    linear_path = create_linear_path(graph, vg_graph,
  File "/share/home/off_huangyumin/software/graph_peak_caller/graph_peak_caller/analysis/util.py", line 73, in create_linear_path
    ref_path = obg.NumpyIndexedInterval.from_interval(linear_paths[path_name])
  File "/share/home/off_huangyumin/miniconda3/lib/python3.9/site-packages/offsetbasedgraph/indexedinterval.py", line 259, in from_interval
    assert index_positions[-1] < interval.length(), \
AssertionError: Interval is length 74566169, but distance to last node end is 102622479

Then I test whether -c indeed make changes for vg file, but it seems -c not worked for my vg, the md5 and file size of output file for added -c or not are the same:

(base) [off_huangyumin@s002 pggb]$ ~/software/vg mod -t 80 -X 256 Chr8.vg > Chr8.mod.vg
(base) [off_huangyumin@s002 pggb]$ md5sum Chr8.mod.vg
b045c632962552ee55f82f72f3743ee7  Chr8.mod.vg
(base) [off_huangyumin@s002 pggb]$ ~/software/vg mod -c -t 80 -X 256 Chr8.vg > Chr8.mod.vg
(base) [off_huangyumin@s002 pggb]$ md5sum Chr8.mod.vg
b045c632962552ee55f82f72f3743ee7  ROC22.Chr8.mod.vg

Maybe the vg file has been sorted? Or I ignored other bugs?

Yumin

ivargr commented 1 year ago

Thanks for trying!

It seems a bit strange that -c does not change the graph at all. Would you be able to share the Chr8.vg graph with me so I could have a look?

sdws1983 commented 1 year ago

Thanks for trying!

It seems a bit strange that -c does not change the graph at all. Would you be able to share the Chr8.vg graph with me so I could have a look?

Can you share your email? I will send the file to you :)

ivargr commented 1 year ago

Sure! My email is ivargry@ifi.uio.no :)

sdws1983 commented 1 year ago

Excuse me, I want to check if you have received my email 3 days ago, I'm afraid I didn't send it successfully.

ivargr commented 1 year ago

Hi!

I'm sorry for the late answer! I got your email and have had a look at the graph.

The graph has some properties that is not supported by Graph Peak Caller. A main problem is that paths traverse nodes both ways.

I think I'm afraid that in general, Graph Peak Caller will not support graphs built from multiple sequence alignment. If we fix the issue with nodes being traversed both ways, I think there are likely to be other problems as well.

I see that your graph is built from 6 individuals. I think a good option in your case is to just to linear peak calling against those 6 individuals, and then do some post-processing of the peaks. Maybe something like this:

sdws1983 commented 1 year ago

Thank you so much for your patience with my questions! It's clear you've spent a lot of time on this one which I really admire.

Not long after your response, I tried to use the vg file obtained by the cactus-minigraph process to build an index (the vg that was sent to you earlier was obtained by using the pggb process). The command used is as follows:

~/software/vg mod -t 80 -X 256 -c Chr6.renamed.vg > Chr6.mod.vg
~/software/vg view --threads 80 -j Chr6.mod.vg > index/6.json
graph_peak_caller create_ob_graph index/6.json
~/software/vg stats --threads 80 -r Chr6.mod.vg  | awk '{print $2}' > index/node_range_6.txt
graph_peak_caller find_linear_path -g index/6.nobg index/6.json "A#Chr6" index/6_linear_pathv2.interval

And the program does not seem to report the error that occurred before:

2023-01-18 11:43:00,662, INFO: Creating indexes for indexed interval
2023-01-18 11:43:01,877, WARNING: Region paths in linear path is not sorted asc. Linear path can be buggy.
2023-01-18 11:43:01,894, WARNING: Max node in path is not the last region path (last is 21121312 and max is 11882955). Path is not sorted.
2023-01-18 11:43:02,118, INFO: Interval length: 80599198
2023-01-18 11:43:02,119, INFO: Interval end offset: 250
2023-01-18 11:43:02,119, INFO: Interval end nnode size: 250
2023-01-18 11:43:02,119, INFO: Distance to last node end: 80598948
2023-01-18 11:43:14,338, INFO: Wrote to file index/6_linear_pathv2.interval

I'm not sure if this means my build was successful? I wonder if those warning messages above (Max node in path is not the last region path) are important, or can they be ignored? And does vg mod -c need to be executed?

Yumin

ivargr commented 1 year ago

It seems it built without failing at least, but I'm unfortunately afraid that the graph will not work well. Graph Peak Caller displays a warning about the linear path not being sorted. This will likely cause some problems in the peak calling pipeline, since Graph Peak Caller relies on a sorted linear path that alignments can be projected unto.

So I still think that your best option would maybe be to map to and call peaks on each of the sequences that the graph is built from, and try to postprocess the peaks (merge "duplicates"), if you want to do some kind of graph-based peak calling.