uio-bmi / graph_peak_caller

ChIP-seq peak caller for reads mapped to a graph-based reference genome
BSD 3-Clause "New" or "Revised" License
18 stars 8 forks source link

Understanding max_paths.intervalcollection structure #3

Closed RicciAle closed 5 years ago

RicciAle commented 5 years ago

Hello everyone! I've performed the same analysis as the one in this link:

[(https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial)]

Now, I don't understand very much what each row of a max_paths.intervalcollection file contains. Can you please help me? I'm attaching one of them in case you need it!

chr4_max_paths.intervalcollection.zip

Thank you in advance and sorry if there are errors in this issue!

ivargr commented 5 years ago

Hi!

Thanks for asking!

The chr*_max_paths.intervalcollection files are the resulting "max paths" through your final peaks. So if you ended up with such files, it means you sucessfully have called peaks. They are max paths in the sense that they are the estimated most likely paths going through the peak subgraphs.

The format is basically a JSON format representing the start offset, end offset and nodes in the max path. For instance, your first peak is:

{"start": 5, "end": 27, "region_paths": [4916842, 4916843, 4916844, 4916845, 4916846, 4916847, 4916848, 4916849, 4916850, 4916851, 4916852, 4916853, 4916854, 4916855, 4916856, 4916857, 4916858, 4916859, 4916860, 4916861, 4916862, 4916863, 4916865, 4916866, 4916867, 4916868, 4916869, 4916870, 4916871, 4916872, 4916873, 4916874, 4916875, 4916876, 4916877, 4916878], "direction": 1, "average_q_value": 108.40100382329935, "unique_id": "None", "is_ambigous": 0, "is_diff": 0}

This means that the peak starts on node 4916842 with offset 5 and follows all the nodes listed, ending at offset 27 on node 4916878.

As you probably have noted, this file in itself is not really useful unless you want to continue analysing these peaks on the graph or convert it in any way. If you want to analyse these peaks on the graph, you can read the file with the following Python code

from graph_peak_caller.peakcollection import PeakCollection
peaks = PeakCollection.from_file("chr4_max_paths.intervalcollection", text_file=True)
for peak in peaks.intervals:
    print(peak.start_position.offset, peak.end_position.offset, peak.region_paths)

You can also project these graph peaks to a linear reference genome, see the Output section here (https://github.com/uio-bmi/graph_peak_caller/blob/master/README.md) for instructions. You will basically need to find the reference path through your graph and project the peaks onto that one.

Don't hesitate to let me know if you meet any problems or still are confused!

RicciAle commented 5 years ago

Thank you for the explanation, it was very clear!