tariks / peakachu

Genome-wide contact analysis using sklearn
MIT License

How to transform our HiC matrix data or processed reads to the training .bedpe file. #10

Open Yulong663 opened 4 years ago

Yulong663 commented 4 years ago
Hi Xiaotao! It is a brilliant idea to introduce a machine learning framework into loop detection.
Recently I have been working with Hi-C data and trying to use Peakachu to identify loops, but I'm confused by the format of the training data: I'm not sure what the seventh column of the .bedpe file is. Could you tell me what the seventh column is, or add some explanation to the corresponding part of the README?
Thanks a lot, and I look forward to your reply.
XiaoTaoWang commented 4 years ago

Hi, the first 6 columns of the bedpe file are just interaction coordinates (https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format). The 7th column is optional and will not be used in the training. Let me know if it's not clear.
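
For anyone else landing here, a minimal training bedpe could look like the example below. The coordinates are invented purely for illustration, and the optional 7th column (a "." placeholder here) is ignored during training:

```
chr1    1250000    1260000    chr1    1500000    1510000    .
chr1    2030000    2040000    chr1    2210000    2220000    .
chr2     880000     890000    chr2    1120000    1130000    .
```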

Yulong663 commented 4 years ago

> Hi, the first 6 columns of the bedpe file are just interaction coordinates (https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format). The 7th column is optional and will not be used in the training. Let me know if it's not clear.

Thanks for the clarification. Another question is how to transform the data into positive and negative training data. Is the positive training data derived from the locations of pre-existing loops identified by other loop callers? Thanks.

tariks commented 4 years ago

You only provide a positive set. Any set of coordinates will be accepted, but ideally this should come from either an experiment different from the one you're training for (it makes less sense to use Hi-C loops to train a Hi-C caller; this probably introduces bias) or a set of high-confidence manually selected interactions. The negative set is automatically generated using random coordinates that follow a distance distribution based on the properties of the training set. Specifically, the negative set contains a set of distance-matched random coordinates and another set of random longer-distance interactions. The former helps to train a model that produces loops similar to those from the positive set's source, and the latter helps to distinguish loops from the general background (fewer false positives at longer range).

Hope that explains how the training set is used.
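
To make the "distance-matched" idea above concrete, here is a rough Python sketch. This is not Peakachu's actual implementation; the function name, resolution, and long-range factor are hypothetical placeholders:

```python
import random

def sample_negatives(positive_pairs, chrom_length, res=10000, long_factor=2):
    """Sketch of negative-set generation: for each positive loop, draw one random
    pair at the same genomic distance (distance-matched) and one at a longer
    distance (background control). Not Peakachu's real code."""
    negatives = []
    for a, b in positive_pairs:          # (start1, start2) bin coordinates, b > a
        d = b - a
        # distance-matched negative: random anchor, same separation as the real loop
        start = random.randrange(0, chrom_length - d, res)
        negatives.append((start, start + d))
        # longer-range negative: helps reject false positives far from the diagonal
        d_long = d * long_factor
        if d_long < chrom_length:
            start = random.randrange(0, chrom_length - d_long, res)
            negatives.append((start, start + d_long))
    return negatives

# Example: two fake positive loops on a 100 Mb chromosome
print(sample_negatives([(1_250_000, 1_500_000), (2_030_000, 2_210_000)], 100_000_000))
```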

Yulong663 commented 4 years ago

> You only provide a positive set. Any set of coordinates will be accepted, but ideally this should come from either an experiment different from the one you're training for (it makes less sense to use Hi-C loops to train a Hi-C caller; this probably introduces bias) or a set of high-confidence manually selected interactions. The negative set is automatically generated using random coordinates that follow a distance distribution based on the properties of the training set. Specifically, the negative set contains a set of distance-matched random coordinates and another set of random longer-distance interactions. The former helps to train a model that produces loops similar to those from the positive set's source, and the latter helps to distinguish loops from the general background (fewer false positives at longer range).
>
> Hope that explains how the training set is used.

Thanks for the clarification on how the positive and negative training sets work. From the methods section of the published paper, my understanding is that the positive training set is a set of coordinates located around loops. And that is what I'm really asking: how should I choose a set of positive training coordinates? Also, what does "a set of high-confidence manually selected interactions" mean? High-confidence for what? (For loops?) Thanks a lot for your prompt reply 👍

Yulong663 commented 4 years ago

> You only provide a positive set. Any set of coordinates will be accepted, but ideally this should come from either an experiment different from the one you're training for (it makes less sense to use Hi-C loops to train a Hi-C caller; this probably introduces bias) or a set of high-confidence manually selected interactions. The negative set is automatically generated using random coordinates that follow a distance distribution based on the properties of the training set. Specifically, the negative set contains a set of distance-matched random coordinates and another set of random longer-distance interactions. The former helps to train a model that produces loops similar to those from the positive set's source, and the latter helps to distinguish loops from the general background (fewer false positives at longer range).
>
> Hope that explains how the training set is used.

By the way, is it a good idea to use the training sets provided in the repository to train my own model?

tariks commented 4 years ago

The positive set will reflect a garbage-in, garbage-out philosophy. Peakachu is intended to receive coordinates determined by another technology (ChIA-PET, etc.) in the same cell type. When we tried manually selecting 200 obvious loops from the same Hi-C map, we got comparable results. Ideally, Peakachu tries to answer the question "can I find loops in experiment X that are similar to experiment Y?" The training sets included in the repo are cell-line specific for the training step, but the resulting model can be applied to any cell type at a similar read depth.
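
As a conceptual illustration of "train on loops from experiment Y, then score a different map X", here is a generic sklearn-style sketch. It is not Peakachu's actual API; the window size, feature extraction, and training data are assumptions made only for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy featurization: flatten a small contact-map window centred on a candidate pixel.
def featurize(window):
    return np.asarray(window).ravel()

# Pretend training data: windows around known loops (positives) and random windows (negatives).
rng = np.random.default_rng(0)
pos = [rng.random((11, 11)) + 1.0 for _ in range(50)]   # stand-ins for loop-centred windows
neg = [rng.random((11, 11)) for _ in range(100)]        # stand-ins for background windows
X = np.array([featurize(w) for w in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The fitted model can then be applied to windows extracted from a *different*
# Hi-C map (e.g. another cell type at a similar read depth).
new_window = rng.random((11, 11))
print(clf.predict_proba(featurize(new_window).reshape(1, -1))[0, 1])
```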

Yulong663 commented 4 years ago

> The positive set will reflect a garbage-in, garbage-out philosophy. Peakachu is intended to receive coordinates determined by another technology (ChIA-PET, etc.) in the same cell type. When we tried manually selecting 200 obvious loops from the same Hi-C map, we got comparable results. Ideally, Peakachu tries to answer the question "can I find loops in experiment X that are similar to experiment Y?" The training sets included in the repo are cell-line specific for the training step, but the resulting model can be applied to any cell type at a similar read depth.

Thanks tariks :)