ijuric / MAPS


Could you describe the content of the file pointed to in $genomic_feat_filepath #14

Open mblanche opened 4 years ago

mblanche commented 4 years ago

Hi, looking at the run_pipeline.sh shell script, I see that you have hard-coded the location of this text file:

'../MAPS_data_files/hg38/genomic_features/F_GC_M_MboI_10Kb_el.GRCh38.txt'

I can't find in your paper what this file is for, what its contents are, or how it's being used.

Could you help us understand?

Thanks

armenabnousi commented 4 years ago

Hello,

The genomic features file contains statistics such as GC content, mappability, and fragment length for each of the bins on the genome. Depending on the genome and the bin size you are using, the pipeline will refer to a different file.
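For reference, a minimal loading sketch in Python (hypothetical: the delimiter and column layout below are assumptions, not the documented MAPS format, so inspect the file header to confirm; the filename F_GC_M_MboI_10Kb_el.GRCh38.txt suggests effective fragment length, GC content, and mappability per 10 kb MboI bin):

```python
import pandas as pd

# Hypothetical sketch: the whitespace delimiter is an assumption --
# check the actual file header for the real column names and layout.
features = pd.read_csv(
    "../MAPS_data_files/hg38/genomic_features/F_GC_M_MboI_10Kb_el.GRCh38.txt",
    sep=r"\s+",
)
print(features.columns.tolist())
print(features.head())
```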

Thanks!

mblanche commented 4 years ago

I see, so these are the same files used by HiCNorm?

What is the fragment length referring to, exactly? Digging into the HiCNorm paper, it seems to be related to the restriction enzyme used. Could you expand on that a bit more? What would be the effect of using a different enzyme, or multiple enzymes, on the linear regression model used for the normalization?

armenabnousi commented 4 years ago

It's the average size of the fragments between two consecutive cut sites, based on the enzyme used. Please refer to figure 1d in this paper: http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Yaffe2011Probabilistic.pdf. I believe that the longer the fragments are, the lower the efficiency of the process. I will have to ask my colleague for a better explanation tomorrow.

ijuric commented 4 years ago

So, we know that the density of RE cut sites is positively correlated with read count, and we try to capture that with the effective fragment length. This is how it's calculated: for a given bin, you find all the RE fragment lengths (an RE fragment length is the distance between two consecutive RE cut sites). Then, for each RE fragment longer than 1000 bp, you replace its length with 1000. Finally, you sum up all the RE fragment lengths; that sum is your effective fragment length.

We truncate at 1000 bp because we expect most reads to fall within 500 bp of the nearest cut site, just due to the nature of the experiment. This means that if the distance between two RE cut sites is 4000 bp (so the fragment length is 4000), most of our reads will be in a 1000 bp region around the first cut site (500 bp to the left and 500 bp to the right) and a 1000 bp region around the second cut site. In between, we expect to see not much (ideally nothing, I think). We would expect to see the same thing if the distance between two RE cut sites were 3000 bp: again, reads would fall only within 1 kb of each cut site, with not much going on in the middle. This means we should treat a fragment length of 4000 the same way as one of 3000, and that is why we truncate all fragments longer than 1 kb to 1 kb.
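For concreteness, here is a minimal Python sketch of that calculation (not the actual MAPS implementation; for simplicity it ignores fragments that straddle bin boundaries):

```python
import numpy as np

def effective_fragment_length(cut_sites, cap=1000):
    # cut_sites: sorted positions of the RE cut sites within one bin.
    # Each RE fragment length (distance between consecutive cut sites)
    # is truncated at `cap` bp, and the truncated lengths are summed.
    frag_lengths = np.diff(np.asarray(cut_sites))
    return int(np.minimum(frag_lengths, cap).sum())

# Toy example: fragments of 4000, 3000, and 400 bp contribute
# 1000 + 1000 + 400 = 2400 after truncation at 1 kb, so the 4000 bp
# and 3000 bp fragments are treated identically, as described above.
print(effective_fragment_length([0, 4000, 7000, 7400]))  # 2400
```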

Using a different RE (or a combination of REs) would change the effective length, and that would change the regression coefficient associated with the effective length, which in turn would affect the expected count. I don't know exactly how it would change it; I guess that would depend on the distribution of effective lengths and the observed counts. (Lame answer, I know. Sorry.) Also, if you use multiple REs, then your effective lengths can get large. I haven't played much with that, since we were using datasets that were cut with only one RE.

mblanche commented 4 years ago

Thanks, that makes sense. So one could modify the genomic features file to accommodate a given restriction digest, right?