GC + mappability corrections

tlesluyes commented 4 years ago

Hi. I’m not 100% sure whether this is an actual issue but I have some questions regarding the gc and mappability corrections.

The first question is: why don’t you consider using a single loess correction instead of two? That would be a single loess(reads~gc+map,data=whatever) instead of two loess(reads~covariate,data=whatever). Any statistical advices/advantages doing this? https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L208
The second questions is about the correction factors that are computed right after the segmentation: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L115

So, those factors are computed early but applied later, during covariate corrections. The first covariate is GC, where you: 1) subsample bins to get 50,000 of them (that is x2s), 2) correct those with pre-computed factors, 3) try to fit a loess regression on those, 4) try to fit a loess regression on the entire dataset (that is x2) if the previous one fails and 5) apply correction factors for GC. Then the same process is performed for the mappability. Is this correct? Does that imply that only subsampled bins are actually corrected for pre-computed factors at: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L206 Because this seems to only be performed on the subset (x2s) and not on the whole dataset (x2). So, if the regression fails at that stage, then the other one that is performed in the entire dataset uses non-corrected read values because x2 has not been adjusted. If things go really wrong, then the two fits will fail fot the two covariates and coverages will never be adjusted with these factors. But what if things go right, are coverage values actually corrected twice? This correction is inside the loop that iterates for each covariate so it seems like it’s applied everytime. Am I reading this correctly and is this wanted? Would that make sense to adjust read coverages for the entire dataset with these pre-conputed factors first (a single time) and then correct for covariates (no matter if you use a single or two corrections and if the regression fails with subsampled regions)?

imielinski commented 4 years ago

Thanks - it’s been a while since this code was actually written so I will try my best

The iterative loess code is actually adapted from Gavin Ha’s HMMcopy package – agreed multivariate loess might be more appropriate though I imagine sparsity would be the main issue. ie each multi-dimensional bin might have too few data points. Loess can get unstable in my experience and then can give very odd results. Doing this kind of iterative “marginal” correction is perhaps safer

But short answer yes the numbers first get adjusted for GC, then the adjusted numbers get re-adjusted for mappability. In my experience the mappability correction does not do much.

I agree that the tryCatch is a bit funky, would be better for the code to just fall on its sword if there is an error.

BTW I’m loving the detailed and thoughtful code review – can we hire you???

From: tlesluyes notifications@github.com Reply-To: mskilab/fragCounter reply@reply.github.com Date: Thursday, January 23, 2020 at 7:05 AM To: mskilab/fragCounter fragCounter@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [mskilab/fragCounter] GC + mappability corrections (#4)

Hi. I’m not 100% sure whether this is an actual issue but I have some questions regarding the gc and mappability corrections.

The first question is: why don’t you consider using a single loess correction instead of two? That would be a single loess(reads~gc+map,data=whatever) instead of two loess(reads~covariate,data=whatever). Any statistical advices/advantages doing this? https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L208 https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L208__;Iw!!C6sPl7C9qQ!ENqjISQQbNF5sHnfNN0KFj0j8x9XjJy0DHUxdpqBfTAwJYYkhYx1DNVmL-6-JqAOZ8s$
The second questions is about the correction factors that are computed right after the segmentation: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L115 https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L115__;Iw!!C6sPl7C9qQ!ENqjISQQbNF5sHnfNN0KFj0j8x9XjJy0DHUxdpqBfTAwJYYkhYx1DNVmL-6-Re5MwDA$

So, those factors are computed early but applied later, during covariate corrections. The first covariate is GC, where you: 1) subsample bins to get 50,000 of them (that is x2s), 2) correct those with pre-computed factors, 3) try to fit a loess regression on those, 4) try to fit a loess regression on the entire dataset (that is x2) if the previous one fails and 5) apply correction factors for GC. Then the same process is performed for the mappability. Is this correct? Does that imply that only subsampled bins are actually corrected for pre-computed factors at: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L206 https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L206__;Iw!!C6sPl7C9qQ!ENqjISQQbNF5sHnfNN0KFj0j8x9XjJy0DHUxdpqBfTAwJYYkhYx1DNVmL-6-E6E32aI$ Because this seems to only be performed on the subset (x2s) and not on the whole dataset (x2). So, if the regression fails at that stage, then the other one that is performed in the entire dataset uses non-corrected read values because x2 has not been adjusted. If things go really wrong, then the two fits will fail fot the two covariates and coverages will never be adjusted with these factors. But what if things go right, are coverage values actually corrected twice? This correction is inside the loop that iterates for each covariate so it seems like it’s applied everytime. Am I reading this correctly and is this wanted? Would that make sense to adjust read coverages for the entire dataset with these pre-conputed factors first (a single time) and then correct for covariates (no matter if you use a single or two corrections and if the regression fails with subsampled regions)?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/issues/4?email_source=notifications&email_token=ABFUFY3C4MHXQ6LP4REL7ADQ7GBXTA5CNFSM4KKVLTN2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IIHEVIQ__;!!C6sPl7C9qQ!ENqjISQQbNF5sHnfNN0KFj0j8x9XjJy0DHUxdpqBfTAwJYYkhYx1DNVmL-6-ynaIk40$, or unsubscribehttps://urldefense.com/v3/__https:/github.com/notifications/unsubscribe-auth/ABFUFY65QRH7WRPMBXNXEKLQ7GBXTANCNFSM4KKVLTNQ__;!!C6sPl7C9qQ!ENqjISQQbNF5sHnfNN0KFj0j8x9XjJy0DHUxdpqBfTAwJYYkhYx1DNVmL-6-JgfPHrY$.

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

tlesluyes commented 4 years ago

No problem, happy to help! I wanted to have a clear understanding of what fragCounter and Dryclean actually perform as I have a strong interest for cleaning signal from CNAs. I'm glad my review will somewhat improve those tools. :)

PS: I really enjoy my current position so I'm afraid you cannot hire me (yet?) ;)

BW

mskilab-org / fragCounter

GC + mappability corrections #4