mskilab-org / fragCounter

GC and mappability corrected fragment coverage for paired end whole genome sequencing
MIT License
7 stars 11 forks source link

multicoco going wrong #3

Open tlesluyes opened 4 years ago

tlesluyes commented 4 years ago

Hi. There is an issue in the multicoco function where indices for bigger bins are wrongly generated.

In the details: genomic regions are divided into bigger bins (I picked 1000bp windows so I end up with 100,000bp bins). The tricky part is that: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L77 defines a new columns (lev1) where bigger bins are referenced with indices. At that point, the first 100 lines of cov.dt (that is chr1:1-100000) have a value of 2 and the next 100 lines (chr1:100001-200000) have a value of 3. The problem is that the first 100 lines of chromosome 2 also have a value of 3, so those are wrong at that stage because they are shared by multiple chromosomes. A few lines later: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L97 mean values are computed based on this number. At that point, my first value is fine (mean(cov.dt$reads[1:100])==tmp.cov$reads[1]) but the second value represent the mean between chr1:100001-200000 and chr2:1-100000 (mean(cov.dt$reads[c(101:200,which(cov.dt$seqnames==2)[1:100])])==tmp.cov$reads[2]), which makes sense because it does the job but the indices are not correct. The third value is the mean between chr1:200001-300000, chr2:100001-200000 and chr3:1-100000, an so forth for next indices. Also, tmp.cov contains way less lines than expected as it should be total/100 but only is 2473 (which nearly correspond to chr1 size since a single bin is 100,000bp). Since multiple chromosomes are picked for a single index, chromosomes names and IRanges are also wrong in tmp.cov.

This error can be corrected by unsing the commented line: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L73 (instead of the one indicated above). This way, lev1 values are different and are not identical between chromosomes. The mean values are then correctly computed and tmp.cov contains all the chromosomes (not only chr1).

Can you check, confirm and correct that?

imielinski commented 4 years ago

Thanks – we will look into this

From: tlesluyes notifications@github.com Reply-To: mskilab/fragCounter reply@reply.github.com Date: Tuesday, January 21, 2020 at 9:34 AM To: mskilab/fragCounter fragCounter@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [mskilab/fragCounter] muticoco going wrong (#3)

Hi. There is an issue in the multicoco function where indices for bigger bins are wrongly generated.

In the details: genomic regions are divided into bigger bins (I picked 1000bp windows so I end up with 100,000bp bins). The tricky part is that: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L77https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L77__;Iw!!C6sPl7C9qQ!HltWktHjF7GaKdCnz2diq1wcTv3rbTnfSgVzC39fF6L4WNp9p02TrtGK3mvrpvR_vkY$ defines a new columns (lev1) where bigger bins are referenced with indices. At that point, the first 100 lines of cov.dt (that is chr1:1-100000) have a value of 2 and the next 100 lines (chr1:100001-200000) have a value of 3. The problem is that the first 100 lines of chromosome 2 also have a value of 3, so those are wrong at that stage because they are shared by multiple chromosomes. A few lines later: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L97https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L97__;Iw!!C6sPl7C9qQ!HltWktHjF7GaKdCnz2diq1wcTv3rbTnfSgVzC39fF6L4WNp9p02TrtGK3mvrK9TNThM$ mean values are computed based on this number. At that point, my first value is fine (mean(cov.dt$reads[1:100])==tmp.cov$reads[1]) but the second value represent the mean between chr1:100001-200000 and chr2:1-100000 (mean(cov.dt$reads[c(101:200,which(cov.dt$seqnames==2)[1:100])])==tmp.cov$reads[2]), which makes sense because it does the job but the indices are not correct. The third value is the mean between chr1:200001-300000, chr2:100001-200000 and chr3:1-100000, an so forth for next indices. Also, tmp.cov contains way less lines than expected as it should be total/100 but only is 2473 (which nearly correspond to chr1 size since a single bin is 100,000bp). Since multiple chromosomes are picked for a single index, chromosomes names and IRanges are also wrong in tmp.cov.

This error can be corrected by unsing the commented line: https://github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R#L73https://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/blob/575af9926e5177a39b45a31ad37048953a680ca4/R/fragCounter.R*L73__;Iw!!C6sPl7C9qQ!HltWktHjF7GaKdCnz2diq1wcTv3rbTnfSgVzC39fF6L4WNp9p02TrtGK3mvrDKsIwGU$ (instead of the one indicated above). This way, lev1 values are different and are not identical between chromosomes. The mean values are then correctly computed and tmp.cov contains all the chromosomes (not only chr1).

Can you check, confirm and correct that?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https:/github.com/mskilab/fragCounter/issues/3?email_source=notifications&email_token=ABFUFY6JUGBRKOIQP6IGMQTQ64BWNA5CNFSM4KJUOFLKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IHU2MWQ__;!!C6sPl7C9qQ!HltWktHjF7GaKdCnz2diq1wcTv3rbTnfSgVzC39fF6L4WNp9p02TrtGK3mvrrpNK9bw$, or unsubscribehttps://urldefense.com/v3/__https:/github.com/notifications/unsubscribe-auth/ABFUFY2KHGIHZB6GKIBTR53Q64BWNANCNFSM4KJUOFLA__;!!C6sPl7C9qQ!HltWktHjF7GaKdCnz2diq1wcTv3rbTnfSgVzC39fF6L4WNp9p02TrtGK3mvrY9KNdek$.


This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.