sandberg-lab / dataprivacy

GNU General Public License v3.0

Application to bisulfite sequencing data #5

Open GMFranceschini opened 10 months ago

GMFranceschini commented 10 months ago

Thank you for developing this tool; I think it might be very useful for sharing our research data. I want to ask you for advice in applying bamboozle to bisulfite converted data.

Bisulfite treatment introduces mismatches at virtually every unmethylated cytosine in the genome. Alignment is performed either ignoring all Cs or against a bisulfite-converted genome. Ideally, I would like to remove all mismatches that do not involve a C in the reference. Is this possible? Do you have any previous experience in anonymizing bisulfite-converted data? Maybe inserting an N at every affected position in the reference file could work, but I'd like to double-check with you. It would be great to expand the use cases to this data type.
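To make the request concrete, here is a minimal sketch of the filtering rule described above, written on plain strings rather than on BAM records. The function name `revert_non_c_mismatches` and its `converted_base` parameter are hypothetical and not part of bamboozle. It assumes the read and the reference segment are already aligned base by base (no indels or clipping).

```python
def revert_non_c_mismatches(read_seq: str, ref_seq: str, converted_base: str = "C") -> str:
    """Replace every read base with the reference base, except where the
    reference carries the bisulfite-affected base (C by default).

    Hypothetical helper: assumes a gapless, equal-length alignment.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segment must have equal length")
    out = []
    for read_base, ref_base in zip(read_seq.upper(), ref_seq.upper()):
        if ref_base == converted_base:
            # Position may carry methylation information: keep the read base.
            out.append(read_base)
        else:
            # Any other difference is a non-bisulfite mismatch: revert it.
            out.append(ref_base)
    return "".join(out)


# The A>G mismatch at the first base is reverted, the two C>T calls are kept.
print(revert_non_c_mismatches("GTGTTA", "ACGTCA"))
```

For reads aligned to the G>A-converted genome, the same sketch could presumably be called with `converted_base="G"`.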

Thank you in advance for any support you can provide, Best

cziegenhain commented 10 months ago

Hi,

Thanks a lot for your interest and for getting in touch! So far, bamboozle has no specific mode for bisulfite-converted data, and none is planned. One thing you could consider is to apply bamboozle as usual and separately keep a list of all mismatches at C bases. However, for human donors such a list will likely include some real SNPs that are not derived from the bisulfite treatment, and as such it would lead to a partial disclosure of genetic information about the sequenced individual. Whether this would be sufficient to infer, e.g., the identity of a donor in the future is unclear and would require some further risk calculations.
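As a rough illustration of that suggestion, the sketch below resets a read to the reference and records mismatches at reference C positions in a separate list. It is not bamboozle code: the function name, the tuple layout, and the gapless string alignment are all assumptions made for the example, and only forward-strand C positions are handled.

```python
def sanitize_and_log_c_mismatches(read_seq, ref_seq, ref_start, chrom):
    """Return (sanitized_sequence, c_mismatches).

    sanitized_sequence is simply the reference segment, i.e. every
    mismatch is removed. c_mismatches lists (chrom, position, ref, alt)
    for each reference C where the read shows another base.
    Hypothetical helper: assumes equal-length, gapless input.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segment must have equal length")
    ref = ref_seq.upper()
    read = read_seq.upper()
    c_mismatches = []
    for offset, (read_base, ref_base) in enumerate(zip(read, ref)):
        if ref_base == "C" and read_base != "C":
            c_mismatches.append((chrom, ref_start + offset, ref_base, read_base))
    return ref, c_mismatches


sanitized, side_list = sanitize_and_log_c_mismatches("ATGA", "ACGC", 100, "chr1")
print(sanitized)   # reference sequence, all mismatches removed
print(side_list)   # mismatches at C positions, kept separately
```

In this example the side list holds one C>T and one C>A entry. The C>A entry cannot come from bisulfite conversion, and even a C>T entry is ambiguous between a converted cytosine and a real C/T variant, which is exactly the disclosure risk mentioned above.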

Hope this helps as a starter for further investigation on the topic!

All the best, Christoph

GMFranceschini commented 10 months ago

Thank you a lot for taking the time to answer me! I'll see what I can do, I think your idea is a great starting point. If I am not asking too much (and only if you have the time) could you tell me which code lines of the main script check for mismatches? I saw the sections commented for splicing and other steps, but I couldn't understand where SNPs are removed, which is the part I shall work on. Sorry if I am not familiar with pysam, it is totally ok if you just close the issue! Best

GMFranceschini commented 9 months ago

Here is a small update on this matter: I think I managed to create a bisulfite-compatible version of the bamboozle script, leaving only CpG sites intact and converting everything else to the reference (and sacrificing Cs in other contexts). Still, SNPs can affect CpGs, so bases that are not bisulfite-compatible are converted to Ns. There might still be some information left if we consider a genome-wide assay, as one could still infer the genotype by guessing the variant with the higher MAF. I'll let you know if this progresses any further. In the meantime, thank you for providing this solid starting point!
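For readers who want to picture the rule, here is a minimal sketch of the logic as described in this comment. It is not the actual modified script: the function name is hypothetical, it works on plain strings with a gapless alignment, and it only covers reads where conversion shows up as C>T on the forward strand. A C at the last position of the segment is treated as non-CpG because the next reference base is not available in this simplified setting.

```python
def sanitize_keep_cpg(read_seq: str, ref_seq: str) -> str:
    """Keep read bases only at reference CpG cytosines.

    - Outside CpG cytosines: always emit the reference base.
    - At a CpG cytosine: keep C or T (bisulfite-compatible calls),
      emit N for anything else (possible SNP).
    Hypothetical helper: assumes equal-length, gapless input.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segment must have equal length")
    ref = ref_seq.upper()
    read = read_seq.upper()
    out = []
    for i, (read_base, ref_base) in enumerate(zip(read, ref)):
        is_cpg_c = ref_base == "C" and i + 1 < len(ref) and ref[i + 1] == "G"
        if not is_cpg_c:
            out.append(ref_base)      # revert to reference
        elif read_base in ("C", "T"):
            out.append(read_base)     # methylated (C) or converted (T)
        else:
            out.append("N")           # not bisulfite-compatible: mask
    return "".join(out)


# CpG at offset 1 keeps its T, non-CpG C at offset 4 is reset to C,
# and the C>A call at the CpG at offset 7 is masked with N.
print(sanitize_keep_cpg("GTGTTAAAG", "ACGTCAACG"))
```

Note that the retained C/T calls at CpG sites can still coincide with real C/T variants, which is the residual information mentioned above.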

cziegenhain commented 9 months ago

That sounds very nice, happy to hear that you are finding the basis of BAMboozle useful!
