sandberg-lab / dataprivacy

GNU General Public License v3.0
14 stars 4 forks source link

How do I confirm anonymization of bam files #4

Open mropri92 opened 1 year ago

mropri92 commented 1 year ago

Hi,

Thanks you for this awesome tool. Just a question I had was, how would I confirm that after running BAMboozle on my bam file that it actually has been anonymized? Appreciate any help.

cziegenhain commented 1 year ago

Hi,

If you want to be super precise, I recommend to pileup all reads over all positions using samtools mpileup http://www.htslib.org/doc/samtools-mpileup.html

mropri92 commented 1 year ago

Thank you for your help. Just to make sure I understand, once I run mpileup on my before BAMboozle bam file and mpileup on my after BAMboozle bam file, I can compare the base calls in the read based column? Sorry for my naivety, and am very thankful for your help.

cziegenhain commented 1 year ago

No problem! You don't necessarily need to run mpileup on the original bam file, since you are only really interested in to check that there is no more bases mismatching to reference after running BAMboozle!

mropri92 commented 1 year ago

Sounds good, thanks for your help and time!

mropri92 commented 1 year ago

I was able to use mpileup to look at one bam file and determine that it was anonymized. However, this was tedious. Is there a quicker, more streamlined way to determine if my file is anonymized? I only ask as I have to do this for multiple files. Sorry for the bother again. And as always, appreciate your help.

cziegenhain commented 1 year ago

Hi, Not sure what you mean with tedious here. You could think about just starting multiple/all mpileup calls in parallel.

mropri92 commented 1 year ago

Hi,

Sorry for not being clear on the tedious part. I guess when you advised to check that there are no more bases mismatching to the reference after running BAMboozle, I took it to mean to check for every position in the bam file. Is there a specific position that we are just looking for to check that there are no mismatches?

cziegenhain commented 1 year ago

Exactly, the proper way of verifying would indeed be to look at all covered positions!

mropri92 commented 1 year ago

Gotcha, thank you again for helping me with this and taking the time to explain. Appreciate your help!

sropri commented 1 year ago

Jut a further question for clarification. If we are checking in our mpileup output that after running BAMboozle there should be no more mismatching bases to the reference genome, wouldn't SNPs be called as mismatching? And wouldn't this give you some number of mismatching bases? How does BAMboozle account for that? Appreciate your help.

cziegenhain commented 1 year ago

Hi,

Not sure if I understand your question correctly. Since the goal of BAMboozle is to remove any donor genetic variation, there should be no more SNPs present in the data after BAMboozling the reads.

sropri commented 1 year ago

Oh, so the SNPs are anonymized as well to remove donor genetic variation. Sorry for the confusion. I just had one quick question as well. After the mpileup output file, is there a command that you use to just give you the number of mismatches contained in the output file? Sorry for my question, am just new to samtools and appreciate your help.

sropri commented 1 year ago

Because from my understanding, I should be looking in the 5th column of the mpileup output file for any occurences of ACGTN or agctn as they represent mismatches. Thus, I was running this command:

file="output.txt"

Count occurrences of "ACGTN" or "acgtn" in the 5th column

total_occurrences=$(awk -F'\t' '{count += gsub(/[ACGTNacgtn]/, "");} END{print count}' "$file")

echo "Total occurrences of ACGTN or acgtn in the 5th column: $total_occurrences"

Appreciate your guidance in this

cziegenhain commented 1 year ago

The text format of mpileup is a bit difficult to parse I think, so for me the easiest way is to use the bcftools branch which has VCF format support like this:

  1. bcftools mpileup --fasta-ref your_ref.fa --max-depth 10000 --threads 10 --output-type z --output bamboozle.mpileup.vcf.gz your_bamboozle_output.bam
  2. bcftools view --min-alleles 3 --threads 10 -O z -o bamboozle.mpileup.filter.vcf.gz bamboozle.mpileup.vcf.gz

The filtered output VCF should then be completely empty (other than the header).

Hope that helps.

sropri commented 1 year ago

This great. Thanks for your help. Will try this out and sorry for the bother.