lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
476 stars 131 forks source link

Biostar404363 – unequal AF instead of ambiguity code #226

Closed JMIBarth closed 1 year ago

JMIBarth commented 1 year ago

I am using the tool “Biostar404363” (introduce artificial mutations in a bam file) to simulate mutations with an allele frequency of 0.5 and I noticed that whenever there is an unequal number of reads, the base in the first listed read is changed to ambiguity code. Yet, for my application, I would rather need a slightly unequal frequency in cases of unequal read number (e.g., 10 times base A and 11 times base T) instead of the ambiguity coding (10 A, 10 T, and one W).

Would be possible to implement such an option that lets you choose to have either ambiguity code or a slightly unequal allele frequency in cases of unequal read number?

lindenb commented 1 year ago

instead of the ambiguity coding (10 A, 10 T, and one W).

I don't really understand , what is that 'W' ? where does it come from. You could provide a minimum example please.

JMIBarth commented 1 year ago

Apologies for not being precise. I assume its the IUPAC ambiguity code for the bases A, C, G, T – where "W" has the meaning that it can be either A or T. In the code on line 112 (https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar404363.java) is an example for allele frequency (AF) 0.5 where "M" is introduced instead of base C or A.

lindenb commented 1 year ago

where "M" is introduced instead of base C or A.

sorry, that's still not clear to me

the line you're pointing is the consensus line, it's not part of any read.

JMIBarth commented 1 year ago

OK, that explains it. Thanks and sorry for the confusion. I think I didn't quite understand how the program works. Could you please clarify, whether bases of half of the reads are replaced by the alternative allele (comment on line 108 in the script), or if each read is replaced by an alternative allele with probability 0.5 (if AF=0.5, see comment line 95 in the script)? Also, is there a rule regarding forward and reverse reads, i.e., equal frequency of the introduced mutation in forward and reverse reads? Thanks a lot for your help!

lindenb commented 1 year ago

AF=0.5

I use a random number generator . If the random number is lower than the value for AF ( https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar404363.java#L264 ), then the ALT allele with replace the original base in the read https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar404363.java#L268

Also, is there a rule regarding forward and reverse reads

no

JMIBarth commented 1 year ago

Thanks a lot for the swift response!