ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Number of sites in VCF does not correspond to -M setting? #38

Closed gwct closed 2 years ago

gwct commented 2 years ago

Describe the bug
This isn't really a bug, more likely a need for clarification. Say I specify -M 0.1. I expected that 10% of sites in the input FASTA file would be labeled as SNPs in the golden VCF file. However, when I do this for a single chromosome of length 61,420,004, I find only 1,164,343 unique sites with SNPs in the golden VCF file, which is only 1.9% of the total sites.

To Reproduce
Steps to reproduce the behavior:

python gen_reads.py -r {input.fa} -o {prefix} --bam --vcf -R 150 --pe 300 30 -c 20 -M 0.1
zgrep -v "#" {prefix}_golden.vcf.gz | cut -f 1,2 | sort | uniq | wc -l

Expected behavior
I expected that 10% of sites would have SNPs with -M 0.1, but I'm probably misunderstanding something.

Any clarification would be helpful! Thanks!

gwct commented 2 years ago

Hello, any insight into this issue would be appreciated. Thanks!

joshfactorial commented 2 years ago

We're replicating this issue and we'll provide some insight soon. Thank you for your patience.

joshfactorial commented 2 years ago

Okay, I've looked into this in greater detail. This part of the code is a carryover from Zach Stephens' original research, so I'm not 100% familiar with all the math they originally came up with. We're looking to improve the statistical modeling, but as it functions now, it uses the input mutation rate to build a Poisson distribution over the possible number of mutations to add. For example, if you have a thousand bases with a 0.1 mutation rate, the number of variants to add will be somewhere between about 0 and ~150. The exact number is a fraction of the maximum number of mutations (0.3 * sequence length), with a filter for very unlikely events (the max and min, for example). Then, to determine how many mutations to add, it samples from that Poisson distribution. So the number of mutations is probabilistic, based on the mutation rate entered.
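
In rough terms, the sampling described above looks something like the following sketch (not NEAT's actual code; the function name and the exact capping rule are assumptions for illustration):

import numpy as np

rng = np.random.default_rng()

def sample_mutation_count(sequence_length, mutation_rate):
    # Expected number of mutations for this window at the given rate.
    expected = mutation_rate * sequence_length
    # Cap very unlikely events at the stated maximum of 0.3 * sequence length.
    max_mutations = int(0.3 * sequence_length)
    # Draw the actual count from a Poisson distribution centered on the expectation.
    return min(rng.poisson(expected), max_mutations)

# For a 1,000 bp window at -M 0.1, draws typically land near 100.
print(sample_mutation_count(1000, 0.1))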

joshfactorial commented 2 years ago

Improving this method will be one of the avenues for future development.

gwct commented 2 years ago

Ok, thanks for the explanation. I guess I'm still a little confused. I've run this multiple times with the same -M 0.1 setting and consistently get around 2% of sites with variants in the golden VCF file. I understand that sampling sites for mutations is probabilistic, so the count won't be exactly the same each time, but this consistently low sampling suggests I'm still misunderstanding something. Does -M 0.1 mean that, on average, 10% of sites should have mutations inserted that show up in the golden VCF?

This excerpt from the paper is what made me think this is the case:

If no such BED file is provided, the mutations will be inserted across the window 
such that the total number of mutations is determined by multiplying the length 
of the window by the desired overall mutation rate.

So if -M is 0.1 and the window size is 1,000 bp, that should give (again, on average) 100 positions with mutations inserted?

Thanks for helping me parse through this.

gwct commented 2 years ago

I also want to note that -M does seem to scale with different values as expected. If I give it a smaller value, say -M 0.02, I get a lower fraction of sites with mutations inserted; in this case it is consistently around 0.8%, which is lower than the fraction with -M 0.1, as expected (though here I would have expected 2%).

joshfactorial commented 2 years ago

Right. One wrinkle might be if you have a lot of N bases, like in the telomere sections of HG38; NEAT basically skips those. I also think there were refinements to the code after the paper was written. For example, they incorporated the trinucleotide context into deciding how and when to mutate a base. What I can do for now, while we work to refine that part of the code long term, is check that we didn't introduce an error in transitioning from Python 2 to Python 3.

joshfactorial commented 2 years ago

I checked the numbers versus the original code and they also seem to skew low. Basically, it builds a Poisson distribution centered around the 10% mark (for -M 0.1) and then randomly samples from that to get the actual number. It could be that there's a problem with either the Poisson distribution setup or the sampling technique used. It also spreads the mutation rate across ploids, SNPs, and indels. So, for example, if it has determined that SNPs make up 15% of total mutations, it will multiply the mutation rate by the length of the sequence, by 0.5 per ploid, and by 15% for SNPs and 85% for indels, which gives the total number of SNPs. But I think you have to take the ploid into account too when adding them up.

For example, if NEAT indicates WP=1/1, that means it has the ALT on both alleles, which counts as 2 mutations as far as NEAT is concerned.
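
As a rough worked example of that bookkeeping (hypothetical function name and numbers that just restate the arithmetic above, not NEAT's actual code):

def expected_variant_counts(sequence_length, mutation_rate, snp_fraction=0.15, ploidy=2):
    # Total expected mutations for the sequence at the given rate.
    total = mutation_rate * sequence_length
    # Split across ploids (0.5 each for a diploid), then by variant type.
    per_ploid = total / ploidy
    return {
        "snps_per_ploid": per_ploid * snp_fraction,
        "indels_per_ploid": per_ploid * (1 - snp_fraction),
    }

# A homozygous record (WP=1/1) contributes one mutation on each ploid, i.e. 2 in total.
print(expected_variant_counts(1000, 0.1))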

joshfactorial commented 2 years ago

I think I may have tracked down a bug that causes the number of SNPs/indels to skew lower than it should. I'll try out some fixes and see if that brings the final mutation rate closer to what we expect.

joshfactorial commented 2 years ago

Please pull the latest version of master and re-run and see if the counts are more in line with the input mutation rate. There was a problem with the way the Poisson distribution was set up, which was skewing all the values too low.

joshfactorial commented 2 years ago

I ran this on a small dataset (H1N1 virus, HA contig, length 1701 bases):

python gen_reads.py -r /home/joshfactorial/Documents/neat_data/H1N1/H1N1_HA.fa -R 150 -o /home/joshfactorial/Documents/new_neat --vcf --no-fastq --pe 300 30 -c 20 -M 0.1

zgrep -v "#" new_neat_golden.vcf.gz | wc -l
176

I ran several iterations:

zgrep -v "#" new_neat_golden.vcf.gz | wc -l
157
zgrep -v "#" new_neat_golden.vcf.gz | wc -l
146

All came out to be in the 10% range.
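
For reference, the same check can be scripted; a small sketch that counts golden-VCF records the same way as the zgrep command above (records, not unique sites), using the output file name from that run:

import gzip

def observed_mutation_rate(golden_vcf_gz, contig_length):
    # Count non-header records in the golden VCF and divide by the contig length.
    with gzip.open(golden_vcf_gz, "rt") as vcf:
        variants = sum(1 for line in vcf if not line.startswith("#"))
    return variants / contig_length

# e.g. 176 records on the 1,701 bp HA contig gives roughly 0.103.
print(observed_mutation_rate("new_neat_golden.vcf.gz", 1701))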

gwct commented 2 years ago

Re-running the new version with -M 0.1 results in 9.8% of sites with mutations. Definitely more in line with my expectations! I'll let you know if there are any inconsistencies between runs or with different values of -M, but this seems to have fixed the problem. Thanks!

gwct commented 2 years ago

Hi again. Unfortunately, I'm now getting an error when running gen_reads.py. Strangely, it works for one reference file but not another. I know this likely means an issue with the files, but I was hoping you could help me track it down based on the error that comes out. This was not happening before I pulled the update with the fix for -M that we've been discussing, so I thought I'd keep it in this issue, but I'm happy to move it to another issue if that is preferable.

When I run this command with one reference file:

python gen_reads.py -r <REFERENCE FASTA 1> -o <OUTPUT PREFIX 1> --bam --vcf -R 150 --pe 300 30 -c 20 -M 0.02

The program starts simulating reads:


Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
Index not found, creating one...
0.549 (sec)
reading 19...
59.066 (sec)
--------------------------------
sampling reads...

However the same command with a different reference:

python gen_reads.py -r <REFERENCE FASTA 2> -o <OUTPUT PREFIX 2> --bam --vcf -R 150 --pe 300 30 -c 20 -M 0.02

results in the following error:

Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
Index not found, creating one...
1.500 (sec)
reading 18...
Traceback (most recent call last):
  File "/n/home07/gthomas/projects/Mapping-simulations/pkgs/NEAT/gen_reads.py", line 891, in <module>
    main()
  File "/n/home07/gthomas/projects/Mapping-simulations/pkgs/NEAT/gen_reads.py", line 411, in main
    (ref_sequence, n_regions) = read_ref(reference, ref_index[chrom], n_handling)
  File "/n/home07/gthomas/projects/Mapping-simulations/pkgs/NEAT/source/ref_func.py", line 136, in read_ref
    temp = MutableSeq(my_dat)
NameError: name 'MutableSeq' is not defined

Do you have any ideas why this might be happening? Thanks!

joshfactorial commented 2 years ago

Yeah, that's an import error. I will push a fix for that this afternoon.

gwct commented 2 years ago

Great, thanks! Any idea why it would be having the problem with one reference file but not the other?

joshfactorial commented 2 years ago

I uploaded a fix. The call to MutableSeq that had no corresponding import was in the section of the code that reads the reference. I suspect your first reference may already have had an index file from a previous run and so was able to skip that section of the code, but running on a new FASTA hit the bugged function. The latest version of master should be fully functional.
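
For anyone who hit the same NameError before pulling the fix: the missing name comes from Biopython's Bio.Seq module, so the repair is essentially an import along these lines (a sketch, not the exact commit; the sample string is a stand-in for the sequence read from the FASTA):

from Bio.Seq import MutableSeq

# Stand-in for the raw reference string that ref_func.py reads from the FASTA.
my_dat = "ACGTNNACGT"
# MutableSeq allows in-place edits to the sequence, unlike an immutable Seq.
temp = MutableSeq(my_dat)
temp[0] = "T"
print(temp)  # TCGTNNACGT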

joshfactorial commented 2 years ago

Let me know if you have further problems.

gwct commented 2 years ago

A quick check shows that both reference files now get to the "sampling reads..." step, so the fix appears to be working. And that explains why one was working and the other wasn't: the one that worked was the one I had been using to test before. Thanks!