Closed meganlsmith closed 1 year ago
Hey Megan! I think I understand the question but let me check. So you have UCE data and the calculation that you made that gave ~800K is the total length of all UCEs combined, right? You're right that the 4475 value in the monomorphic bin from easySFS is the number of polymorphic sites that became monomorphic after downsampling. In my understanding, the zero bin of the sfs should not "count" the complete sequence length as monomorphic sites, but I could be wrong!
In truth, I think what to do with the zero bin is different depending on what method you're using, so like with stairwayplot2 you don't include the zero bin, but you do tell it the total sequence length, or with fastsimcoal you can optionally include or exclude the zero bin, given some constraints, but I can't remember the exact fsc2 interpretation of the zero bin (whether it does or does not include ALL monomorphic sites from the full sequence). With a typical vcf file you often won't know the full sequence length from which the variable sites were extracted (for instance this is the case with RADSeq data). You probably know about this stuff as well as I do, so if I'm doing this wrong please let me know and we'll get it fixed!
What are you planning to do with the sfs downstream? Hope you are doing good!
Hi Isaac,
Thanks for the quick response! That is what I was asking.
Hmm, I think the monomorphic bin should include complete sequence length (corrected for downsampling) minus the number of polymorphic sites, but perhaps my thinking is wrong here. I am using the SFS to estimate parameters in fsc2. I typically don’t use the monomorphic cell for model selection, but do use it to estimate parameters. I looked at some simulations from fsc2 to try and get a better idea of what exactly it uses for the monomorphic cell.
I simulated 1526 loci of length 558 bp (so in total 851508 sites). The monomorphic cell of the SFS has 845865 in the simulation, which corresponds to that total number of sites minus the number of polymorphic SNPs in the SFS. So I think fsc is expecting the monomorphic cell to include total non-polymorphic sequence length.
Of course, it may be that different programs are expecting different things in this case.
And I hadn’t thought about the fact that many VCFs wouldn’t have this info. Of course that complicates things!
Thanks again, and hope things are going well for you!
Best, Megan
Megan Smith NSF Postdoctoral Fellow Department of Biology Indiana University, Bloomington, IN
On Jan 25, 2022, at 11:16 AM, Isaac Overcast @.***> wrote:
Hey Megan! I think I understand the question but let me check. So you have UCE data and the calculation that you made that gave ~800K is the total length of all UCEs combined, right? You're right that the 4475 value in the monomorphic bin from easySFS is the number of polymorphic sites that became monomorphic after downsampling. In my understanding, the zero bin of the sfs should not "count" the complete sequence length as monomorphic sites, but I could be wrong!
In truth, I think what to do with the zero bin is different depending on what method you're using, so like with stairwayplot2 you don't include the zero bin, but you do tell it the total sequence length, or with fastsimcoal you can optionally include or exclude the zero bin, given some constraints, but I can't remember the exact fsc2 interpretation of the zero bin (whether it does or does not include ALL monomorphic sites from the full sequence). With a typical vcf file you often won't know the full sequence length from which the variable sites were extracted (for instance this is the case with RADSeq data). You probably know about this stuff as well as I do, so if I'm doing this wrong please let me know and we'll get it fixed!
What are you planning to do with the sfs downstream? Hope you are doing good!
— Reply to this email directly, view it on GitHub https://github.com/isaacovercast/easySFS/issues/47#issuecomment-1021355969, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DGNXVIJTICVKDC4EL7RTUX3EEBANCNFSM5MYTCJJQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.
Hey Megan,
Yeah, this is a use case I had not considered, and different downstream programs do expect different things. When I had run fsc2 in the past I had always configured it to generate independent unlinked snps (I think using the FREQ data type, but it's been a while so I don't have a perfect memory of it). I would be super happy to modify easySFS to produce the zero bin output that you are after. What do you think is the best way to do it?
Given that vcf files often will not have information about total sequence length, one option I can think of is to add a command line flag (-l
for example) to pass in the total sequence length, and then we can do the calculation internally to determine counts in the zero bin taking into account the # of polymorphic sites. I could imagine this being a pretty straightforward modification of the current code. Can you think of any other ways we could do it? Happy to hear other ideas!
Hope you are doing well! -isaac
Hi Isaac,
I think you’re correct that the best way to do this is to add a command line flag to pass in total sequence length. Not sure what changes would be needed for people using unlinked SNPs, since that sort of downsampling should probably be considered when calculating the monomorphic sites. It might be helpful just to note that in the documentation, and leave it to the user to correct for any prior downsampling when providing total sequence length.
Thanks for replying so quickly to this!
Best, Megan
Megan Smith NSF Postdoctoral Fellow Department of Biology Indiana University, Bloomington, IN
On Jan 25, 2022, at 9:46 PM, Isaac Overcast @.***> wrote:
Hey Megan,
Yeah, this is a use case I had not considered, and different downstream programs do expect different things. When I had run fsc2 in the past I had always configured it to generate independent unlinked snps (I think using the FREQ data type, but it's been a while so I don't have a perfect memory of it). I would be super happy to modify easySFS to produce the zero bin output that you are after. What do you think is the best way to do it?
Given that vcf files often will not have information about total sequence length, one option I can think of is to add a command line flag (-l for example) to pass in the total sequence length, and then we can do the calculation internally to determine counts in the zero bin taking into account the # of polymorphic sites. I could imagine this being a pretty straightforward modification of the current code. Can you think of any other ways we could do it? Happy to hear other ideas!
Hope you are doing well! -isaac
— Reply to this email directly, view it on GitHub https://github.com/isaacovercast/easySFS/issues/47#issuecomment-1021810434, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DGNW4Z2IMYDWCLCFWFJ3UX5N7BANCNFSM5MYTCJJQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.
Hi!
Allow me to step into the conversation, as I am coming across with the exact same problem as Megan. In my case, I am interested in generating the folded SFS and use it later for both stairwayplot2 and fsc2, and I am also obtaining extremely low counts for the monomorphic sites.
This is particularly problematic in my case as I am interested in the 2D SFS between populations, so the solution is not as simple as simply modifying the one-line SFS to correct the number of monomorphic sites. I think including a -l option is definitely the most straightforward way to incorporate the number of monomorphic sites that won't appear in the VCF files outputted by ipyrad.
Kind regards, Aleix
Hey Aleix,
Yeah, I looked at the easySFS code to see if it's a quick fix, and it's not exactly a quick fix. It'll take a little work, and at the moment I'm under a couple very pressing deadlines, so I probably won't get to it until March at the earliest. For the moment, to get you going, stairwayplot2 actually doesn't use the zero bin (you have to remove it manually anyway), and fsc2 has the --removeZeroSFS
which is what I've typically used in the past. Hopefully you can make forward progress with these things in the time before I get around to adding the -l
flag for easySFS. Good luck!
-isaac
Hi Isaac,
Thanks for the quick reply! What you propose for running both programs makes a lot of sense. I think it will be possible to proceed in the meantime.
Good luck, Aleix
I added a new --total-length
argument with takes the value of the full sequence length used to generate the input VCF and updates the zero bins of the sfs accordingly. Please test and let me know if it works or doesn't work for you.
Hi Isaac!
I'm using easySFS to build a site frequency spectrum from a vcf file that was produced with GATK (and edited in VCFtools).
It seems that the population of the monomorphic cell may be a bit off. The monomorphic cell contains 4474.7 sites, and there are ~5750 non-monomorphic sites. I think 4475 is actually the number of sites that were polymorphic in the VCF but were monomorphic after downsampling. When I calculate the total sequence length from the VCF, and then multiply that by the proportion of polymorphic sites retained after downsampling, I get ~800 K, and I think this (minus the number of polymorphic sites) should be the number I'm using to populate the monomorphic cell.
I just wanted to check with you about whether there is something about the vcf format I'm using (attached), since I know that easySFS was designed for use with RAD-style datasets. Thanks for any insight you have!
Command to run easySFS: python /N/u/mls16/Quartz/programs/easySFS/easySFS.py -i ./data/All_SNPs/panmixia_native_SNP_75p.recode.vcf -a -p ./data/easysfs_pops_native.txt --proj 276
data.zip