Hi @vilperte,
so, if I understand you correctly, you simply want to merge the per-position nucleotide counts of several files into one (by adding up the numbers), in order to get the same effect that merging the fastq files would have? That is a command that I intended to write anyway, and I can prioritize it now if there is need ;-)
Also, I would recommend not using variant callers on your data. See our preprint, where we examine the effects of using typical variant callers (which are written with the statistics of individual-based sequencing in mind) on pooled sequencing data: they introduce skews in the counts due to their statistical processing. Instead, you can simply process the mapped files (BAM or the like) directly and avoid this intermediate step; that should give cleaner results. Or is there a particular reason why you want to use (individual-based) variant callers?
Cheers and so long Lucas
Hi Lucas,
Thanks for the reply. In principle it is one multi-sample VCF, and I would like to create groups of those samples (e.g. group1 contains samples A, B and C, and group2 contains samples D, E and F). The advantage there is that I could create different groups on demand and calculate the different diversity measures, or even calculate FST between groups. If I work at the BAM level, I would have to keep merging BAM files and generating a huge amount of temporary data.
I have processed the data using the pooled option in GATK (although it is optimized for human data), and I am now comparing the VCF-based versus BAM-based results from grenedalf. But I am aware that the results from the mapping files are more accurate.
Oh I see, so you particularly want to avoid creating files. My suggestion to have a command for merging files into a new one then won't work?
Well, instead of creating files, doing this "on the fly" while reading the files is of course perfectly possible, and not even that hard to implement in my code base. However, the issue that I see is how to provide an elegant command line interface for that, so that users are not overwhelmed or confused... I guess a simple table file might work, with two columns: one for the sample names of your input, and one that assigns each name to a group name, which will then be used as the new name for that merged "sample". What do you think? Seems useful to me! But that would be a bit more work to implement - I'll see when I get to it.
The table option seems great and super easy to handle; it would definitely be useful. If I understood correctly, it would look similar to the table below:
Group | Sample_Name |
---|---|
GroupA | Sample1 |
GroupA | Sample2 |
GroupA | Sample3 |
GroupB | Sample4 |
GroupB | Sample5 |
GroupB | Sample6 |
And the output of running `grenedalf diversity` would look like this:
chrom | start | end | GroupA.snp_count | GroupA.coverage_fraction | GroupA.theta_pi_abs | GroupA.theta_pi_rel | GroupA.theta_watterson_abs | GroupA.theta_watterson_rel | GroupA.tajimas_d | GroupB.snp_count | GroupB.coverage_fraction | GroupB.theta_pi_abs | GroupB.theta_pi_rel | GroupB.theta_watterson_abs | GroupB.theta_watterson_rel | GroupB.tajimas_d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr1 | 10001 | 20000 | 28 | 0.081 | 18.274612 | 0.0225056798 | 21.4851791 | 0.0264595802 | -0.0934498091 | 165 | 0.711 | 142.083395 | 0.0199892228 | 142.364211 | 0.0200287298 | -0.00123580431 |
chr1 | 20001 | 30000 | 124 | 0.278 | 51.7982791 | 0.0186593224 | 66.8736422 | 0.0240899288 | -0.141222423 | 345 | 0.987 | 165.046393 | 0.0167152514 | 174.814639 | 0.0177045411 | -0.0350090809 |
chr1 | 30001 | 40000 | 123 | 0.420 | 61.0260734 | 0.0145438688 | 71.1787826 | 0.0169634849 | -0.0893563396 | 184 | 0.981 | 83.4635721 | 0.0085071422 | 90.6548843 | 0.00924012683 | -0.0496978445 |
Exactly! Would that solve your issue? I thought a bit more about this, and can probably implement that without too much effort or computational overhead ;-)
Yes, that would work pretty well :)
Okay, done. The `dev` branch now provides an option `--sample-group-merge-table-file`, as discussed above. Please check it out, and let me know if that works for you!
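A sketch of what such a group table file could look like (the exact header names and separator shown here are illustrative; the option's help text describes the authoritative format):

```
sample	group
Sample1	GroupA
Sample2	GroupA
Sample3	GroupA
Sample4	GroupB
Sample5	GroupB
Sample6	GroupB
```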
(Small warning: I am restructuring the internal code quite a bit at the moment, so I hope I didn't break anything. I briefly tested, and it works fine, but let me know should you encounter any trouble. More thorough testing will follow once the restructuring is complete)
Hi Lucas,
Thanks for implementing it so quickly, that's great! I will test it in the next few days and report back ;)
Hi again Lucas,
I have tested the merging option and it worked pretty well. The only thing I noticed is the huge amount of RAM needed.
I currently have a BCF with 400 samples (~144M variants). In the standard mode, it took 8 hours to run the `diversity` subcommand (with the `--measure all` option), using 16 threads and around 10GB of RAM, producing measures for all 400 samples.
For the same BCF, I used the `--sample-group-merge-table-file` option to create two groups of around 200 samples each, and it was not able to run on the full BCF. After splitting the calculations per chromosome (each chromosome around 60MB), it managed to finish with an average of 5 hours per chromosome, using 16 threads and 150GB of RAM.
Maybe it would be a good idea to mention in the parameter details that a large amount of RAM might be needed.
Thanks a lot for the implementation, I will use it more frequently now.
Cheers,
Vinicius
Hi @vilperte,
thanks for reporting this - that is super weird, as the whole thing is written to be highly optimized for RAM usage. Can you please send me the exact command that you ran? Might be a bug, or some other thing that I might have missed.
Cheers Lucas
Command
grenedalf_dev/bin/grenedalf diversity --vcf-path input.vcf \
--reference-genome-fasta-file reference.fasta \
--filter-region chr1 \
--filter-samples-exclude sampleX \
--sample-group-merge-table-file sample_groups.txt \
--window-type sliding \
--window-sliding-width 500000 \
--pool-sizes 10 \
--filter-sample-min-count 2 \
--filter-sample-min-coverage 2 \
--measure all \
--file-prefix merged_chr1_ \
--out-dir results/merged \
--threads 16 \
--log-file merged_chr1.log
Output log
[grenedalf ASCII-art logo banner]
v0.3.0 (c) 2020-2024 by Lucas Czech
Command: grenedalf diversity
Input SAM/BAM/CRAM:
--sam-path []
--sam-min-map-qual 0
--sam-min-base-qual 0
--sam-split-by-rg
--sam-flags-include-all
--sam-flags-include-any
--sam-flags-exclude-all
--sam-flags-exclude-any
Input (m)pileup:
--pileup-path []
--pileup-min-base-qual 0
--pileup-quality-encoding sanger
Input sync:
--sync-path []
Input VCF/BCF:
--vcf-path "input.bcf"
Input frequency table:
--frequency-table-path []
--frequency-table-separator-char comma
--frequency-table-int-factor 0
--frequency-table-freq-is-ref
--frequency-table-chr-column
--frequency-table-pos-column
--frequency-table-ref-base-column
--frequency-table-alt-base-column
--frequency-table-sample-ref-count-column
--frequency-table-sample-alt-count-column
--frequency-table-sample-freq-column
--frequency-table-sample-cov-column
Input Settings:
--multi-file-locus-set union
--reference-genome-fasta-file "reference.fasta"
--reference-genome-dict-file
--reference-genome-fai-file
Sample Names, Groups, and Filters:
--rename-samples-file
--filter-samples-include
--filter-samples-exclude "sampleX"
--sample-group-merge-table-file "sample_groups.txt"
Region Filters:
--filter-region "chr1"
--filter-region-list []
--filter-region-bed []
--filter-region-gff []
--filter-region-map-bim []
--filter-region-vcf []
--filter-region-set union
Numerical Filters:
--filter-sample-min-count 2
--filter-sample-max-count 0
--filter-sample-tolerate-deletions
--filter-sample-min-coverage 2
--filter-sample-max-coverage 0
Window:
--window-type "sliding"
--window-sliding-width 500000
--window-sliding-stride 0
--window-queue-count 0
--window-queue-stride 0
--window-region []
--window-region-list []
--window-region-bed []
--window-region-gff []
--window-region-skip-empty
Settings:
--pool-sizes 10
--measure "all"
Formatting:
--separator-char comma
--na-entry nan
Compatibility:
--popoolation-corrected-tajimas-d
--popoolation-format
Output:
--out-dir "results/merged"
--file-prefix "merged_chr1_"
--file-suffix
--compress
Global Options:
--allow-file-overwriting
--verbose
--threads 16
--log-file "merged_chr1.log"
Run the following command to get the references that need to be cited:
`grenedalf citation Czech2023-grenedalf Kofler2011-PoPoolation`
Started 2024-03-01 16:55:26
Reading reference genome fasta
Reference genome contains 611 chromosomes
Sample filter results in 423 samples out of 424 being used.
Applying union of 1 region filters, which in total leads to filtering for 1 chromosome, with 1 full chromosome(s)
Processing 2 samples
At chromosome .chr1
Processed 1 chromosome with 12540819 (non-filtered) positions in 133 windows.
Finished 2024-03-01 21:39:20
Hi Vinicius @vilperte,
okay, I understand what's happening now, and I was able to reproduce your issue with my test data by using the new sample group merging option. Due to the sample merging, you now get artificial samples with nucleotide counts that are way higher than what I had tested on before with real data. The diversity estimators have a term in their equations that depends on the coverage, which explains the long run time. Furthermore, there was a caching of intermediate values (to avoid re-computation) that was based on coverage: for each coverage level C, it stored C^2 values, which explains the exorbitant memory usage. That just never showed up before with real data (without the sample merging, and thus with lower coverage), and hence slipped my attention.
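To get a rough feeling for the scaling (with made-up numbers, as the per-sample coverage is not stated here): if each of ~200 merged samples contributes, say, 30x coverage, the merged "sample" sits at a coverage of C ≈ 6000, and a single coverage level then already accounts for

```
C² ≈ 6000² = 36,000,000 cached values
```

and with many distinct coverage levels occurring across positions, the cache quickly grows to billions of entries - which matches the kind of memory usage reported above.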
I have now revisited both issues, and solved them on the `dev` branch. For the memory usage, it turned out that these cached values are not necessary (the caching is already done at a higher level in the computation anyway), so that completely solved the memory usage with my test data.
And as for the computation time that depends on coverage: there is now an option `--subsample-max-coverage` in the `diversity` command to reduce high-coverage positions. I have followed the example of PoPoolation here, which has commands to subsample the coverage (see also this recent issue here), and added three different methods for this (selectable via `--subsample-method`): simple scaling (I think I'd recommend that one, and it's the default now), as well as re-sampling with and without replacement. I am not entirely sure why anyone would want resampling from a distribution based on the counts, when a simple linear scaling does the job of limiting coverage while maintaining the allele frequencies... but well, it's what PoPoolation does, so I thought it might be useful. Maybe @jeffspence knows? It was honestly a bit of a rabbit hole to get the resampling distributions right, no idea if that was needed and might be overkill, but it's there now ;-)
Note that this subsampling is meant as a way to keep positions with high coverage without having to remove them, but also without suffering from long runtimes. There is also the option `--filter-sample-max-coverage`, which is meant to remove positions with spuriously high coverage. These options will soon both be usable at the same time, that is, one could filter out everything above, say, 25k coverage, and resample everything that remains to a max of 10k. However, the new subsampling option unfortunately comes at a time at which I am restructuring the code around all the filtering, so for now it takes precedence over the existing max coverage filter, which will hence not trigger, because the coverage will already have been reduced to 10k before the filtering for 25k is done... As you are not using the `--filter-sample-max-coverage` filter, that should be okay for you, and I didn't want to make you wait too long for the new option, but you should be aware of it. It will be fixed in the next couple of weeks, I hope!
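As a usage sketch with placeholder values (the remaining options from your command above stay as they were; `10000` is just an example cap, and `--subsample-method` can be added to switch from the default simple scaling to one of the re-sampling variants):

```
grenedalf diversity \
    --vcf-path input.bcf \
    --sample-group-merge-table-file sample_groups.txt \
    --subsample-max-coverage 10000 \
    --measure all \
    --out-dir results/merged
```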
Please get the latest `dev` branch, and test if that solves your issues! Let me know what you think!
Cheers and so long Lucas
PS: Using 16 threads in your case will not do much. You are reading from a single file, and you are merging everything into only two groups, so the potential for parallelization is not big. I think if this is cluster time that you need to pay for, using 4 threads should suffice in your case.
PPS: Parsing VCF is rather slow. As you are reading the file multiple times, it might make sense to convert it to sync format once. That should be way faster in the end.
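A possible workflow sketch for that (the conversion subcommand name and output file name here are assumptions; `grenedalf --help` lists the actual conversion commands, and `--sync-path` is the corresponding input option of `diversity`):

```
# Convert the BCF to sync format once (subcommand name assumed)
grenedalf sync --vcf-path input.bcf --out-dir converted

# Then run the analyses on the sync file instead of re-parsing the BCF each time
grenedalf diversity --sync-path converted/sync.sync --sample-group-merge-table-file sample_groups.txt ...
```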
Hi @lczech,
I have tested the new parameter a bit more and it works smoothly. I also tried passing two distinct VCFs together with `--sample-group-merge-table-file`, and it works well.
I do have one question when using this parameter. At the moment, I am setting `--pool-sizes` to the individual value for each sample (10 in my case). I was wondering if that is the correct way, or whether it should be adjusted to reflect the final merged sample (e.g. 10 samples to be merged, each with a pool size of 10, giving `--pool-sizes 100`)?
Thanks for the help.
Hi @vilperte,
> I have tested the new parameter a bit more and it works smoothly. I also tried passing two distinct VCFs together with `--sample-group-merge-table-file`, and it works well.
Nice, happy to hear! No more high memory usage then? Even without the sub-sampling, it should be good now. If you have only tested with sub-sampling, would you mind briefly checking that the memory is not as high any more even without sub-sampling? Just want to make sure that I actually fixed your issue on your data, and not just mine :-)
> I do have one question when using this parameter. At the moment, I am setting `--pool-sizes` to the individual value for each sample (10 in my case). I was wondering if that is the correct way, or whether it should be adjusted to reflect the final merged sample (e.g. 10 samples to be merged, each with a pool size of 10, giving `--pool-sizes 100`)?
Ah, if you are just using `--pool-sizes` with a single value, that value is applied to all final samples that are used in the computation, without modification. Good point, I'll amend the documentation of the `--sample-group-merge-table-file` option to clarify that. So you'd need to use the sum of all pool sizes for each merged group. I assume that not all groups have the same number of samples in them, and hence not the same sum of pool sizes. In that case, it is probably easiest to provide a file instead of a single number for `--pool-sizes`. That file just needs to contain the group names (same as in the `--sample-group-merge-table-file`) and the sum of pool sizes of each group.
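A sketch of what such a pool sizes file could look like, using roughly 200 samples per group with a pool size of 10 each (the header names and separator shown here are illustrative; the `--pool-sizes` documentation has the exact format):

```
group	pool-size
GroupA	2000
GroupB	2000
```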
Hope that helps, and so long Lucas
I actually tested without the sub-sampling, and there were no high RAM spikes ;)
I expected that to be the case for the pool sizes, and even tested it, but thought I should clarify. Thanks again for the quick answer.
Cheers,
Vinicius
Awesome, thanks for the confirmation!
Shall we close this issue then? Feel free to re-open if needed, or start a new one if you have any further questions or suggestions :-)
Hi,
I am working with a dataset of pooled, naturally occurring populations, which has been processed with a variant caller (GATK). I have already calculated the different diversity measures per sample, but I was wondering if it would be possible to create groups of samples (e.g. 10 samples for group A and 10 samples for group B) and then calculate per-group diversity measures?
It would of course be possible to merge at the fastq or BAM level, but that would lead to a lot of extra preprocessing.
Thank you.