Hi @vilperte,
so, if I understand you correctly, you simply want to merge the per-position nucleotide counts of several files into one (by adding up the numbers), in order to get the same effect that merging the fastq files would have? That is a command that I intended to write anyway, and I can prioritize it now if there is need ;-)
Also, I would recommend not using variant callers on your data. See our preprint, where we examine the effects of using typical variant callers (which are written with the statistics of individual-based sequencing in mind) on pooled sequencing data: they introduce skews in the counts due to their statistical processing. Instead, you can simply process the mapped files (BAM or the like) directly and avoid this intermediate step; that should give cleaner results. Or is there a particular reason why you want to use (individual-based) variant callers?
Cheers and so long Lucas
Hi Lucas,
Thanks for the reply. In principle it is one multi-sample VCF, and I would like to create groups of those samples (e.g. group1 contains samples A, B and C, and group2 contains samples D, E and F). The advantage there is that I could create different groups on demand and calculate the different diversity measures, or even calculate FST between groups. If I work at the BAM level, I would have to keep merging BAM files and generating a huge amount of temporary data.
I have processed the data using the pooled option in GATK (although it is optimized for human data), and I am now comparing the VCF-based versus BAM-based results from grenedalf. But I am aware that the results from the mapping files are more accurate.
Oh I see, so you particularly want to avoid creating files. My suggestion to have a command for merging files into a new one then won't work?
Well, instead of creating files, doing this "on the fly" while reading the files is of course perfectly possible, and not even that hard to implement in my code base. However, the issue that I see is how to provide an elegant command line interface for that, so that users are not overwhelmed or confused... I guess a simple table file might work, with two columns: one for the sample names of your input, and one that assigns each name to a group name, which will then be used as the new name for that merged "sample". What do you think? Seems useful to me! But that would be a bit more work to implement - I'll see when I get to it.
The table option seems great and super easy to handle; it would definitely be useful. If I understood correctly, it would look similar to the table below:
Group | Sample_Name |
---|---|
GroupA | Sample1 |
GroupA | Sample2 |
GroupA | Sample3 |
GroupB | Sample4 |
GroupB | Sample5 |
GroupB | Sample6 |
And the output of running `grenedalf diversity` would look like this:
chrom | start | end | GroupA.snp_count | GroupA.coverage_fraction | GroupA.theta_pi_abs | GroupA.theta_pi_rel | GroupA.theta_watterson_abs | GroupA.theta_watterson_rel | GroupA.tajimas_d | GroupB.snp_count | GroupB.coverage_fraction | GroupB.theta_pi_abs | GroupB.theta_pi_rel | GroupB.theta_watterson_abs | GroupB.theta_watterson_rel | GroupB.tajimas_d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr1 | 10001 | 20000 | 28 | 0.081 | 18.274612 | 0.0225056798 | 21.4851791 | 0.0264595802 | -0.0934498091 | 165 | 0.711 | 142.083395 | 0.0199892228 | 142.364211 | 0.0200287298 | -0.00123580431 |
chr1 | 20001 | 30000 | 124 | 0.278 | 51.7982791 | 0.0186593224 | 66.8736422 | 0.0240899288 | -0.141222423 | 345 | 0.987 | 165.046393 | 0.0167152514 | 174.814639 | 0.0177045411 | -0.0350090809 |
chr1 | 30001 | 40000 | 123 | 0.420 | 61.0260734 | 0.0145438688 | 71.1787826 | 0.0169634849 | -0.0893563396 | 184 | 0.981 | 83.4635721 | 0.0085071422 | 90.6548843 | 0.00924012683 | -0.0496978445 |
Exactly! Would that solve your issue? I thought a bit more about this, and can probably implement that without too much effort or computational overhead ;-)
Yes, that would work pretty well :)
Okay, done. The `dev` branch now provides an option `--sample-group-merge-table-file`, as discussed above. Please check it out, and let me know if that works for you!
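A sketch of what such a group table file could look like (the exact header names and separator shown here are illustrative; the option's help text describes the authoritative format):

```
sample	group
Sample1	GroupA
Sample2	GroupA
Sample3	GroupA
Sample4	GroupB
Sample5	GroupB
Sample6	GroupB
```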
(Small warning: I am restructuring the internal code quite a bit at the moment, so I hope I didn't break anything. I briefly tested, and it works fine, but let me know should you encounter any trouble. More thorough testing will follow once the restructuring is complete)
Hi Lucas,
Thanks for implementing it so quickly, that's great! I will test it in the next few days and report back ;)
Hi again Lucas,
I have tested the merging option and it worked pretty well. The only thing I noticed is the huge amount of RAM needed.
I currently have a BCF with 400 samples (~144M variants). In the standard mode, it took 8 hours to run the `diversity` subcommand (with the `--measure all` option), using 16 threads and around 10GB of RAM, producing measures for all 400 samples.
For the same BCF, I used the `--sample-group-merge-table-file` option to create two groups of around 200 samples each, and it was not able to run on the full BCF. After splitting the calculations per chromosome (each chromosome around 60MB), it managed to finish with an average of 5 hours per chromosome, using 16 threads and 150GB of RAM.
Maybe it would be a good idea to mention in the parameter details that a large amount of RAM might be needed.
Thanks a lot for the implementation, I will use it more frequently now.
Cheers,
Vinicius
Hi @vilperte,
thanks for reporting this - that is super weird, as the whole thing is written to be highly optimized for RAM usage. Can you please send me the exact command that you ran? Might be a bug, or some other thing that I might have missed.
Cheers Lucas
Command
grenedalf_dev/bin/grenedalf diversity --vcf-path input.vcf \
--reference-genome-fasta-file reference.fasta \
--filter-region chr1 \
--filter-samples-exclude sampleX \
--sample-group-merge-table-file sample_groups.txt \
--window-type sliding \
--window-sliding-width 500000 \
--pool-sizes 10 \
--filter-sample-min-count 2 \
--filter-sample-min-coverage 2 \
--measure all \
--file-prefix merged_chr1_ \
--out-dir results/merged \
--threads 16 \
--log-file merged_chr1.log
Output log
[grenedalf ASCII-art logo banner]
v0.3.0 (c) 2020-2024 by Lucas Czech
Command: grenedalf diversity
Input SAM/BAM/CRAM:
--sam-path []
--sam-min-map-qual 0
--sam-min-base-qual 0
--sam-split-by-rg
--sam-flags-include-all
--sam-flags-include-any
--sam-flags-exclude-all
--sam-flags-exclude-any
Input (m)pileup:
--pileup-path []
--pileup-min-base-qual 0
--pileup-quality-encoding sanger
Input sync:
--sync-path []
Input VCF/BCF:
--vcf-path "input.bcf"
Input frequency table:
--frequency-table-path []
--frequency-table-separator-char comma
--frequency-table-int-factor 0
--frequency-table-freq-is-ref
--frequency-table-chr-column
--frequency-table-pos-column
--frequency-table-ref-base-column
--frequency-table-alt-base-column
--frequency-table-sample-ref-count-column
--frequency-table-sample-alt-count-column
--frequency-table-sample-freq-column
--frequency-table-sample-cov-column
Input Settings:
--multi-file-locus-set union
--reference-genome-fasta-file "reference.fasta"
--reference-genome-dict-file
--reference-genome-fai-file
Sample Names, Groups, and Filters:
--rename-samples-file
--filter-samples-include
--filter-samples-exclude "sampleX"
--sample-group-merge-table-file "sample_groups.txt"
Region Filters:
--filter-region "chr1"
--filter-region-list []
--filter-region-bed []
--filter-region-gff []
--filter-region-map-bim []
--filter-region-vcf []
--filter-region-set union
Numerical Filters:
--filter-sample-min-count 2
--filter-sample-max-count 0
--filter-sample-tolerate-deletions
--filter-sample-min-coverage 2
--filter-sample-max-coverage 0
Window:
--window-type "sliding"
--window-sliding-width 500000
--window-sliding-stride 0
--window-queue-count 0
--window-queue-stride 0
--window-region []
--window-region-list []
--window-region-bed []
--window-region-gff []
--window-region-skip-empty
Settings:
--pool-sizes 10
--measure "all"
Formatting:
--separator-char comma
--na-entry nan
Compatibility:
--popoolation-corrected-tajimas-d
--popoolation-format
Output:
--out-dir "results/merged"
--file-prefix "merged_chr1_"
--file-suffix
--compress
Global Options:
--allow-file-overwriting
--verbose
--threads 16
--log-file "merged_chr1.log"
Run the following command to get the references that need to be cited:
`grenedalf citation Czech2023-grenedalf Kofler2011-PoPoolation`
Started 2024-03-01 16:55:26
Reading reference genome fasta
Reference genome contains 611 chromosomes
Sample filter results in 423 samples out of 424 being used.
Applying union of 1 region filters, which in total leads to filtering for 1 chromosome, with 1 full chromosome(s)
Processing 2 samples
At chromosome .chr1
Processed 1 chromosome with 12540819 (non-filtered) positions in 133 windows.
Finished 2024-03-01 21:39:20
Hi Vinicius @vilperte,
okay, I understand what's happening now, and I was able to reproduce your issue with my test data by using the new sample group merging option. Due to the sample merging, you now get artificial samples with nucleotide counts that are way higher than what I had tested on before with real data. The diversity estimators have a term in their equations that depends on the coverage, which explains the long run time. Furthermore, there was a caching of intermediate values (to avoid re-computation) that was based on coverage: for each coverage level C, it stored C^2 values, which explains the exorbitant memory usage. That just never showed up before with real data (without the sample merging, and thus with lower coverage), and hence slipped my attention.
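To get a rough feeling for the scaling (with made-up numbers, as the per-sample coverage is not stated here): if each of ~200 merged samples contributes, say, 30x coverage, the merged "sample" sits at a coverage of C ≈ 6000, and a single coverage level then already accounts for

```
C² ≈ 6000² = 36,000,000 cached values
```

and with many distinct coverage levels occurring across positions, the cache quickly grows to billions of entries - which matches the kind of memory usage reported above.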
I have now revisited both issues, and solved them on the `dev` branch. For the memory usage, it turned out that these cached values are not necessary (the caching is already done at a higher level in the computation anyway), so that completely solved the memory usage with my test data.
And as for the computation time that depends on coverage: there is now an option `--subsample-max-coverage` in the `diversity` command to reduce high-coverage positions. I have followed the example of PoPoolation here, which has commands to subsample the coverage (see also this recent issue here), and added three different methods for this (selectable via `--subsample-method`): simple scaling (I think I'd recommend that one, and it's the default now), as well as re-sampling with and without replacement. I am not entirely sure why anyone would want resampling from a distribution based on the counts, when a simple linear scaling does the job of limiting coverage while maintaining the allele frequencies... but well, it's what PoPoolation does, so I thought it might be useful. Maybe @jeffspence knows? It was honestly a bit of a rabbit hole to get the resampling distributions right, no idea if that was needed and might be overkill, but it's there now ;-)
Note that this subsampling is meant as a way to keep positions with high coverage without having to remove them, but also without suffering from long runtimes. There is also the option `--filter-sample-max-coverage`, which is meant to remove positions with spuriously high coverage. These options will soon both be usable at the same time, that is, one could filter out everything above, say, 25k coverage, and resample everything that remains to a max of 10k. However, the new subsampling option unfortunately comes at a time at which I am restructuring the code around all the filtering, so for now it takes precedence over the existing max coverage filter, which will hence not trigger, because the coverage will already have been reduced to 10k before the filtering for 25k is done... As you are not using the `--filter-sample-max-coverage` filter, that should be okay for you, and I didn't want to make you wait too long for the new option, but you should be aware of it. It will be fixed in the next couple of weeks, I hope!
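As a usage sketch with placeholder values (the remaining options from your command above stay as they were; `10000` is just an example cap, and `--subsample-method` can be added to switch from the default simple scaling to one of the re-sampling variants):

```
grenedalf diversity \
    --vcf-path input.bcf \
    --sample-group-merge-table-file sample_groups.txt \
    --subsample-max-coverage 10000 \
    --measure all \
    --out-dir results/merged
```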
Please get the latest `dev` branch, and test if that solves your issues! Let me know what you think!
Cheers and so long Lucas
PS: Using 16 threads in your case will not do much. You are reading from a single file, and you are merging everything into only two groups, so the potential for parallelization is not big. I think if this is cluster time that you need to pay for, using 4 threads should suffice in your case.
PPS: Parsing VCF is rather slow. As you are reading the file multiple times, it might make sense to convert it to sync format once. That should be way faster in the end.
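A possible workflow sketch for that (the conversion subcommand name and output file name here are assumptions; `grenedalf --help` lists the actual conversion commands, and `--sync-path` is the corresponding input option of `diversity`):

```
# Convert the BCF to sync format once (subcommand name assumed)
grenedalf sync --vcf-path input.bcf --out-dir converted

# Then run the analyses on the sync file instead of re-parsing the BCF each time
grenedalf diversity --sync-path converted/sync.sync --sample-group-merge-table-file sample_groups.txt ...
```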
Hi @lczech,
I have tested the new parameter a bit more and it works smoothly. I also tried passing two distinct VCFs together with `--sample-group-merge-table-file`, and it works well.
I do have one question when using this parameter. At the moment, I am setting `--pool-sizes` to the individual value for each sample (10 in my case). I was wondering if that is the correct way, or whether it should be adjusted to reflect the final merged sample (e.g. 10 samples to be merged, each with a pool size of 10, giving `--pool-sizes 100`)?
Thanks for the help.
Hi @vilperte,
> I have tested the new parameter a bit more and it works smoothly. I also tried passing two distinct VCFs together with `--sample-group-merge-table-file`, and it works well.
Nice, happy to hear! No more high memory usage then? Even without the sub-sampling, it should be good now. If you have only tested with sub-sampling, would you mind briefly checking that the memory is not as high any more even without sub-sampling? Just want to make sure that I actually fixed your issue on your data, and not just mine :-)
> I do have one question when using this parameter. At the moment, I am setting `--pool-sizes` to the individual value for each sample (10 in my case). I was wondering if that is the correct way, or whether it should be adjusted to reflect the final merged sample (e.g. 10 samples to be merged, each with a pool size of 10, giving `--pool-sizes 100`)?
Ah, if you are just using `--pool-sizes` with a single value, that value is applied to all final samples that are used in the computation, without modification. Good point, I'll amend the documentation of the `--sample-group-merge-table-file` option to clarify that. So you'd need to use the sum of all pool sizes for each merged group. I assume that not all groups have the same number of samples in them, and hence not the same sum of pool sizes. In that case, it is probably easiest to provide a file instead of a single number for `--pool-sizes`. That file just needs to contain the group names (same as in the `--sample-group-merge-table-file`) and the sum of pool sizes of each group.
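A sketch of what such a pool sizes file could look like, using roughly 200 samples per group with a pool size of 10 each (the header names and separator shown here are illustrative; the `--pool-sizes` documentation has the exact format):

```
group	pool-size
GroupA	2000
GroupB	2000
```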
Hope that helps, and so long Lucas
I actually tested without the sub-sampling, and there were no high RAM spikes ;)
I expected that to be the case for the pool sizes, and even tested it, but thought I should clarify. Thanks again for the quick answer.
Cheers,
Vinicius
Awesome, thanks for the confirmation!
Shall we close this issue then? Feel free to re-open if needed, or start a new one if you have any further questions or suggestions :-)
Hi,
I am working with a dataset of pooled, naturally occurring populations, which has been processed with a variant caller (GATK). I have already calculated the different diversity measures per sample, but I was wondering if it would be possible to create groups of samples (e.g. 10 samples for group A and 10 samples for group B) and then calculate per-group diversity measures?
It would of course be possible to merge at the fastq or BAM level, but that would lead to a lot of extra preprocessing.
Thank you.