umccr / holmes

BAM fingerprint stack
MIT License

Run an all-pairs check against a subset of the fingerprints #5

Closed (andrewpatto closed 1 year ago)

andrewpatto commented 1 year ago

Whilst Holmes is designed to meet the 1:N fingerprint check (index v all the other samples), when it comes to investigations it would be useful to be able to get the real somalier all-pairs output.

Somalier all-pairs does not scale well to very large numbers of samples (10,000+), and certainly not when running on lambdas, so there will need to be a ceiling on the number of BAMs that can be compared in all-pairs mode (50?).
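For a rough sense of why a ceiling is needed: an all-pairs comparison over N samples covers N(N-1)/2 pairs, so the work grows quadratically with the number of BAMs. A minimal Python sketch of that arithmetic follows. The names `pair_count`, `check_all_pairs_request` and `MAX_ALL_PAIRS_BAMS` are hypothetical and not taken from the Holmes code, and the limit of 50 is only the figure floated in this issue.

```python
# Hypothetical ceiling on the number of BAMs accepted in all-pairs mode.
# 50 is only the figure floated in this issue, not a tested limit.
MAX_ALL_PAIRS_BAMS = 50


def pair_count(n: int) -> int:
    """Number of unordered pairs among n samples: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def check_all_pairs_request(bams: list[str], limit: int = MAX_ALL_PAIRS_BAMS) -> int:
    """Reject requests above the ceiling; otherwise return the number of pairs."""
    unique = list(dict.fromkeys(bams))  # drop duplicates, keep original order
    if len(unique) < 2:
        raise ValueError("need at least two distinct BAMs")
    if len(unique) > limit:
        raise ValueError(
            f"{len(unique)} BAMs exceeds the all-pairs ceiling of {limit}"
        )
    return pair_count(len(unique))


if __name__ == "__main__":
    for n in (50, 1_000, 10_000):
        print(f"{n:>6} samples -> {pair_count(n):>10,} pairs")
```

By this count 50 samples give 1,225 pairs, while 10,000 samples give 49,995,000, which is consistent with the concern that all-pairs mode needs a cap even though each fingerprint file is small.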

alexiswl commented 1 year ago

What are the memory requirements here? Somalier fingerprint files are pretty small.

andrewpatto commented 1 year ago

It's possibly much, much more than 50 that we can do on big lambdas now. I just remember reading an issue on the somalier GitHub where someone was saying they were running it on 10,000(?) samples and it was grinding their EC2 instances into the ground.

andrewpatto commented 1 year ago

https://github.com/brentp/somalier/issues/89

andrewpatto commented 1 year ago

@alexiswl can you get a feel for the mem/time requirements from files you have on your EC2? It would be good to understand something like: 50 pairs = 20MB and 1:00; 1000 pairs = 1GB and 3:35 (all numbers made up, but you get the drift).
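One way to collect numbers like these is a small wrapper that runs a command and records wall time and peak memory. The sketch below is only an illustration: `measure` is a hypothetical helper, it relies on the Unix-only `resource` module, and the demo command is a placeholder that would need to be swapped for the real somalier invocation over N fingerprint files.

```python
import resource
import subprocess
import sys
import time


def measure(cmd: list[str]) -> tuple[float, int]:
    """Run cmd and return (wall seconds, peak child RSS from getrusage).

    ru_maxrss is in kilobytes on Linux and bytes on macOS. It is the maximum
    over all child processes waited for so far, so use a fresh Python process
    per measurement (or run the smallest input first) to keep it meaningful.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the actual somalier
    # all-pairs command and the list of fingerprint files being tested.
    demo_cmd = [sys.executable, "-c", "pass"]
    seconds, peak_rss = measure(demo_cmd)
    print(f"wall time: {seconds:.2f}s, peak RSS: {peak_rss}")
```

Repeating this for a few sample counts (say 50, 500, 1000) would give the kind of table asked for above.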

andrewpatto commented 1 year ago

There is now a pairs step function that takes a list of BAMs and returns the all-pairs HTML from somalier. I have not tested what number of BAMs will start to hit lambda limits. I am expecting this to be used for investigating false negatives etc. (i.e. < 50 BAMs).