sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to genome assembly. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/readmapping
MIT License
8 stars 6 forks source link

Make sure we are handling CRAM files correctly #111

Closed tkchafin closed 5 days ago

tkchafin commented 3 weeks ago

Description of feature

For any steps requiring a reference, we should be passing it or using REF_CACHE.

---- Some notes from JIRA ticket TOLIT-1916 on this topic ----

For using CRAM with Samtools there are a number of options:

  1. We can embed the reference information in the file itself (embed_ref) -- with also a few "sub-options" therein on how much information is embedded
  2. We can pass the reference alongside the file, which will require Samtools to then do some repeated computations on the file (possibly just re-doing the MD5 checksums in most cases but I am not sure) (-T/--reference)
  3. Pre-compute checksums on the genome and cache this information in REF_CACHE
  4. Barring these, Samtools will first check the path to reference in the header, and if none is found will attempt to download the reference from ENA.

Following some conversation over in Jira (TOLIT-1916), we decided to forgo option 1 as it means storing redundant reference information across files (in part defeating the purpose of using CRAM over BAM), and that we also do not like option 4 because (1) it can take a very long time and (2) server reliability issues.

So, we did some benchmarking of simple commands (sort, view, etc) and we found that for many operations, option 2 and 3 have nearly identical runtimes, with exception of some cases which presumably involve a LOT of reference lookups:

"I have done some tests comparing several samtools operations under the cases of a pre-computed cached reference vs. supplying a reference at runtime. For most common/simple operations (e.g., sort, view), it didn't really seem to matter... However for mpileup (which presumably involved a lot of repeated lookups against the reference?), it can make a MASSIVE difference (>100X faster to use a cache). I'm not certain why, but it seems that for most of samtools use cases we can likely get away with just passing the reference and ignoring the cache. — For the initial questions: "Do we really need to pre-compute the .dict file and pass it ? Why is samtools not doing it by itself ?" – No need to precompute "Do we need to run seq_cache_populate.pl ?" – If we want to use the cached/pre-processed reference option then yes, we could wrap this script in a nextflow module (would be a good candidate to add to nf-core/samtools if not there already), or check a given path (passed in config or taken from $REF_PATH/$REF_CACHE) to see if this has already been done "What REF_PATH do we need to set ? Note: in my experience, the automatic pulling from ENA is unreliable: either slow or down" – IMO we should mostly just ignore the remote download option and either use the REF_CACHE or pass the genome explicitly with -T/-reference. Samtools will also (I think) try to look up the genome if a local path is present in the header before falling back to a remote lookup, but we can't use Nextflow's file staging if we only pass the reference this way so it is better to pass it explicitly. For most of the tests I did, setting the local cache as REF_CACHE was equally efficient to passing the genome with -T/ -reference — In terms of "best practice" for using CRAM, I think the only reasonable options are to use the cache with REF_PATH/CACHE or pass the reference alongside each call on a CRAM file (for operations needing the reference). Whether or not these two options make a difference depends on the exact operation, which simple operations at least seeming not to be affected. As for the alternatives, I think allowing samtools to check the server is a bad option, and we also don't want to use the embed_ref options because this results in (1) redundant disc usage across files and (2) doesn't have any useful support in ENA for upload purposes anyways. So, most cases we probably can pass the reference and not bother with cache, but for instances where this is actually faster, we can build the cache in a module (if genome not already cached; otherwise just grab its path) and pass that (like is done [here|https://github.com/epi2me-labs/wf-basecalling/blob/master/main.nf]) "

--

SO, remaining things to do/ questions are:

  1. Are we certain that we are NOT allowing any current Samtools CRAM functions to attempt accessing the server (and is there a way to disable this)
  2. Is it worthwhile attempting to cache the genome and pass with REF_CACHE vs. passing directly? It would be good to get a better idea of what steps exactly Samtools is doing when you pass a reference (with -T/--reference) in the case of sort (same runtime as REF_CACHE) vs. mpileup (massively worse runtime).
  3. Ensure that we have implemented whatever determination is made in step 2

Given the operations done here likely have no advantage of implementing the cache, I suspect that we will be fine to pass the reference directly with -T/--reference, but it might be worth repeating the benchmarking tests on a larger genome (and also digging a bit deeper into what exactly Samtools is doing). If that ends up being the case, we already have ch_fasta so will just need to check we are passing this where needed

tkchafin commented 3 weeks ago

samtools_analysis.pdf

tkchafin commented 3 weeks ago

@reichan1998 As discussed in the last meeting, I have moved some discussion here on how we are handling CRAM files in the pipeline

reichan1998 commented 3 weeks ago

Thank you for moving this discussion over @tkchafin.

The current approach seems solid, especially with the focus on avoiding embedded references and using local paths or cached data. I agree that ensuring Samtools doesn’t attempt to access reference data from external servers is critical for maintaining efficiency and reliability. Regarding the next steps, I’ll double-check that no operations are falling back to server lookups. If there’s a way to explicitly disable this, we should definitely implement it. I’ll also investigate exactly what steps Samtools takes in the case of sort vs. mpileup, as mentioned. I agree with you on the importance of larger genome benchmarks—if we see any significant differences, we might need to adjust our strategy accordingly.

tkchafin commented 3 weeks ago

cram_tests.tar.gz here is the simple script I used to do some testing

reichan1998 commented 3 weeks ago

Thank you very much, Tyler! I will take a look at the script and run some tests on my end as well. I will get back to you with any findings or if I have any questions.

tkchafin commented 2 weeks ago

Hi @reichan1998 just wanted to confirm, you determined that no changes are needed here right?

reichan1998 commented 2 weeks ago

Yes I believe our approach to handling CRAM files is fine now, as no operations are likely to fall back to server lookups. Since the operations we're performing likely do not benefit from using the cache, it might be better to pass the reference directly using the -T/--referenceoption. Additionally, Samtools mpileup calculates Base Alignment Quality (BAQ) when a reference file is provided with the -f/--reference option, as explained here. This might cause the significant increase in runtime during mpileup. To disable BAQ calculation, the -B option should be used. Therefore, setting the local cache as REF_CACHE might be equally efficient to passing the genome with -T/ --reference. samtools_analysis_Chau.pdf

tkchafin commented 2 weeks ago

Excellent work, thanks for that (and for catching my mistake with mpileup!)

tkchafin commented 2 weeks ago

You mentioned that you have tried turning off the remote lookup behaviour by spoofing REF_PATH? I wonder if we can set this somehow in modules.config, maybe with beforeScript (https://www.nextflow.io/docs/latest/process.html#beforescript)?

reichan1998 commented 2 weeks ago

It's not your mistake at all, as --reference works just fine! Regarding REF_PATH, I temporarily modified the environment variables, setting REF_PATH or REF_CACHE to point to an invalid directory, and see if Samtools would throw an error indicating it couldn't find the reference. Since no error was thrown, I suspect that Samtools might not be using the remote lookup behavior. However, I will look into beforeScript and see what we can do.