Support for large references

SergejN commented 3 years ago

Dear @mrvollger ,

sorry for yet another feature request / bug report. Currently, SDA does not support reference genomes larger than 4Gb, although minimap2 has the switch --split-prefix that allows it to handle such references. I would suggest to either check if the overall reference length is greater than 4Gb and then set the switch or set it globally for any reference. However, according to the minimap2 authors, this switch makes it impossible to set the MD flag in the BAM file (see https://github.com/lh3/minimap2/issues/548), if that's required at some point. Either way, I solved it by adding the switch to the rule as follows around here: https://github.com/mrvollger/SDA/blob/1fbe948f3d8cde6ae6b8c49b33f4220053755718/SDA.smk#L111

minimap2 \
    -ax map-ont \
    --eqx -L \
    -R '@RG\\tID:MINIMAP\\tSM:FAKE_SAMPLE\\tPL:ONT' \
    -t {threads} \
    -m {MINALN} -r {BANDWIDTH} \
        --split-prefix splitref \
    {input.ref} {input.reads} | \
    samtools view -u -F 2308 - | \
    samtools sort -@ {threads} -m 4G -T {TMPDIR}/pbmm2 -o {output}

best Sergej

mrvollger commented 3 years ago

Got it, I will also fix this. Thanks for these, it helps!

FYI you can use a larger reference by specifying -I 8G for example in minimap2.

Out of curiosity what is your use case?

SergejN commented 3 years ago

thanks! I didn't know about that option. In this particular case, it's the Mexican axolotl, therefore, the genome is 32Gb

mrvollger commented 3 years ago

Ahh that beast. What is your total coverage and are you working with the Marvel(not sure I am remembering that right?) assembly?

If so it might be worth trying Shasta, at least in human assemblies it has improved repeat resolution in the samples I have seen.

SergejN commented 3 years ago

Thanks! I'll take a look.

Along those lines, I'm not sure I should open yet another issue because it is something that only affects me and a bunch of other crazy guys working with large genomes probably, but the version of samtools (1.9) does not support proper sorting of BAM files (samtools sort). If I remember correctly, they only fixed it in v1.10. Could you also update samtools? I did it locally by replacing the binary.

mrvollger commented 3 years ago

You can leave it here, and I will try and get around to it.

Sorry about all these issues.... it is kinda clear that myself and others have not been using SDA because of HiFi...

I asked about your coverage because if you only have ~30X or so, then I would not bother running SDA as it requires ~40X and ideally ~50X.

SergejN commented 3 years ago

No worries, it's all fine. I often run into problems like this and I am very glad if the author is responsive. So kudos to you! As for the coverage - right now, I had a single test file with ~1M reads in it, so the coverage is way way below 30X :) the idea was to make a test run and see what happens in terms of resources, runtime and so on.

mrvollger commented 3 years ago

Alright I pushed some changes to master that should fix some of the issues: 1) minimap2 now takes (fa/fq/fasta/fastq)(.gz)? 2) repeatmasker now uses lib if species is a file 3) the size of the index with -I in minimap2 is set depending on the size of the reference 4) I merged your sbatch suggestion

With regardess to samtools, I am not going to update now because it may change a lot of dependancies. However what I have done is change the input for SDA denovo from SDA denovo --fofn {fofn} .... to SDA denovo --input {fofn|bam} ..... So you can now pass a pre aligned and sorted bam and skip the alignments steps in SDA to avoid the samtools sort issue. Not a fix yet, but it may be helpful, particularly if you have already done ONT alignments for your 32GB genome.

SergejN commented 3 years ago

Awesome! thanks a lot!

mrvollger commented 3 years ago

No problem, thanks for hunting these things down!

I am not going to close anything yet, until I run a full set of tests.

Also I am relatively certain that your genome will break all my resource request presets. So it is on my TODO list to implement a resource configuration file that can be modified for larger genomes.

SergejN commented 3 years ago

That would be very nice! It's much easier to change resource requirements in a separate file rather than looking for them in the pipeline code.

mrvollger / SDA

Support for large references #16