Closed SergejN closed 1 year ago
Got it, I will also fix this. Thanks for these, it helps!
FYI you can use a larger reference by specifying -I 8G
for example in minimap2.
Out of curiosity what is your use case?
thanks! I didn't know about that option. In this particular case, it's the Mexican axolotl, therefore, the genome is 32Gb
Ahh that beast. What is your total coverage and are you working with the Marvel(not sure I am remembering that right?) assembly?
If so it might be worth trying Shasta, at least in human assemblies it has improved repeat resolution in the samples I have seen.
Thanks! I'll take a look.
Along those lines, I'm not sure I should open yet another issue because it is something that only affects me and a bunch of other crazy guys working with large genomes probably, but the version of samtools
(1.9) does not support proper sorting of BAM files (samtools sort
). If I remember correctly, they only fixed it in v1.10
. Could you also update samtools
? I did it locally by replacing the binary.
You can leave it here, and I will try and get around to it.
Sorry about all these issues.... it is kinda clear that myself and others have not been using SDA because of HiFi...
I asked about your coverage because if you only have ~30X or so, then I would not bother running SDA as it requires ~40X and ideally ~50X.
No worries, it's all fine. I often run into problems like this and I am very glad if the author is responsive. So kudos to you! As for the coverage - right now, I had a single test file with ~1M reads in it, so the coverage is way way below 30X :) the idea was to make a test run and see what happens in terms of resources, runtime and so on.
Alright I pushed some changes to master that should fix some of the issues: 1) minimap2 now takes (fa/fq/fasta/fastq)(.gz)? 2) repeatmasker now uses lib if species is a file 3) the size of the index with -I in minimap2 is set depending on the size of the reference 4) I merged your sbatch suggestion
With regardess to samtools, I am not going to update now because it may change a lot of dependancies. However what I have done is change the input for SDA denovo
from SDA denovo --fofn {fofn} ....
to SDA denovo --input {fofn|bam} ....
. So you can now pass a pre aligned and sorted bam and skip the alignments steps in SDA to avoid the samtools sort issue. Not a fix yet, but it may be helpful, particularly if you have already done ONT alignments for your 32GB genome.
Awesome! thanks a lot!
No problem, thanks for hunting these things down!
I am not going to close anything yet, until I run a full set of tests.
Also I am relatively certain that your genome will break all my resource request presets. So it is on my TODO list to implement a resource configuration file that can be modified for larger genomes.
That would be very nice! It's much easier to change resource requirements in a separate file rather than looking for them in the pipeline code.
Dear @mrvollger ,
sorry for yet another feature request / bug report. Currently, SDA does not support reference genomes larger than 4Gb, although
minimap2
has the switch--split-prefix
that allows it to handle such references. I would suggest to either check if the overall reference length is greater than 4Gb and then set the switch or set it globally for any reference. However, according to theminimap2
authors, this switch makes it impossible to set the MD flag in the BAM file (see https://github.com/lh3/minimap2/issues/548), if that's required at some point. Either way, I solved it by adding the switch to the rule as follows around here: https://github.com/mrvollger/SDA/blob/1fbe948f3d8cde6ae6b8c49b33f4220053755718/SDA.smk#L111best Sergej