Closed tmaklin closed 1 year ago
I forgot to mention that this happens after fulgor enters the following step:
building minimizers MPHF (PTHash) with 8 threads...
Hi, yes it means there are more than 1B minimizer for m=20, hence the minimal perfect hash function use in SSHash (which is PTHash) would require 128-bit hash codes to be safe. It should be enough to change the default hasher in SSHash for minimizers, here: https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L51.
Should you have some update on the Themisto index, I'd love to know. Thanks!
thanks for the quick reply! I think I changed the wrong line (https://github.com/jermp/sshash/blob/master/include/kmer.hpp#L6), will let you know if this works.
Just testing fulgor vs themisto v3 with mSWEEP for a certain downstream application for now.
Thanks for the info @tmaklin :). One thing (of which I'm certain you are aware) is that the current approaches for reporting the pseudoalignment results are far from space optimal.
I'm aware this is something you've worked on in alignment-writer. If it makes sense to design and converge on a more standard and compact output format for these tools, this is something that we'd certainly be interested in. For example, we have a binary format (the RAD) format that we use in alevin-fry that addresses a variant of this problem. However, it would be nice to generalize and to understand what information makes sense for different use cases.
Cheers, Rob
Indexing the data overnight worked so closing this. Thanks!
Re the file format, a standardized and compact output for all the different tools sounds great! Alignment-writer (think I need to figure out a better name) is a wrapper around BitMagic and achieves anything from 10x to 100x compression on the test cases I used while developing but the efficiency naturally depends on the complexity of the alignment. I'll be adding support for the format fulgor currently uses soon.
Some issues I've noticed with the formats while developing and using the various tools, roughly in order of headaches caused
Would be nice to discuss/work on this at some point.
Hi @tmaklin, yes, a common alignment format would be very nice. Happy to work on this together if you like.
Likewise. I have some thoughts on this as well :). Shall we open a separate issue for it? Or, even better, @jermp, if you can enable “discussions” on this repo.
Good idea, discussions enabled!
Hi, I'm trying to index a large and diverse collection of bacterial genomes (~300k sequences from ~4000 different species) but keep getting the following error:
I found a fix for this in sshash (https://github.com/jermp/sshash/issues/16) but even after uncommenting the relevant line in a fresh clone of fulgor after running
git submodule update --init --recursive
I keep getting the same error.