samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

Our FastaSequenceIndex too large to fit in memory #1630

Closed wbazant closed 1 year ago

wbazant commented 1 year ago

We are building a REST service for serving sequences to users (https://github.com/VEuPathDB/service-sequence-retrieval/) - we'd like it to provide access to all sequences across VEuPathDB sites.

Our FastaSequenceIndex objects have a large memory footprint. It's not too large, for example the index to all our protein sequences is 205MB, but it is more than we want to keep in RAM - we estimate the needs of heap memory of seven times the disk size of the index. We have a few of these, the number of sequences will still grow as VEuPathDB accrues organisms, and since the sequences are read from disk anyway, there's not much speed-up gained by an in-memory index.

We'll have to do some work so that the index can be read from disk when keyed on contig name. We can do it by subclassing FastaSequenceIndex ( public class CustomFastaSequenceIndex extends FastaSequenceIndex etc., see CustomFastaSequenceIndex.java.txt for draft) or providing a solution in htsjdk itself - extracting an interface from FastaSequenceIndex, and providing another implementation.

If we added this feature, would you consider merging - is there value in us trying to solve this problem for all htsjdk users, or is this too niche a use case? If yes, would you mind a dependency on some sort of embedded database library - our ideas so far were a SortedTableMap from MapDB, or a SQLite table?

Secondly, do you have any advice or thoughts on what would be good? We had another idea - keep the index in memory, but lower its footprint - but eventually we'd hit the limit anyway.

lbergelson commented 1 year ago

I have a bunch of questions to try to understand your use case and issue better.

  1. Is the idea that you have a very large set of FASTA files, each of which represents a single protein sequence, and you want to deliver them over a rest api to users?
    • Do you require random access to the fasta files? I.e. do people ask for 'position 50-2500 of protein X'?
      1. Is the entire set of files represented by a single fasta index in memory and a single fasta file on disk?
    • Is there any logical subdivision your files so you could keep only part of the set index in memory? (organism? )
      1. How do you look up a sequence? By chromosome? By protein name? By some sort of hash value? Are they're multiple sequences per identifier?
      2. I don't know much about protein sequence representation, is FASTA the standard way to represent the? Would protein CRAM or BAM make sense?
      3. Are you aware of the htsget and refget protocols which try to solve similar problems (distributing fasta references, vcfs, bams/crams over REST). It might make sense to try to align your api with them since they're a standard in the reference fasta space. (although sadly htsjdk doesn't yet support refget...)

And to answer your questions.

This sounds like a fairly niche use case so I dont' think we'd want to go WAY out of our way to support it, but if you're running into the problem chances are someone else out there is too, and we like to help as many use cases as we can.

I think extracting an interface from FastaSequenceIndex is a good idea rather than a direct subclass, FastaSequenceIndex isn't necessarily designed well as a base class. I would be willing to accept an extracted interface although it would be a breaking change so it might take a bit of time to work into a release.

If it was essentially interchangeable with the existing version but just loaded chunks of index lazily I don't see why we couldn't include it in htsjdk. (provided it had sufficient tests and wasn't insanely complicated.). Maybe someone else out there could benefit.

We would not accept additional dependencies in order to support this. It doesn't sound like it's anything so heavy weight that it couldn't be handled (maybe not quite as optimally) by java built ins.

One possible solution if you don't mind lots of disk access would be to sort your fasta sequence files by name and then you could implement a binary search on the index file to look something up instead of having to keep an index in memory. Another option might be to split your sequences into many independent files (assuming they're not already) and then loading the smaller indexes lazily on demand. That would obviously depend on the access patterns you want to support.

wbazant commented 1 year ago

Thank you! I'll get back to my team, and we'll think about it.

We have a few kinds of sequences - genomic sequence, proteins, some secondary stuff like ESTs - and we want to move accessing them from our per-component-site relational databases where they currently live as text blobs, to one place. We will then have our sites query that place in a micro-services sort of approach.

The API is not directly for users, they interact with a website to express queries like "give me 100 bases from the 5' end of each Plasmodium falciparum gene". Our webapp will look up the locations, express them as a list of coordinates with sequence description, pass them to the service, and from there to the user.

We'll probably use .bed as interchange format - contig, query start, query end, strand, description - in htsjdk the first three are conveniently arguments to AbstractIndexedFastaSequenceFile.getSubsequenceAt.

The naive design was a single fasta index in memory and a single fasta file on disk, per sequence type - but we'll now revisit it. But we do want the fasta format - I think CRAM/BAM is for representing alignments and want the references.

As far as solutions, I'll keep in mind that if we build it well and in Java we could then potentially include it in htsjdk. I need to think about what to actually do - I am imagining an on-disk structure that is prepared to be efficiently accessed.

wbazant commented 1 year ago

Right, we've stubbed out the index class: https://github.com/VEuPathDB/service-sequence-retrieval/blob/main/src/main/java/org/veupathdb/service/sr/reference/FastaSequenceIndexStub.java

and we have a custom source of each line in the index, which we use a SQLite database for. It's a tiny bit against the OO paradigm but I wouldn't touch the FastaSequenceIndex class on the library level - it's a sensible implementation overall, supports many things that people will want to do with the sequences.

Thanks for your input on this!

lbergelson commented 1 year ago

@wbazant I'm glad you found a good solution that works for you! Thanks for getting in touch.