ianholmeslab / bilby_encoder

BED-FASTA-BAM one-hot data encoder method utilizing state transition tuples.

Add Support for Genomes across FASTA files #13

Open ritster opened 3 days ago

ritster commented 3 days ago

Allow the user to specify multiple FASTA files, each containing part of the reference genome. Memory is an important consideration here: the current implementation builds an in-memory dictionary mapping each FASTA entry's ID to its sequence, which becomes intractable on small machines processing large genomes.
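One possible direction, sketched here only as an illustration: stream records from the files one at a time instead of building the full ID-to-sequence dictionary. The function name `iter_fasta_records` is hypothetical and not part of bilby_encoder. The sketch assumes plain-text (uncompressed) FASTA and that holding a single record in memory at a time is acceptable.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta_records(paths: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from several FASTA files in turn.

    Hypothetical helper: only the record currently being read is held in
    memory, rather than every record from every file.
    """
    for path in paths:
        seq_id: Optional[str] = None
        chunks: List[str] = []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                if line.startswith(">"):
                    # A new header closes the previous record, if any.
                    if seq_id is not None:
                        yield seq_id, "".join(chunks)
                    fields = line[1:].split()
                    # ID = first whitespace-delimited token after '>'
                    seq_id = fields[0] if fields else ""
                    chunks = []
                elif seq_id is not None:
                    # Sequence lines before any header are ignored.
                    chunks.append(line)
        # Emit the last record of this file.
        if seq_id is not None:
            yield seq_id, "".join(chunks)
```

A caller would loop with `for seq_id, seq in iter_fasta_records(["part1.fa", "part2.fa"]): ...` (file names are placeholders). A single chromosome-sized record still has to fit in memory, so this only helps when the genome is split into many entries.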

cmdcolin commented 3 days ago

Not sure if it helps, but a common approach is to use an 'indexed FASTA' to extract a genomic region from a larger FASTA file without reading the whole file: https://pypi.org/project/pyfaidx/ ("faidx" comes from "samtools faidx", the samtools command for FASTA indexing). This could be an alternative to needing to accept multiple FASTA files.
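To illustrate the idea behind an indexed FASTA, here is a simplified standard-library sketch. It is not pyfaidx's or samtools' implementation. It records, per entry, the byte offset of the first base and the line layout, then seeks directly to the requested region. The names `IndexEntry`, `build_index`, and `fetch` are hypothetical, and the sketch assumes every sequence line within a record (except the last) has the same length and that records contain no blank lines.

```python
from typing import Dict, NamedTuple


class IndexEntry(NamedTuple):
    length: int      # total number of bases in the record
    offset: int      # byte offset of the record's first base
    line_bases: int  # bases per full sequence line
    line_bytes: int  # bytes per full sequence line, newline included


def build_index(path: str) -> Dict[str, IndexEntry]:
    """Scan the file once and record where each sequence starts."""
    index: Dict[str, IndexEntry] = {}
    name = None
    length = offset = line_bases = line_bytes = 0
    pos = 0  # byte position of the start of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = IndexEntry(length, offset, line_bases, line_bytes)
                fields = raw[1:].split()
                name = fields[0].decode() if fields else ""
                length = line_bases = line_bytes = 0
                offset = pos + len(raw)  # sequence begins right after the header
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    # Layout is taken from the first sequence line.
                    line_bases, line_bytes = bases, len(raw)
                length += bases
            pos += len(raw)
    if name is not None:
        index[name] = IndexEntry(length, offset, line_bases, line_bytes)
    return index


def fetch(path: str, index: Dict[str, IndexEntry], name: str, start: int, end: int) -> str:
    """Return bases [start, end) (0-based, half-open) of record `name`."""
    entry = index[name]
    start, end = max(0, start), min(end, entry.length)
    if start >= end:
        return ""

    def byte_pos(i: int) -> int:
        # Whole lines before base i, plus its column within its line.
        return entry.offset + (i // entry.line_bases) * entry.line_bytes + i % entry.line_bases

    first, last = byte_pos(start), byte_pos(end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        data = handle.read(last - first + 1)
    # Drop the newlines that fall inside the requested span.
    return data.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Only the small index lives in memory; each `fetch` reads just the bytes covering the requested region. In practice a maintained library such as pyfaidx would be the safer choice, since it handles the edge cases this sketch ignores.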