pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

Query a colored graph for presence/absence of queries in each color of the graph #39

Closed: ggautreau closed this issue 3 years ago

ggautreau commented 3 years ago

Hi,

I have an issue when using Bifrost (version 1.0.5) to query a colored graph for the presence/absence of queries in each color of the graph: the resulting file contains only a single column, although I have several colors in the graph.

Can you help me?

Thx and happy new year :)

Files used: https://drive.google.com/drive/folders/1tnTFPk2FAVL_Ax_qupVEHCASSAEIQ07_?usp=sharing

Steps to reproduce the issue:

1) Bifrost build --input-seq-file test.fa --output-file test --threads 8 --colors -v

Output:

KmerStream::KmerStream(): Start computing k-mer cardinality estimations
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 2320
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 522
CompactedDBG::build(): Estimated number of k-mers occurring twice or more: 2224
CompactedDBG::build(): Estimated number of minimizers occurring twice or more: 502
CompactedDBG::filter(): Closed all fasta/fastq files
CompactedDBG::filter(): Processed 77526 k-mers in 55 reads
CompactedDBG::filter(): Found 2235 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 16
CompactedDBG::construct(): Extract approximate unitigs
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 54 unitigs
CompactedDBG::construct(): After split (1/2): 53 unitigs
CompactedDBG::construct(): After split (2/2): 53 unitigs
CompactedDBG::construct(): Unitigs split: 0
CompactedDBG::construct(): Unitigs deleted: 1

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 51 unitigs
CompactedDBG::construct(): Joined 2 unitigs

CompactedDBG::write(): Writing graph to disk

DataStorage::write(): Writing colors to disk

2) Bifrost query -t 4 -e 0.99 -g test.gfa -f test.bfg_colors -q reads.fq -o presence_query -v

Output:

ColoredCDBG::read(): Reading graph.

CompactedDBG::read(): Reading graph from disk
KmerStream::KmerStream(): Start computing k-mer cardinality estimations

CompactedDBG::read(): Finished reading graph from disk
ColoredCDBG::read(): Reading colors.

DataStorage::read(): Reading color sets from disk
ColoredCDBG::read(): Joining unitigs to their color sets.
ColoredCDBG::search(): Querying graph.
CompactedDBG::search(): Found 47 queries in at least one color. 

3) head presence_query

Output:

query_name  test.fa
SRR2600371.3427906  0
SRR2600371.3427907  0
SRR2600371.3427908  0
SRR2600371.3427909  0
SRR2600371.3427910  0
SRR2600371.3427911  0
SRR2600371.3427912  0
SRR2600371.3427913  0
SRR2600371.3427914  0

There is only one column instead of one column for each color.
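A quick way to confirm how many color columns a query result contains is to read its header. The sketch below is illustrative only: it assumes the output is a whitespace-separated table whose first column is `query_name` and whose remaining columns are one per color, as in the `head` output above, and that neither query names nor color names contain whitespace. The function name `read_presence_matrix` is hypothetical and not part of Bifrost.

```python
def read_presence_matrix(path):
    """Parse a presence/absence table into (color_names, {query_name: [0/1, ...]}).

    Assumption: whitespace-separated columns, first column is the query name,
    every following column is one color (as in the `head` output shown above).
    """
    with open(path) as handle:
        header = handle.readline().split()
        colors = header[1:]  # everything after the "query_name" column
        rows = {}
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            rows[fields[0]] = [int(value) for value in fields[1:]]
    return colors, rows
```

For the file above, `colors` would be `["test.fa"]`, which shows that the graph holds a single color.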

GuillaumeHolley commented 3 years ago

Hi @ggautreau,

Happy new year :)

Bifrost considers that each input file is a different color. Since your graph was built from a single FASTA file, it has only one color.
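Since colors map to input files, one possible workaround for getting one color per sequence is to split the multi-record FASTA into one file per record before building the graph. The following is a minimal sketch under stated assumptions, not a Bifrost feature: the helper names `split_fasta` and `write_file_list`, the output file naming scheme, and the idea of writing a list of paths (one per line) are all illustrative. Check the Bifrost documentation for how multiple input files should be supplied to `Bifrost build` and how k-mers are filtered for each kind of input.

```python
import re
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each FASTA record to its own file and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # A new record starts: close the previous file, open a new one.
                    if handle is not None:
                        handle.close()
                    tokens = line[1:].split()
                    record_id = tokens[0] if tokens else "record"
                    # Replace characters that are unsafe in file names.
                    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
                    # The numeric prefix keeps names unique even if IDs repeat.
                    path = out_dir / f"{len(written) + 1:05d}_{safe_id}.fa"
                    handle = open(path, "w")
                    written.append(path)
                # Lines before the first header are skipped (handle is None).
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written


def write_file_list(paths, list_path):
    """Write one file path per line (hypothetical helper for collecting inputs)."""
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
```

For example, `split_fasta("test.fa", "split")` would produce one file per record in `split/`, each of which would then count as a separate color when the graph is built from those files instead of `test.fa`.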

Best, Guillaume

ggautreau commented 3 years ago

Sorry, I thought each sequence entry in my input FASTA file was treated as a color. Thank you for your explanation!

Best :)

GuillaumeHolley commented 3 years ago

No problem. Actually, I've been asked for this multiple times (one record = one color), so I'll add it to my to-do list.

Guillaume