iqbal-lab-org / cobs

COBS - Compact Bit-Sliced Signature Index (for Genomic k-Mer Data or q-Grams)
https://panthema.net/cobs
MIT License

COBS for 600K genomes #8

Closed davidmaimoun closed 8 months ago

davidmaimoun commented 2 years ago

Hi, I am in charge of finding the presence of specific genes in 600,000 Salmonella genomes. I used COBS on a few genomes for training, but I don't really understand the output. I copied a subsequence (55 bp) from one of my genomes and ran COBS to see if it would find it. In the output I got 24 (see below). And when I choose a bigger subsequence, sometimes it doesn't find it at all.

Another issue: how can I see whether my query matches fully or only partially?

I ran these commands:

cobs compact-construct index.cobs_compact
cobs query -i index.cobs_compact

--- end of document list (5 entries) ---
documents: 5
minimum 31-mers: 2811023
maximum 31-mers: 2874904
average 31-mers: 2834688
total 31-mers: 14173442
DIE: Output file exists, will not overwrite without --clobber @ /opt/conda/conda-bld/cobs_1646087618998/work/cobs/construction/compact_index.cpp:213
terminate called without an active exception

SRR18349609 24
SRR18349610 24
SRR18349611 24

TIMER info=search hashes=9.929e-06 io=0.000567883 total=0.000577812

Query length 55

I'd really appreciate your help

Thank you!

davidmaimoun commented 2 years ago

Also, is it possible to get more information in the output, like e/p-values, location, etc.? Thank you for everything!

iqbal-lab commented 2 years ago

Sorry for the slow reply, @leoisl is on vacation and I got distracted. This part of your output

DIE: Output file exists, will not overwrite without --clobber @ /opt/conda/conda-bld/cobs_1646087618998/work/cobs/constru

suggests cobs quit early because your output file already existed (it does not want to overwrite one of your files). But the subsequent output suggests it carried on. Can I ask, does this still happen if you make sure the output file does not exist?
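A minimal sketch of guarding against this before building an index, assuming you drive COBS from a Python script. The helper name `prepare_output` and its `overwrite` parameter are hypothetical and not part of COBS; the function only checks the file system and does not call `cobs` itself.

```python
from pathlib import Path


def prepare_output(path, overwrite=False):
    """Make sure an index output path is free before construction.

    Hypothetical helper: raises if the file exists and overwrite is False,
    otherwise deletes the stale file so a fresh index can be written.
    """
    target = Path(path)
    if target.exists():
        if not overwrite:
            raise FileExistsError(f"{target} already exists; remove it first")
        target.unlink()  # drop the stale index file
    return target
```

Calling `prepare_output("index.cobs_compact", overwrite=True)` before construction would remove a leftover index from an earlier run; the error message above indicates that `--clobber` is the tool's own switch for overwriting.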

leoisl commented 2 years ago

Hello,

output format is:

*<query_header> <number_of_hits>
<hit_1_header> <number_of_kmer_hits>
<hit_2_header> <number_of_kmer_hits>
...
<hit_n_header> <number_of_kmer_hits>
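As a rough illustration of reading this format, here is a minimal parsing sketch. The function name `parse_cobs_output` is hypothetical, and the sketch assumes that fields are separated by whitespace and that every query block starts with a `*` line as described above. Lines that do not end in an integer (such as the `TIMER` line) are skipped, as are hit lines that appear before any `*` header.

```python
def parse_cobs_output(text):
    """Parse query output into {query_header: {document: kmer_hits}}.

    Assumes the layout described above: a '*' header line per query,
    followed by one '<document> <count>' line per hit.
    """
    results = {}
    current = None  # hits of the query block currently being read
    for raw in text.splitlines():
        fields = raw.strip().rsplit(None, 1)
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # blank lines, timer lines, log messages
        name, count = fields[0], int(fields[1])
        if name.startswith("*"):
            # New query block; the count here is the number of hits listed.
            current = results.setdefault(name[1:], {})
        elif current is not None:
            current[name] = count
    return results
```

For a block whose header is `*query1 3` (a made-up query name) followed by the three `SRR...` lines from the output above, this returns one entry for `query1` that maps each document to its k-mer hit count.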

I don't think you have access to e/p-values or any other stats. You also don't get the location of these hits in the references; COBS only finds the presence of queries in a set of references. For location and alignment, you will need to align the queries to the references using, e.g., minimap2, but COBS can help you filter out which references you need to align to.
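To judge whether a query matched fully or only partially, one option is to compare the reported k-mer hits with the number of k-mers in the query, which is `length - k + 1` for a sequence of a given length. The sketch below assumes k = 31 (taken from the "31-mers" lines in the construction log above). The function names and the 0.8 cutoff are hypothetical choices for illustration, not COBS defaults.

```python
def kmer_count(query_length, k=31):
    """Number of k-mers in a sequence of the given length (0 if too short)."""
    return max(query_length - k + 1, 0)


def candidate_references(hits, query_length, k=31, min_fraction=0.8):
    """Return {document: fraction of query k-mers hit} above a cutoff.

    hits maps document names to the k-mer hit counts reported per document.
    min_fraction is an arbitrary illustrative threshold.
    """
    total = kmer_count(query_length, k)
    if total == 0:
        return {}  # query shorter than k: no k-mers to look up
    fractions = {doc: n / total for doc, n in hits.items()}
    return {doc: f for doc, f in fractions.items() if f >= min_fraction}
```

For the 55 bp query above, `kmer_count(55)` is 25, so 24 hits means 24 of the 25 query k-mers were reported in each document. The documents that pass the cutoff are the ones worth passing on to an aligner for location and alignment details.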

PS: sorry, I didn't see this issue as I was subscribed only to https://github.com/bingmann/cobs, not this repo

davidmaimoun commented 2 years ago

Yes, my output file already existed, and it works now that I delete it before running COBS again.

Thank you very much guys it was very helpful; now I know why it didn't work, and understand the output.

iqbal-lab commented 8 months ago

Sorry, I should have closed this.