Open luizirber opened 7 years ago
Is this the reason for the SBT behaviour in #133? I.e., that sbt_search returns the same signature multiple times?
@viehwegerlib it's possible (@ctb didn't send me the files for testing yet =P)
shame >> @ctb
;)
@luizirber should I send you the testfiles? (would need some address)
I think this is not the behavior we expect, at least not from the command line :). Perhaps the Python API should balk at overwriting an existing file?
Also, if two signatures are identical, we should not insert them twice (from the command line) but rather spit out a warning.
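To make the suggestion concrete, here is a minimal sketch of the "warn instead of inserting twice" idea. It is not sourmash code: `unique_by_md5` and the `(md5, name)` tuples are hypothetical stand-ins for however the command line collects signatures before building the SBT.

```python
import warnings


def unique_by_md5(signatures):
    """Yield each (md5, name) pair once; warn on every repeated md5.

    Hypothetical helper for illustration only, not part of sourmash.
    """
    seen = set()
    for md5, name in signatures:
        if md5 in seen:
            warnings.warn(f"skipping duplicate signature {name} (md5 {md5})")
            continue
        seen.add(md5)
        yield md5, name


# The same signature file passed twice, as in `sourmash index ... a.sig a.sig`.
sigs = [
    ("c11126d0591db94cd3d1c8568499375f", "podar-ref/1.fa.sig"),
    ("c11126d0591db94cd3d1c8568499375f", "podar-ref/1.fa.sig"),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = list(unique_by_md5(sigs))
```

With this filter in front of the insert loop, only one leaf would be added and the user would see one warning for the repeat.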
+1
This appears to still happen:
% sourmash index -k 31 blah podar-ref/1.fa.sig podar-ref/1.fa.sig
...
== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading 2 files into SBT
loaded 2 sigs; saving SBT under "blah"
% jq . < blah.sbt.json
{
  "d": 2,
  "version": 4,
  "storage": {
    "backend": "FSStorage",
    "args": {
      "path": ".sbt.blah"
    }
  },
  "factory": {
    "class": "GraphFactory",
    "args": [
      1,
      100000,
      4
    ]
  },
  "nodes": {
    "0": {
      "filename": "internal.0",
      "name": "internal.0",
      "metadata": {
        "min_n_below": 1478
      }
    },
    "1": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    },
    "2": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    }
  }
}
% ls -al .sbt.blah/
total 160
drwxr-xr-x 4 t staff 128 Apr 15 07:36 .
drwxr-xr-x 322 t staff 10304 Apr 15 07:36 ..
-rw-r--r-- 1 t staff 26036 Apr 15 07:36 c11126d0591db94cd3d1c8568499375f
-rw-r--r-- 1 t staff 50042 Apr 15 07:36 internal.0
However, search does not find it more than once:
% sourmash search podar-ref/1.fa.sig blah
== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
selecting default query k=31.
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.
1 matches:
similarity match
---------- -----
100.0% CP001941.1 Aciduliprofundum boonei T469, complete genome
This is due to code introduced in #556 (the Index base class refactor) that collapses repeat md5s across a database search - see code.
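As a rough illustration of what "collapsing repeat md5s" means for search output, the sketch below keeps only the first hit per md5. It is an assumption about the general technique, not the code from #556; `collapse_by_md5` and the `(similarity, md5, name)` tuples are hypothetical.

```python
def collapse_by_md5(results):
    """Keep the first search hit for each md5, preserving input order.

    Hypothetical helper: `results` is a list of (similarity, md5, name).
    """
    seen = set()
    unique = []
    for similarity, md5, name in results:
        if md5 not in seen:
            seen.add(md5)
            unique.append((similarity, md5, name))
    return unique


# Two leaves in the tree that hold the same signature produce two raw hits.
raw_hits = [
    (1.0, "c11126d0591db94cd3d1c8568499375f", "CP001941.1 Aciduliprofundum boonei T469"),
    (1.0, "c11126d0591db94cd3d1c8568499375f", "CP001941.1 Aciduliprofundum boonei T469"),
]
matches = collapse_by_md5(raw_hits)
```

This would explain why the tree can hold the leaf twice while `sourmash search` still reports a single match.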
See also #884 which fixes a separate problem I introduced :eyes: where the filename was based not on signature md5sum but rather on signature name, so if you had two different signatures with the same name, one overwrote the other.
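The overwrite problem can be sketched with a plain dictionary standing in for the storage directory. Everything here is illustrative: `save_all` and the example payloads are hypothetical, and `hashlib.md5` over the raw payload is only a stand-in for the signature md5sum.

```python
import hashlib


def save_all(signatures, key):
    """Store each payload under key(name, payload); later writes overwrite."""
    storage = {}
    for name, payload in signatures:
        storage[key(name, payload)] = payload
    return storage


# Two different signatures that happen to share a name.
signatures = [("sample", b"hashes-A"), ("sample", b"hashes-B")]

# Keyed by name: the second signature silently replaces the first.
by_name = save_all(signatures, key=lambda name, payload: name)

# Keyed by content md5: both signatures survive under distinct filenames.
by_md5 = save_all(
    signatures,
    key=lambda name, payload: hashlib.md5(payload).hexdigest(),
)
```

Keying by md5 also means that identical signatures map to the same filename, which would be consistent with the single leaf file in the `.sbt.blah/` listing above.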
Is this now resolved by https://github.com/dib-lab/sourmash/pull/994?
SBT allows repeatedly inserting the same leaf:
The generated JSON
A graphical representation:
Note that we have a DAG in this case, not a tree anymore... The figure is misleading; in fact, the in-memory representation will be more like this:
but the content of each `name` node will come from the same signature.
This is the content of `.sbt.test`:
The question is: is this the behavior we expect?