sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
473 stars 80 forks source link

Conflicting file names when saving SBT #145

Open luizirber opened 7 years ago

luizirber commented 7 years ago

SBT allows repeatedly inserting the same leaf:

from sourmash_lib.sbt import SBT, GraphFactory
from sourmash_lib.sbtmh import SigLeaf
from sourmash_lib import signature

sig = next(signature.load_signatures('demo/SRR2060939_1.fastq.gz.sig'))
a = SigLeaf('name', sig)
tree = SBT(factory)
tree = SBT(GraphFactory(31, 1e5, 4)
tree.add_node(leaf)
tree.save('test')

The generated JSON

{
    "version": 2,
    "nodes": {
        "0": {
            "name": "internal.0",
            "filename": ".sbt.test/test.internal.0.sbt"
        },
        "1": {
            "name": "internal.1",
            "filename": ".sbt.test/test.internal.1.sbt"
        },
        "2": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        },
        "3": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        },
        "4": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        }
    },
    "d": 2
} 

A graphical representation: tree

Note that we have a DAG in this case, not a tree anymore... The figure is misleading, in fact there memory representation will be more like this: tree but the content of each name node will come from the same signature.

This is the content of .sbt.test:

.sbt.test/
├── test.internal.0.sbt
├── test.internal.1.sbt
└── test.name.sbt

The question is: is this the behavior we expect?

phiweger commented 7 years ago

Is this the reason for the SBT behaviour in #133? I.e., that sbt_search returns the same signature multiple times?

luizirber commented 7 years ago

@viehwegerlib it's possible (@ctb didn't send me the files for testing yet =P)

phiweger commented 7 years ago

shame >> @ctb

;)

phiweger commented 7 years ago

@luizirber should I send you the testfiles? (would need some address)

ctb commented 7 years ago

I think this is not the behavior we expect, at least not from the command line :). Perhaps the Python API should balk at overwriting an existing file?

Also, if two signatures are identical, we should not insert them twice (from the command line) but rather spit out a warning.

phiweger commented 7 years ago

+1

ctb commented 4 years ago

This appears to still happen:

% sourmash index -k 31 blah podar-ref/1.fa.sig podar-ref/1.fa.sig
...
== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 2 files into SBT

loaded 2 sigs; saving SBT under "blah"

% jq . < blah.sbt.json
{
  "d": 2,
  "version": 4,
  "storage": {
    "backend": "FSStorage",
    "args": {
      "path": ".sbt.blah"
    }
  },
  "factory": {
    "class": "GraphFactory",
    "args": [
      1,
      100000,
      4
    ]
  },
  "nodes": {
    "0": {
      "filename": "internal.0",
      "name": "internal.0",
      "metadata": {
        "min_n_below": 1478
      }
    },
    "1": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    },
    "2": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    }
  }
}
% ls -al .sbt.blah/
total 160
drwxr-xr-x    4 t  staff    128 Apr 15 07:36 .
drwxr-xr-x  322 t  staff  10304 Apr 15 07:36 ..
-rw-r--r--    1 t  staff  26036 Apr 15 07:36 c11126d0591db94cd3d1c8568499375f
-rw-r--r--    1 t  staff  50042 Apr 15 07:36 internal.0
ctb commented 4 years ago

However, search does not find it more than once --

% sourmash search podar-ref/1.fa.sig blah

== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

1 matches:
similarity   match
----------   -----
100.0%       CP001941.1 Aciduliprofundum boonei T469, complete genome

This is due to code introduced in #556 (the Index base class refactor) that collapses repeat md5s across a database search - see code.

ctb commented 4 years ago

See also #884 which fixes a separate problem I introduced :eyes: where the filename was based not on signature md5sum but rather on signature name, so if you had two different signatures with the same name, one overwrote the other.

ctb commented 4 years ago

is this now resolved by https://github.com/dib-lab/sourmash/pull/994?

luizirber commented 4 years ago

994 solves the "storing duplicates" part, but not the "you are adding a duplicate, is this really what you want?" problem.