monarch-initiative / curategpt

LLM-driven curation assist tool (pre-alpha)
https://monarch-initiative.github.io/curategpt/
BSD 3-Clause "New" or "Revised" License
60 stars 11 forks source link

seems to be using duckdb even when chromadb is requested (indexing JSON) #64

Closed turbomam closed 1 month ago

turbomam commented 1 month ago
ls *.json -l

-rw-rw-r-- 1 mark mark 259276 Aug 23 14:47 ncbi_metadata.samples_100.json

see snippet in next post

poetry run curategpt index \
    --collection ncbi_biosamples_100 --database-type chromadb ncbi_metadata.samples_100.json 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mark/.cache/pypoetry/virtualenvs/curate-gpt-Vu3W-JcT-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/curate-gpt-Vu3W-JcT-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/curate-gpt-Vu3W-JcT-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mark/.cache/pypoetry/virtualenvs/curate-gpt-Vu3W-JcT-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/curate-gpt-Vu3W-JcT-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/gitrepos/curate-gpt/src/curate_gpt/cli.py", line 308, in index
    if os.path.isdir(path) & (database_type == "duckdb"):
  File "/home/mark/.pyenv/versions/3.10.13/lib/python3.10/genericpath.py", line 42, in isdir
    st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
turbomam commented 1 month ago
head -n 50 ncbi_metadata.samples_100.json
[{
  "_id": {
    "$oid": "66c60d3e7329fb22ad1c3d39"
  },
  "BioSample": {
    "Status": {
      "when": "2013-08-05T10:18:49",
      "status": "live"
    },
    "Owner": {
      "Contacts": {
        "Contact": ""
      },
      "Name": {
        "abbreviation": "WUGSC",
        "content": "Washington University, Genome Sequencing Center"
      }
    },
    "access": "public",
    "Description": {
      "Comment": {
        "Paragraph": [
          "Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)).",
          "Keywords: GSC:MIxS;MIGS:6.0"
        ]
      },
      "Organism": {
        "taxonomy_id": 445970,
        "taxonomy_name": "Alistipes putredinis DSM 17216"
      },
      "Title": "Alistipes putredinis DSM 17216"
    },
    "Attributes": {
      "Attribute": [
        {
          "attribute_name": "finishing strategy (depth of coverage)",
          "content": "Level 3: Improved-High-Quality Draft11.6x;20"
        },
cmungall commented 1 month ago

It's not using duckdb, this is just a bug caused by assuming that --path is set

Specifying --path db (or your preferred path) should fix this

@iQuxLE - can we make this a bit more graceful - also there should be no need for individual CLI commands to do any of this kind of configuration, both duckdb and chomodb implementations conform to the same interface, so no need for switches (except perhaps initial instantiation)

iQuxLE commented 1 month ago

This check was initially thought to prevent users from putting a directory path for a duckdb instance. Chroma would be fine with a directory path, but duckdb needs a file path. However we can adapt this within the implementations.

@cmungall By default when using chromadb and not adding a path it will put the chroma.sqlite3 into /curate-gpt/db The behaviour of the duckdb_adapter is currently not the same. I will make a change so when setting no --path it will redirect to /curate-gpt/db/foo.duckdb. If a user gives a directory bar it will add the foo.duckdb so there will be no DuckDB error like duckdb.duckdb.IOException: IO Error: Could not read from file "~/code/curate-gpt/bar": Is a directory and its automatically transformed to ~/code/curate-gpt/bar/foo.duckdb

This way there is no need for a -p when indexing and the switch in the CLI can be deleted.