Open vv111y opened 2 months ago
I removed the cache, double-checked .env, and tried the following minimal settings.yaml, and still get the same error.
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: meta-llama/Llama-3-8b-chat-hf
  api_base: https://api.together.xyz/v1
embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: togethercomputer/m2-bert-80M-2k-retrieval
    api_base: https://api.together.xyz/v1
chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"
cache:
  type: file # or blob
  base_dir: "cache"
storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0
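Since the config above depends on ${GRAPHRAG_API_KEY} being set, a quick sanity check is to list placeholders that have no value in the process environment. This is a minimal sketch, not graphrag's own loader: the function name is hypothetical, it only inspects os.environ (it does not read .env itself), and it assumes ${timestamp} is filled in at run time rather than from the environment.

```python
import os
import re

# Matches ${NAME} placeholders such as ${GRAPHRAG_API_KEY}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Assumption: ${timestamp} in the output paths is substituted at run time,
# so it is not expected to be an environment variable.
RUNTIME_PLACEHOLDERS = {"timestamp"}


def unresolved_placeholders(config_text, env=None):
    """Return the sorted names of ${VAR} placeholders with no value in env."""
    env = os.environ if env is None else env
    names = set(PLACEHOLDER.findall(config_text))
    return sorted(n for n in names - RUNTIME_PLACEHOLDERS if not env.get(n))


# Example usage against a settings file on disk:
# with open("settings.yaml", encoding="utf-8") as fh:
#     print(unresolved_placeholders(fh.read()))
```

An empty list only means the variables are visible to the current process; it says nothing about whether the key itself is valid.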
Are you trying to run indexing using the command line interface, i.e. python -m graphrag.indexing ...?
I added a change last week that should address your problem. A new command line flag --overlay-defaults will be available that inherits default values (i.e. the workflow steps that are missing from your yaml) in addition to the custom values that your config has declared.
You can either build the python package from source (run poetry build from the root directory of this repo and re-install the wheel) or wait until the next release to start using this new feature.
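The idea of inheriting defaults while keeping custom values can be pictured as a recursive dictionary merge. The sketch below illustrates that general technique only; it is not taken from the graphrag source, and the function name and all values in it are hypothetical.

```python
import copy


def overlay(defaults, custom):
    """Return a copy of defaults with custom values layered on top, recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = overlay(merged[key], value)
        else:
            # Scalars, lists, and new keys: the custom value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
defaults = {"chunks": {"size": 1000, "overlap": 50}, "snapshots": {"graphml": False}}
custom = {"chunks": {"size": 300}}
merged = overlay(defaults, custom)
print(merged)
```

Here the custom chunk size replaces the default, while the overlap and the snapshots section, which the custom config never mentions, are carried over from the defaults.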
right, I should have specified that.
python -m graphrag.index --config <some-settings.yaml> --root .
To be clear, I tried multiple settings.yaml files, including ones that specified execution of all work units. All resulted in no workflow steps.
I'm installing via the main branch and can use --overlay-defaults.
pip install git+https://github.com/microsoft/graphrag@main
But settings are still being ignored. --overlay-defaults seems to act as a band-aid for some settings. For example, when I add
embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832
umap:
  enabled: true # if true, will generate UMAP embeddings for nodes
snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true
embed_graph, graphml, raw_entities, umap, and top_level_nodes are not being generated.
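One way to confirm which outputs are missing after a run is to scan the artifacts folder for files matching each expected output. This is a rough sketch: the function name is hypothetical, and the glob patterns in the usage comment are guesses at how the snapshot files might be named, not names confirmed by this thread.

```python
from pathlib import Path


def missing_outputs(artifacts_dir, expected_patterns):
    """Return the glob patterns that match no file under artifacts_dir."""
    root = Path(artifacts_dir)
    if not root.is_dir():
        # Nothing was written at all, so every pattern is missing.
        return list(expected_patterns)
    return [p for p in expected_patterns if not any(root.rglob(p))]


# Example usage; the directory and the patterns are assumptions, adjust to your run:
# print(missing_outputs("output/<timestamp>/artifacts",
#                       ["*.graphml", "*raw*entities*", "*top_level_nodes*"]))
```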
Additionally, when I try a local search, a lancedb dataset seems to be missing (see the first line below). I wonder whether the last line is an issue with running in Colab, and maybe a separate issue.
[2024-07-12T14:44:16Z WARN lance::dataset] No existing dataset at /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance, it will be created
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/__main__.py", line 76, in <module>
    run_local_search(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/cli.py", line 132, in run_local_search
    store_entity_semantic_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/input/loaders/dfs.py", line 91, in store_entity_semantic_embeddings
    vectorstore.load_documents(documents=documents)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/vector_stores/lancedb.py", line 55, in load_documents
    self.document_collection = self.db_connection.create_table(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/db.py", line 418, in create_table
    tbl = LanceTable.create(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/table.py", line 1545, in create
    lance.write_dataset(empty, tbl._dataset_uri, schema=schema, mode=mode)
  File "/usr/local/lib/python3.10/dist-packages/lance/dataset.py", line 2506, in write_dataset
    inner_ds = _write_dataset(reader, uri, params)
OSError: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/.tmp_1.manifest_add4893a-5209-4899-81ae-c25465719626 to /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/1.manifest: Function not implemented (os error 38), /home/runner/work/lance/lance/rust/lance-table/src/io/commit.rs:692:54
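On Linux, os error 38 is ENOSYS ("Function not implemented"), and the failing path sits under /content/drive, the Google Drive mount in Colab. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the mounted filesystem does not support the operation Lance uses to commit its manifest. If so, copying the run folder to local disk and querying from the copy might avoid the error. A minimal sketch with a hypothetical helper name:

```python
import shutil
from pathlib import Path


def stage_run_locally(run_dir, local_root):
    """Copy run_dir under local_root and return the path of the copy."""
    source = Path(run_dir)
    target = Path(local_root) / source.name
    # dirs_exist_ok lets the copy be refreshed on repeated calls (Python 3.8+).
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Example usage; the local destination is a hypothetical path:
# local_run = stage_run_locally(
#     "/content/drive/MyDrive/OrangePro/runs/2024-07-10", "/content/local_runs")
# Then point the query's root or data path at local_run instead of the Drive path.
```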
@jgbradley1 the issue is only partially fixed: several artifacts are still not produced, so the settings file is being ignored, at least in part. Could the cause be something like malformed whitespace in the yaml? Just guessing now.
As posted above,
embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832
umap:
  enabled: true # if true, will generate UMAP embeddings for nodes
snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true
embed_graph, graphml, raw_entities, umap, and top_level_nodes are not being generated.
Having a similar issue. The default settings.yaml works fine. I tried prompt tuning and put the tuned prompts inside a prompts_tuned folder, then copied settings.yaml to settings_prompts_tuned.yaml and updated all the prompt, cache, and output paths. When I index, there are two issues: 1. the workflow is empty, and 2. indexing-engine.log is still generated inside the output folder instead of the output_prompts_tuned folder, while logs.json is generated inside the output_prompts_tuned folder.
After a bit of debugging, I found that --config and --overlay-defaults have to be used together; using only --config causes the empty workflow issue. Also, the indexing-engine.log path is hard-coded to the output folder in the _enable_logging() function. My experiment is based on commit c749fe2.
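To avoid forgetting the second flag, the invocation can be assembled in one place. This is a small sketch using only the module name and flags quoted in this thread (graphrag.index, --config, --root, --overlay-defaults); the helper name is hypothetical, and the behaviour described applies to the commit mentioned above, so later releases may differ.

```python
import subprocess
import sys


def build_index_command(config_path, root=".", overlay_defaults=True):
    """Build the argument list for running the indexer with a custom config."""
    cmd = [sys.executable, "-m", "graphrag.index",
           "--config", str(config_path), "--root", str(root)]
    if overlay_defaults:
        # Per the finding above, a custom --config alone yields an empty workflow.
        cmd.append("--overlay-defaults")
    return cmd


# Example usage (uncomment to actually run the indexer):
# subprocess.run(build_index_command("settings_prompts_tuned.yaml"), check=True)
```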
Describe the bug
Using Google Colab. I used several different settings.yaml files to try to get it to work, including the initial stock one with a .env file. One time, starting in a new folder from scratch, it partly worked (it errored out before all workflow tasks were done), but after that the problem persisted. I can see no pattern for the cause. Please see indexing-engine.log.
Steps to reproduce
note, error:
Expected Behavior
The workflow list should be fully populated and all tasks should run correctly. At best I have had only a few partial runs; now nothing is done.
GraphRAG Config Used
Logs and screenshots
indexing-engine.log
Additional Information