gyorilab / indra_cogex

INDRA Context Graph Extension
BSD 2-Clause "Simplified" License

FileNotFoundError: No such file: ~/.data/indra/db/processed_statements.tsv.gz #174

Open stephanmg opened 1 month ago

stephanmg commented 1 month ago

I followed the instructions in the README.md, but when running the import.sh script I received the following error:

(indra) sgrein@iru-code:~/Code/indra/indra_cogex$ python -m indra_cogex.sources --process --assemble
/home/sgrein/Code/indra/indra_cogex/venv/indra/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Checking bgee
Identified node paths for assembly: ['/home/sgrein/.data/indra/cogex/bgee/nodes.pkl']
Identified node paths for import: []
Loading cached nodes from /home/sgrein/.data/indra/cogex/bgee/nodes.pkl
Checking indra_db_evidence
Identified node paths for assembly: ['/home/sgrein/.data/indra/cogex/indra_db_evidence/nodes_Publication.pkl']
Identified node paths for import: ['/home/sgrein/.data/indra/cogex/indra_db_evidence/nodes_Evidence.tsv.gz']
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sgrein/Code/indra/indra_cogex/src/indra_cogex/sources/__main__.py", line 8, in <module>
    main()
  File "/home/sgrein/Code/indra/indra_cogex/venv/indra/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/sgrein/Code/indra/indra_cogex/venv/indra/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/sgrein/Code/indra/indra_cogex/venv/indra/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sgrein/Code/indra/indra_cogex/venv/indra/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/sgrein/Code/indra/indra_cogex/src/indra_cogex/sources/cli.py", line 131, in main
    processor = processor_cls(**config.get(processor_cls.name, {}))
  File "/home/sgrein/Code/indra/indra_cogex/src/indra_cogex/sources/indra_db/__init__.py", line 204, in __init__
    raise FileNotFoundError(f"No such file: {self.stmt_fname}")
FileNotFoundError: No such file: /home/sgrein/.data/indra/db/processed_statements.tsv.gz

Steps to produce this error for me: Follow provided instructions in README.md. Any advice is appreciated.

Python version (if that matters): 3.10.12
OS: Ubuntu 22.04.4 LTS
Revision of indra_cogex: HEAD as of today (June 14, 2024)

Also, gilda and biomappings are not in the list of dependencies, so I first had to run pip3 install gilda biomappings.
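
A quick pre-flight check along these lines could surface both problems before the pipeline starts. It is only a sketch: the path is the one from the traceback above, the module names are the two packages mentioned here, and `preflight` is a hypothetical helper, not part of indra_cogex.

```python
from importlib.util import find_spec
from pathlib import Path

# Path taken from the traceback above; adjust if your data directory differs.
STMT_FILE = Path.home() / ".data" / "indra" / "db" / "processed_statements.tsv.gz"
# Packages that had to be installed manually in this report.
EXTRA_MODULES = ("gilda", "biomappings")


def preflight(stmt_file=STMT_FILE, modules=EXTRA_MODULES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # The indra_db processor raises FileNotFoundError when this file is absent.
    if not Path(stmt_file).is_file():
        problems.append(f"missing input file: {stmt_file}")
    # find_spec returns None for a top-level module that cannot be imported.
    for name in modules:
        if find_spec(name) is None:
            problems.append(f"missing Python package: {name} (try: pip install {name})")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

An empty result only means the file exists and the packages are importable; it says nothing about whether the file contents are valid.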

stephanmg commented 1 month ago

I got past this error, but now a new one emerged:

Checking gwas
Identified node paths for assembly: ['/home/sgrein/.data/indra/cogex/gwas/nodes.pkl']
Identified node paths for import: []
INFO: [2024-06-14 18:40:23] pystow.utils - downloading with urllib from https://www.ebi.ac.uk/gwas/api/search/downloads/full to /home/sgrein/.data/indra/cogex/gwas/associations.tsv
/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pystow/impl.py:632: DtypeWarning: Columns (11,12,13,23) have mixed types. Specify dtype option on import or set low_memory=False.
  return pd.read_csv(path, **_clean_csv_kwargs(read_csv_kwargs))
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sgrein/Code/indra/indra_cogex_latest/src/indra_cogex/sources/__main__.py", line 8, in <module>
    main()
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/sgrein/Code/indra/indra_cogex_latest/src/indra_cogex/sources/cli.py", line 146, in main
    processor = processor_cls(**config.get(processor_cls.name, {}))
  File "/home/sgrein/Code/indra/indra_cogex_latest/src/indra_cogex/sources/gwas/__init__.py", line 34, in __init__
    self.df = load_data(GWAS_URL)
  File "/home/sgrein/Code/indra/indra_cogex_latest/src/indra_cogex/sources/gwas/__init__.py", line 94, in load_data
    df = SUBMODULE.ensure_csv(url=url, name="associations.tsv", force=force)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pystow/impl.py", line 632, in ensure_csv
    return pd.read_csv(path, **_clean_csv_kwargs(read_csv_kwargs))
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1968, in read
    df = DataFrame(
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 152, in arrays_to_mgr
    return create_block_manager_from_column_arrays(
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 2144, in create_block_manager_from_column_arrays
    mgr._consolidate_inplace()
  File "/home/sgrein/Code/indra/indra_cogex_latest/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 1791, in _consolidate_inplace
    self._rebuild_blknos_and_blklocs()
  File "internals.pyx", line 784, in pandas._libs.internals.BlockManager._rebuild_blknos_and_blklocs
AssertionError: Gaps in blk ref_locs
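
Both the DtypeWarning and the AssertionError arise while pandas parses the downloaded associations.tsv. One way to rule out a malformed download before involving pandas is to inspect the file with the standard library. The sketch below is an assumption about a useful diagnostic, not a confirmed fix; `inspect_tsv` is a hypothetical helper, and the path is the one shown in the log.

```python
import csv
from collections import Counter
from pathlib import Path

# Download location reported by pystow in the log above.
GWAS_FILE = Path.home() / ".data" / "indra" / "cogex" / "gwas" / "associations.tsv"


def inspect_tsv(path):
    """Summarize the header and the row widths of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text, so stray quotes
        # in free-text fields do not merge several lines into one row.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        # Map "number of fields in a row" to "how many rows have that width".
        widths = Counter(len(row) for row in reader)
    duplicates = sorted(name for name, count in Counter(header).items() if count > 1)
    return {
        "n_columns": len(header),
        "duplicate_columns": duplicates,
        "row_widths": dict(widths),
    }


if __name__ == "__main__":
    print(inspect_tsv(GWAS_FILE))
```

If `row_widths` contains more than one key, or `duplicate_columns` is non-empty, the file itself is suspect, and deleting the cached copy so that it is downloaded again may be worth trying. A clean report would instead point toward the pandas installation or version.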

stephanmg commented 1 month ago

Additional info: pip freeze:


adeft==0.12.3
aniso8601==9.0.1
appdirs==1.4.4
asttokens==2.4.1
attrs==23.2.0
beautifulsoup4==4.12.3
biomappings==0.3.7
bioontologies==0.4.3
bioregistry==0.11.8
bioversions==0.5.403
blinker==1.8.2
boto3==1.34.126
botocore==1.34.126
cachier==3.0.0
certifi==2024.6.2
charset-normalizer==3.3.2
chembl_downloader==0.4.5
class_resolver==0.4.3
click==8.1.7
click-default-group==1.2.4
curies==0.7.9
dataclasses-json==0.6.7
decorator==5.1.1
defusedxml==0.7.1
drugbank-downloader==0.1.1
enum34==1.1.10
exceptiongroup==1.2.1
executing==2.0.1
fairsharing-client==0.1.0
Flask==3.0.3
flask-restx==1.3.0
future==1.0.0
gilda==1.2.1
humanize==4.9.0
idna==3.7
ijson==3.3.0
importlib_resources==6.4.0
indra==1.22.0
-e git+https://github.com/bgyori/indra_cogex@40602ddeb5b3fbf7496785cb7cb75672312b23f5#egg=indra_cogex
ipython==8.25.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
lxml==5.2.2
MarkupSafe==2.1.5
marshmallow==3.21.3
matplotlib-inline==0.1.7
more-click==0.1.2
more-itertools==10.3.0
mpmath==1.3.0
mypy-extensions==1.0.0
ndex2==2.0.1
neo4j==5.21.0
networkx==3.3
nltk==3.8.1
numpy==1.26.4
objectpath==0.6.1
obonet==1.0.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
patsy==0.5.6
pexpect==4.9.0
portalocker==2.8.2
prompt_toolkit==3.0.47
protmapper==0.0.29
psycopg2-binary==2.9.9
ptyprocess==0.7.0
pure-eval==0.2.2
pybiopax==0.1.5
pydantic==1.10.16
Pygments==2.18.0
pyobo==0.10.11
pysb==1.16.0
pysolr==3.9.0
pystow==0.5.4
python-dateutil==2.9.0.post0
PyTrie==0.4.0
pytz==2024.1
PyYAML==6.0.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
requests-ftp==0.3.1
requests-toolbelt==1.0.0
rpds-py==0.18.1
s3transfer==0.10.1
scikit-learn==1.4.2
scipy==1.13.1
six==1.16.0
sortedcontainers==2.4.0
soupsieve==2.5
stack-data==0.6.3
statsmodels==0.14.2
sympy==1.11.1
tabulate==0.9.0
threadpoolctl==3.5.0
tqdm==4.66.4
traitlets==5.14.3
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.1
umls_downloader==0.1.3
Unidecode==1.3.8
urllib3==2.2.1
watchdog==4.0.1
wcwidth==0.2.13
Werkzeug==3.0.3
zenodo_client==0.3.4

bgyori commented 1 month ago

Hi @stephanmg, the sources here include ones that are publicly available as standalone resources and some that are not. The processed_statements.tsv.gz file is a custom file exported from the INDRA DB that we haven't published yet. Its processed content is available via discovery.indra.bio's API though.

stephanmg commented 1 month ago

Hi @bgyori, thanks for your quick answer.

I was previously successful in building some of the indra_cogex sources. However, we need the knowledge graph to include the indra_rel relationship. That is why I tried to build all sources, and as you can see above, I am running into other errors as well.

We currently have a knowledge graph with the following relations: 'variant_phenotype_association', 'expressed_in', 'has_domain', 'has_marker', 'phenotype_has_gene', 'associated_with', 'isa', 'haspart', 'mutated_in', 'copy_number_altered_in', 'sensitive_to'

But we also need indra_rel, which seems to come from the indra_db source. Is there any chance I can get that source working and import it into Neo4j on our infrastructure?

We are aware of the REST API, but we need the Neo4j graph itself.