sg-wbi / belb

Biomedical Entity Linking Benchmark

Gene KB subset #1

Open Chanwhistle opened 4 months ago

Chanwhistle commented 4 months ago

Hi, I need to find the NCBI Gene subsets for nlm-gene and gnormplus. How did you make the subsets of this KB?


sg-wbi commented 3 months ago

Hi,

thank you for your interest and sorry for the late reply.

If you have already built the KBs (python -m belb.scripts.build_kbs --dir path/to/dir --cores 20), you should be able to load the KB restricted to a subset like this:

from belb import AutoBelbKb
from belb.resources import Kbs

kb = AutoBelbKb.from_name(
    name=Kbs.NCBI_GENE.name,
    directory="path/to/dir",  # same directory passed to build_kbs
    db_config="./db.yaml",
    subset="gnormplus",
)

To get the actual NCBI Taxonomy identifiers of the species in the subset:

from belb.resources import Kbs
from belb.utils import load_kb_subsets

subsets = load_kb_subsets(Kbs.NCBI_GENE.name)
print(subsets['gnormplus'])
# [41856, 9986, 10116, 11908, 9606, 3847, 7955, 11676, 4896, 8355, 511145, 8364, 9913, 10298, 7227, 559292, 333760, 9031, 6239, 9823, 10090, 3702]
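Since a subset is just a list of NCBI Taxonomy IDs, restricting gene entries to it comes down to a membership check on the species. Here is a minimal sketch of that idea. It does not use belb: the GeneEntry records, the filter_by_species helper and the example entries are all hypothetical, made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class GeneEntry(NamedTuple):
    """Hypothetical gene record: an identifier, a symbol and its species."""
    gene_id: int
    symbol: str
    tax_id: int


def filter_by_species(entries: Iterable[GeneEntry], tax_ids: Iterable[int]) -> List[GeneEntry]:
    """Keep only the entries whose species is in the given taxonomy IDs."""
    allowed = set(tax_ids)  # set lookup keeps the check O(1) per entry
    return [e for e in entries if e.tax_id in allowed]


# A few IDs taken from the gnormplus list above
subset = [9606, 10090, 7227]

# Made-up entries; the last one belongs to a species outside the subset
entries = [
    GeneEntry(1, "geneA", 9606),
    GeneEntry(2, "geneB", 10090),
    GeneEntry(3, "geneC", 4932),
]

kept = filter_by_species(entries, subset)
print([e.symbol for e in kept])
```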

The subset data is stored in this JSON file in the repository.
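If you would rather read the subsets without going through belb, something along these lines might work. This is only a sketch: I am assuming the JSON for a KB maps each subset name to a list of taxonomy IDs (which matches what load_kb_subsets returns above), and the subset_tax_ids helper and the inline JSON string are hypothetical stand-ins, so check the real file layout first.

```python
import json
from typing import List


def subset_tax_ids(raw_json: str, subset_name: str) -> List[int]:
    """Return the taxonomy IDs for one subset from a JSON document.

    Assumes a layout like {"subset_name": [tax_id, ...], ...}.
    """
    data = json.loads(raw_json)
    if subset_name not in data:
        raise KeyError(f"unknown subset: {subset_name!r}")
    return [int(tax_id) for tax_id in data[subset_name]]


# Stand-in for the file contents, with shortened lists for illustration
raw = '{"gnormplus": [9606, 10090, 7227], "nlm_gene": [9606, 10116]}'

ids = subset_tax_ids(raw, "gnormplus")
print(ids)
```

With the real file you would pass the text read from disk instead of the inline string.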