skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

Ken embeddings RAM and disk usage #960

Closed jeromedockes closed 1 week ago

jeromedockes commented 1 week ago

fetch_ken_embeddings uses too much memory, which causes example 6 to be killed by CircleCI.

One place where a huge table is loaded is the entity types table, which contains all the entity names and their types and takes close to half a GB.

We could split it into smaller files, one for each entity table:

          table  entities_figshare_id  type_figshare_id  unique_types_figshare_id
0  all_entities              39142985          39266300                  40019230
1        albums              39149066          39266300                  45258133
2     companies              39149072          39266300                  45258136
3        movies              39149069          39266300                  45258130
4         games              39254360          39266300                  40019788
5       schools              39149075          39266300                  45258127

so that each table above would have its own type_figshare_id instead of sharing a single one.
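As a rough illustration of the proposed layout, the sketch below looks up a per-table type ID from a small inline copy of the correspondence table. The entity IDs are the ones listed above; the type_figshare_id values are made-up placeholders (the real ones do not exist yet), and type_id_for is a hypothetical helper, not part of skrub.

```python
import csv
import io

# Hypothetical correspondence table after the split.
# The type_figshare_id values are placeholders, NOT real figshare IDs.
CORRESPONDENCE_CSV = """\
table,entities_figshare_id,type_figshare_id
all_entities,39142985,900000
albums,39149066,900001
companies,39149072,900002
"""


def type_id_for(table_name, csv_text=CORRESPONDENCE_CSV):
    """Return the type-file ID registered for ``table_name``."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["table"] == table_name:
            return int(row["type_figshare_id"])
    raise KeyError(f"unknown table: {table_name!r}")


# Only the (small) type file for the requested table would need downloading.
print(type_id_for("albums"))
```

With one ID per row, fetching a single table such as albums would no longer require loading the types of every entity.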

How do we access the skrub figshare account and upload new files?

jeromedockes commented 1 week ago

Here is a script to generate the files in question:

import pathlib
import tempfile

import pandas as pd

from skrub.datasets._ken_embeddings import (
    fetch_ken_embeddings,
    fetch_ken_table_aliases,
    _correspondence_table_url,
    fetch_figshare,
)

# Write the per-table type files to a fresh temporary directory.
out_dir = pathlib.Path(tempfile.mkdtemp(suffix="_ken_types"))
print(f"storing types in {out_dir}")
aliases = fetch_ken_table_aliases()

# All tables currently share one type file, so take its ID from the first row.
correspondence = pd.read_csv(_correspondence_table_url)
embedding_type_id = correspondence["type_figshare_id"].values[0]
emb_type = fetch_figshare(embedding_type_id).X
for table_name in aliases:
    print(table_name)
    # Wrap entity names in angle brackets so they can be matched against the type table.
    entities = fetch_ken_embeddings(embedding_table_id=table_name)["Entity"].apply("<{}>".format)
    # Keep only the types of the entities that appear in this table.
    table_types = emb_type.merge(entities, on="Entity")
    table_types.to_parquet(out_dir / f"{table_name}.parquet", index=False)

print(f"embedding types are in {out_dir}")
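To make the filtering step in the script easier to follow, here is a toy version of the merge on made-up data. The entity names, type names and the "Type" column name are invented for illustration and are only assumed to resemble the real tables.

```python
import pandas as pd

# Toy stand-in for the global type table (made-up entities and types).
emb_type = pd.DataFrame(
    {
        "Entity": ["<Abbey_Road>", "<Thriller>", "<Acme_Corp>"],
        "Type": ["<albums>", "<albums>", "<companies>"],
    }
)

# Toy stand-in for the "Entity" column of one embedding table,
# wrapped in angle brackets the same way as in the script.
entities = pd.Series(["Abbey_Road", "Thriller"], name="Entity").apply("<{}>".format)

# The default inner merge keeps only rows whose Entity appears in both inputs.
table_types = emb_type.merge(entities, on="Entity")
print(table_types)
```

The result holds only the two album rows, which is why each per-table file ends up much smaller than the full type table.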

jeromedockes commented 1 week ago

There are still improvements we could make to the KEN embeddings, but the specific one suggested here has been applied in https://github.com/skrub-data/datasets/pull/10 and CircleCI is now passing again, so I'll close this issue.