Closed jeromedockes closed 1 week ago
Here is a script to generate the files in question:
```python
import pathlib
import tempfile

import pandas as pd

from skrub.datasets._ken_embeddings import (
    fetch_ken_embeddings,
    fetch_ken_table_aliases,
    _correspondence_table_url,
    fetch_figshare,
)

out_dir = pathlib.Path(tempfile.mkdtemp(suffix="_ken_types"))
print(f"storing types in {out_dir}")

aliases = fetch_ken_table_aliases()
correspondence = pd.read_csv(_correspondence_table_url)
embedding_type_id = correspondence["type_figshare_id"].values[0]
emb_type = fetch_figshare(embedding_type_id).X

for table_name in aliases:
    print(table_name)
    entities = fetch_ken_embeddings(embedding_table_id=table_name)[
        "Entity"
    ].apply("<{}>".format)
    table_types = emb_type.merge(entities, on="Entity")
    table_types.to_parquet(out_dir / f"{table_name}.parquet", index=False)

print(f"embedding types are in {out_dir}")
```
There are still improvements we can make for the KEN embeddings, but the specific one suggested here has been applied in https://github.com/skrub-data/datasets/pull/10 and CircleCI is now passing again, so I'll close this issue.
`fetch_ken_embeddings` uses too much memory, which causes example 6 to be killed by CircleCI. One place where a huge table is loaded is the entity types table, which contains all the entity names and their types and takes close to half a GB.

We could cut it up into smaller files, one for each entity table, so that above we would have a different `type_figshare_id` for each table.

How do we access the skrub figshare account / upload new files?
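The proposed split can be sketched on a toy DataFrame (the column names, table names, and entities below are made up for illustration, not taken from the real KEN data, and CSV stands in for the parquet files used above just to keep the sketch dependency-free):

```python
# Sketch: instead of one huge all-entities type table, write one small
# file per entity table, so a consumer loads only the slice it needs.
import pathlib
import tempfile

import pandas as pd

out_dir = pathlib.Path(tempfile.mkdtemp(suffix="_split_types"))

# Stand-in for the ~0.5 GB table mapping every entity to its type.
all_types = pd.DataFrame(
    {
        "Entity": ["<Paris>", "<Lyon>", "<Ford>", "<Toyota>"],
        "Type": ["<city>", "<city>", "<company>", "<company>"],
        "table": ["cities", "cities", "companies", "companies"],
    }
)

# One file per entity table (the real files would be parquet).
for table_name, group in all_types.groupby("table"):
    group.drop(columns="table").to_csv(out_dir / f"{table_name}.csv", index=False)

# A consumer now reads only the file for the table it cares about.
companies_types = pd.read_csv(out_dir / "companies.csv")
print(companies_types["Entity"].tolist())
```

Each per-table file would then get its own `type_figshare_id` row in the correspondence table, and the big combined table would never need to be downloaded at all.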