Closed justaddcoffee closed 6 months ago
Is there anything in the upload process to update the index.html on kg-hub? It's true that the dashes aren't exactly consistent with the other build name formats, but I think the larger issue is that nothing is rewriting the index. The other KG builds generally use the multi-indexer utility - here's how it happens in KG-IDG, for example: https://github.com/Knowledge-Graph-Hub/kg-idg/blob/cd3d6f868d20f85ef11bdd68182ddf949e586b2b/Jenkinsfile#L216-L229
Oh, the cli_utils calls this already: https://github.com/monarch-initiative/monarch-ingest/blob/8816482d94e8f756a6770728cb92b95cdac81785/src/monarch_ingest/cli_utils.py#L390
So something is going wrong there.
Is this still true? Or was it fixed back in August? @kevinschaper
It's still only showing the builds without dashes.
@glass-ships did we look at using a dash-less release destination when copying up to kghub?
I thought we fixed that:
    if kghub:
        # KG-Hub build names carry no dashes, e.g. 2023-10-28 -> 20231028
        kghub_release_ver = release_ver.replace("-", "")
        sh.mkdir("-p", f"{dir}/stats")
        sh.mv(f"{dir}/merged_graph_stats.yaml", f"{dir}/stats")
        # Write index.html files for the release directory
        sh.multi_indexer(
            *f"-v --directory {dir} --prefix https://kg-hub.berkeleybop.io/kg-monarch/{kghub_release_ver} -x -u".split(
                " "
            )
        )
        # Upload to the versioned release directory
        sh.gsutil(
            *"-q -m cp -r -a public-read".split(" "),  # make public
            f"{dir}/*",  # source files
            f"s3://kg-hub-public-data/kg-monarch/{kghub_release_ver}",  # destination
        )
        # Upload the same files to current/
        sh.gsutil(
            *"-q -m cp -r -a public-read".split(" "),  # make public
            f"{dir}/*",  # source files
            "s3://kg-hub-public-data/kg-monarch/current",  # destination
        )
Maybe this isn't doing what I thought it was?
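One way to check what the snippet above would send to gsutil, without touching the bucket, is to build the argument lists separately and print them. This is only a sketch: `kghub_upload_args` is a hypothetical helper, not something in monarch-ingest, and it assumes `cp` is the intended gsutil subcommand for both uploads.

```python
from typing import List


def kghub_upload_args(directory: str, release_ver: str) -> List[List[str]]:
    """Return the gsutil argument lists for the versioned and current/ uploads.

    Hypothetical helper for inspection only; nothing is executed here.
    """
    base = "s3://kg-hub-public-data/kg-monarch"
    # Same dash stripping as in the snippet above.
    kghub_release_ver = release_ver.replace("-", "")
    common = "-q -m cp -r -a public-read".split(" ")
    return [
        [*common, f"{directory}/*", f"{base}/{kghub_release_ver}"],
        [*common, f"{directory}/*", f"{base}/current"],
    ]


if __name__ == "__main__":
    for args in kghub_upload_args("output", "2023-10-28"):
        print("gsutil", " ".join(args))
```

Printing the commands makes it easy to confirm the destinations are the dash-less release directory and `current`, and nothing nested under either.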
It must not be, because the index.html at s3://kg-hub-public-data/kg-monarch/ hasn't changed since 2023-01-26, and the s3://kg-hub-public-data/kg-monarch/current/index.html hasn't changed since 2023-10-28 (and isn't publicly accessible, so it 403s).
There's something strange going on with the uploads anyway - it looks like they are going in both /current and the parent directory, so we have both s3://kg-hub-public-data/kg-monarch/20231028/ and s3://kg-hub-public-data/kg-monarch/current/2023-10-28, though the latter is empty.
Uploading to current/ and <release_version>/ is expected (though not nested like that...)
interesting. i'll do some experimenting today
Yep, the upload to current/ looks like it works as expected - right now it's identical to the 20231028 release, except for a bunch of empty files that look like they were intended to be directories:
DIR s3://kg-hub-public-data/kg-monarch/current/2023-03-10/
DIR s3://kg-hub-public-data/kg-monarch/current/2023-03-12/
DIR s3://kg-hub-public-data/kg-monarch/current/2023-03-14/
DIR s3://kg-hub-public-data/kg-monarch/current/2023-03-16/
DIR s3://kg-hub-public-data/kg-monarch/current/qc/
DIR s3://kg-hub-public-data/kg-monarch/current/rdf/
DIR s3://kg-hub-public-data/kg-monarch/current/stats/
DIR s3://kg-hub-public-data/kg-monarch/current/transform_output/
2023-04-15 04:02 0 s3://kg-hub-public-data/kg-monarch/current/2023-04-15
2023-04-16 15:53 0 s3://kg-hub-public-data/kg-monarch/current/2023-04-16
2023-04-25 17:09 0 s3://kg-hub-public-data/kg-monarch/current/2023-04-25
2023-04-27 23:54 0 s3://kg-hub-public-data/kg-monarch/current/2023-04-27
2023-05-03 02:21 0 s3://kg-hub-public-data/kg-monarch/current/2023-05-03
2023-05-14 16:10 0 s3://kg-hub-public-data/kg-monarch/current/2023-05-14
2023-05-21 15:42 0 s3://kg-hub-public-data/kg-monarch/current/2023-05-21
2023-05-25 08:49 0 s3://kg-hub-public-data/kg-monarch/current/2023-05-25
2023-05-31 10:39 0 s3://kg-hub-public-data/kg-monarch/current/2023-05-31
2023-06-01 10:50 0 s3://kg-hub-public-data/kg-monarch/current/2023-06-01
2023-06-04 17:12 0 s3://kg-hub-public-data/kg-monarch/current/2023-06-04
2023-06-08 08:44 0 s3://kg-hub-public-data/kg-monarch/current/2023-06-08
2023-06-11 17:33 0 s3://kg-hub-public-data/kg-monarch/current/2023-06-11
2023-08-24 22:44 0 s3://kg-hub-public-data/kg-monarch/current/2023-08-24
2023-09-15 02:32 0 s3://kg-hub-public-data/kg-monarch/current/2023-09-15
2023-09-28 04:36 0 s3://kg-hub-public-data/kg-monarch/current/2023-09-28
2023-10-17 06:46 0 s3://kg-hub-public-data/kg-monarch/current/2023-10-17
2023-10-28 05:26 0 s3://kg-hub-public-data/kg-monarch/current/2023-10-28
2023-10-28 05:26 2263 s3://kg-hub-public-data/kg-monarch/current/index.html
2023-06-11 17:33 31426 s3://kg-hub-public-data/kg-monarch/current/merged_graph_stats.yaml
2023-10-28 05:26 1131743705 s3://kg-hub-public-data/kg-monarch/current/monarch-kg-denormalized-edges.tsv.gz
2023-10-28 05:26 2820126334 s3://kg-hub-public-data/kg-monarch/current/monarch-kg.db.gz
2023-10-28 05:26 183157562 s3://kg-hub-public-data/kg-monarch/current/monarch-kg.jsonl.tar.gz
2023-10-28 05:26 2527486388 s3://kg-hub-public-data/kg-monarch/current/monarch-kg.neo4j.dump
2023-10-28 05:26 487325911 s3://kg-hub-public-data/kg-monarch/current/monarch-kg.nt.gz
2023-10-28 05:26 139174763 s3://kg-hub-public-data/kg-monarch/current/monarch-kg.tar.gz
2023-10-28 05:26 2163565093 s3://kg-hub-public-data/kg-monarch/current/phenio.db.gz
2023-10-28 05:26 56650 s3://kg-hub-public-data/kg-monarch/current/qc_report.yaml
2023-10-28 05:26 5235780125 s3://kg-hub-public-data/kg-monarch/current/solr.tar.gz
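To pick the zero-byte, date-named entries out of a listing like the one above, a small parser could be used. This is a sketch that assumes the listing format shown here (`DIR <url>` for prefixes, `date time size url` for objects); `empty_release_placeholders` is a hypothetical name.

```python
import re
from typing import List

# Object lines look like "YYYY-MM-DD HH:MM <size> <url>"; "DIR <url>" lines do not match.
OBJECT_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}\s+(\d+)\s+(s3://\S+)$")
DATE_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def empty_release_placeholders(listing: str) -> List[str]:
    """Return URLs of zero-byte objects whose final path segment is a dashed date."""
    found = []
    for line in listing.splitlines():
        match = OBJECT_LINE.match(line.strip())
        if not match:
            continue
        size, url = int(match.group(1)), match.group(2)
        name = url.rsplit("/", 1)[-1]
        if size == 0 and DATE_NAME.match(name):
            found.append(url)
    return found


if __name__ == "__main__":
    sample = """\
DIR s3://kg-hub-public-data/kg-monarch/current/stats/
2023-10-28 05:26 0 s3://kg-hub-public-data/kg-monarch/current/2023-10-28
2023-10-28 05:26 2263 s3://kg-hub-public-data/kg-monarch/current/index.html
"""
    for url in empty_release_placeholders(sample):
        print(url)
```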
Commenting as a reminder to myself, mostly:
It turns out the issue is that boto authentication to the S3 bucket is failing. It seems there might be extra steps beyond setting the AWS_ environment variables.
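A quick diagnostic for the build environment might look like the sketch below. It only reports which `AWS_`-prefixed variables are present (names, not values) and whether a boto config file exists at the location gsutil reads by default (`~/.boto`, or the path in `BOTO_CONFIG`). The function name is hypothetical, and this does not establish what the missing step is.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def s3_credential_report(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Union[str, Path]] = None,
) -> dict:
    """Summarise credential-related settings without exposing secret values."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # gsutil reads its boto config from BOTO_CONFIG if set, else ~/.boto.
    boto_path = Path(env.get("BOTO_CONFIG") or home / ".boto")
    return {
        "aws_env_vars": sorted(k for k in env if k.startswith("AWS_")),
        "boto_config_path": str(boto_path),
        "boto_config_exists": boto_path.is_file(),
    }


if __name__ == "__main__":
    for key, value in s3_credential_report().items():
        print(f"{key}: {value}")
```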
Only two builds are showing up on the KG-Hub for kg-monarch, see here:
Possibly what's going on is that KG-Hub is not expecting dashes in the build ISO date:
Note above that only the 20221211 and 20230122 builds are showing up, i.e. the ones without dashes.
One possible fix would be for Monarch to remove dashes before pushing to KG-Hub's s3 bucket. Alternatively, KG-Hub could be more flexible and look for build names (ISO dates) with dashes in them
@caufieldjh @kevinschaper what do you think?
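Either option comes down to treating `2023-10-28` and `20231028` as the same build name. A minimal sketch of that normalization follows, assuming build names are plain ISO dates; `normalize_build_name` is a hypothetical helper, not part of KG-Hub or monarch-ingest.

```python
import re
from datetime import datetime
from typing import Optional

# Accept YYYYMMDD or YYYY-MM-DD; the backreference rejects mixed forms like 2023-1028.
BUILD_NAME = re.compile(r"^(\d{4})(-?)(\d{2})\2(\d{2})$")


def normalize_build_name(name: str) -> Optional[str]:
    """Return the dash-less form of a date-like build name, or None if it is not one."""
    match = BUILD_NAME.match(name)
    if not match:
        return None
    compact = match.group(1) + match.group(3) + match.group(4)
    try:
        # Reject strings that look like dates but are not valid calendar days.
        datetime.strptime(compact, "%Y%m%d")
    except ValueError:
        return None
    return compact


if __name__ == "__main__":
    for candidate in ["20221211", "2023-10-28", "current"]:
        print(candidate, "->", normalize_build_name(candidate))
```

On the Monarch side this would be applied before uploading; on the KG-Hub side it could be applied when deciding which directory names count as builds.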