monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

KG-Hub not showing new builds #500

Closed justaddcoffee closed 6 months ago

justaddcoffee commented 1 year ago

Only two builds are showing up on the KG-Hub for kg-monarch, see here:

[Screenshot: Screen Shot 2023-08-16 at 11 06 34 AM]

Possibly what's going on is that KG-Hub is not expecting dashes in the build ISO date:

(base) ~ $ s3cmd ls s3://kg-hub-public-data/kg-monarch/
                          DIR  s3://kg-hub-public-data/kg-monarch/20221211/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-03-10/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-03-12/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-03-14/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-03-16/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-04-15/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-04-16/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-04-25/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-04-27/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-05-03/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-05-14/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-05-21/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-05-25/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-05-31/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-06-01/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-06-04/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-06-08/
                          DIR  s3://kg-hub-public-data/kg-monarch/2023-06-11/
                          DIR  s3://kg-hub-public-data/kg-monarch/20230122/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/
2023-01-26 18:42          795  s3://kg-hub-public-data/kg-monarch/index.html

Note that only the 20221211 and 20230122 builds are showing up, i.e. the ones without dashes.

One possible fix would be for Monarch to remove the dashes before pushing to KG-Hub's S3 bucket. Alternatively, KG-Hub could be more flexible and also look for build names (ISO dates) with dashes in them.
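A minimal sketch of the first option, assuming build names are always ISO dates written either as YYYY-MM-DD or as YYYYMMDD. The helper name `normalize_build_name` is hypothetical and is not part of monarch-ingest or KG-Hub:

```python
import re
from datetime import datetime

# Accept exactly "YYYY-MM-DD" or "YYYYMMDD"; anything else (e.g. "current") is rejected.
_BUILD_NAME_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d{8}")


def normalize_build_name(name: str) -> str:
    """Hypothetical helper: return the dash-less form of an ISO-date build name."""
    if not _BUILD_NAME_RE.fullmatch(name):
        raise ValueError(f"not an ISO-date build name: {name!r}")
    compact = name.replace("-", "")
    # strptime raises ValueError for impossible dates such as month 13.
    datetime.strptime(compact, "%Y%m%d")
    return compact
```

The same function could in principle serve the second option as well: an indexer that normalizes every directory name before comparing would treat `2023-03-10` and `20230310` as the same build.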

@caufieldjh @kevinschaper what do you think?

caufieldjh commented 1 year ago

Is there anything in the upload process to update the index.html on kg-hub? It's true that the dashes aren't exactly consistent with the other build name formats, but I think the larger issue is that nothing is rewriting the index. The other KG builds generally use the multi-indexer utility - here's how it happens in KG-IDG, for example: https://github.com/Knowledge-Graph-Hub/kg-idg/blob/cd3d6f868d20f85ef11bdd68182ddf949e586b2b/Jenkinsfile#L216-L229

caufieldjh commented 1 year ago

Oh, the cli_utils calls this already: https://github.com/monarch-initiative/monarch-ingest/blob/8816482d94e8f756a6770728cb92b95cdac81785/src/monarch_ingest/cli_utils.py#L390

So something is going wrong there.

monicacecilia commented 6 months ago

Is this still true? Or was it fixed back in August? @kevinschaper

kevinschaper commented 6 months ago

It's still only showing the builds without dashes.

@glass-ships did we look at using a dash-less release destination when copying up to kghub?

glass-ships commented 6 months ago

I thought we fixed that:

if kghub:
    # KG-Hub build names are written without dashes
    kghub_release_ver = release_ver.replace("-", "")
    sh.mkdir("-p", f"{dir}/stats")
    sh.mv(f"{dir}/merged_graph_stats.yaml", f"{dir}/stats")
    # regenerate the index.html files for this release
    sh.multi_indexer(
        *f"-v --directory {dir} --prefix https://kg-hub.berkeleybop.io/kg-monarch/{kghub_release_ver} -x -u".split(
            " "
        )
    )
    # upload to the dated release directory
    sh.gsutil(
        *"-q -m cp -r -a public-read".split(" "),  # make public
        f"{dir}/*",  # source files
        f"s3://kg-hub-public-data/kg-monarch/{kghub_release_ver}",  # destination
    )
    # upload the same files to current/
    sh.gsutil(
        *"-q -m cp -r -a public-read".split(" "),  # make public
        f"{dir}/*",  # source files
        "s3://kg-hub-public-data/kg-monarch/current",  # destination
    )

Maybe this isn't doing what I thought it was?

caufieldjh commented 6 months ago

It must not be because the index.html at s3://kg-hub-public-data/kg-monarch/ hasn't changed since 2023-01-26, and the s3://kg-hub-public-data/kg-monarch/current/index.html hasn't changed since 2023-10-28 (and isn't publicly accessible so it 403's). There's something strange going on with the uploads anyway - it looks like they are going in both /current and the parent directory, so we have both s3://kg-hub-public-data/kg-monarch/20231028/ and s3://kg-hub-public-data/kg-monarch/current/2023-10-28, though the latter is empty.

glass-ships commented 6 months ago

uploading to current/ and <release_version>/ is expected (though not nested like that...) interesting. i'll do some experimenting today

caufieldjh commented 6 months ago

yep, the upload to current/ looks like it works as expected - right now it's identical to the 20231028 release, except for a bunch of empty files that look like they were intended to be directories:

                          DIR  s3://kg-hub-public-data/kg-monarch/current/2023-03-10/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/2023-03-12/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/2023-03-14/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/2023-03-16/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/qc/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/rdf/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/stats/
                          DIR  s3://kg-hub-public-data/kg-monarch/current/transform_output/
2023-04-15 04:02            0  s3://kg-hub-public-data/kg-monarch/current/2023-04-15
2023-04-16 15:53            0  s3://kg-hub-public-data/kg-monarch/current/2023-04-16
2023-04-25 17:09            0  s3://kg-hub-public-data/kg-monarch/current/2023-04-25
2023-04-27 23:54            0  s3://kg-hub-public-data/kg-monarch/current/2023-04-27
2023-05-03 02:21            0  s3://kg-hub-public-data/kg-monarch/current/2023-05-03
2023-05-14 16:10            0  s3://kg-hub-public-data/kg-monarch/current/2023-05-14
2023-05-21 15:42            0  s3://kg-hub-public-data/kg-monarch/current/2023-05-21
2023-05-25 08:49            0  s3://kg-hub-public-data/kg-monarch/current/2023-05-25
2023-05-31 10:39            0  s3://kg-hub-public-data/kg-monarch/current/2023-05-31
2023-06-01 10:50            0  s3://kg-hub-public-data/kg-monarch/current/2023-06-01
2023-06-04 17:12            0  s3://kg-hub-public-data/kg-monarch/current/2023-06-04
2023-06-08 08:44            0  s3://kg-hub-public-data/kg-monarch/current/2023-06-08
2023-06-11 17:33            0  s3://kg-hub-public-data/kg-monarch/current/2023-06-11
2023-08-24 22:44            0  s3://kg-hub-public-data/kg-monarch/current/2023-08-24
2023-09-15 02:32            0  s3://kg-hub-public-data/kg-monarch/current/2023-09-15
2023-09-28 04:36            0  s3://kg-hub-public-data/kg-monarch/current/2023-09-28
2023-10-17 06:46            0  s3://kg-hub-public-data/kg-monarch/current/2023-10-17
2023-10-28 05:26            0  s3://kg-hub-public-data/kg-monarch/current/2023-10-28
2023-10-28 05:26         2263  s3://kg-hub-public-data/kg-monarch/current/index.html
2023-06-11 17:33        31426  s3://kg-hub-public-data/kg-monarch/current/merged_graph_stats.yaml
2023-10-28 05:26   1131743705  s3://kg-hub-public-data/kg-monarch/current/monarch-kg-denormalized-edges.tsv.gz
2023-10-28 05:26   2820126334  s3://kg-hub-public-data/kg-monarch/current/monarch-kg.db.gz
2023-10-28 05:26    183157562  s3://kg-hub-public-data/kg-monarch/current/monarch-kg.jsonl.tar.gz
2023-10-28 05:26   2527486388  s3://kg-hub-public-data/kg-monarch/current/monarch-kg.neo4j.dump
2023-10-28 05:26    487325911  s3://kg-hub-public-data/kg-monarch/current/monarch-kg.nt.gz
2023-10-28 05:26    139174763  s3://kg-hub-public-data/kg-monarch/current/monarch-kg.tar.gz
2023-10-28 05:26   2163565093  s3://kg-hub-public-data/kg-monarch/current/phenio.db.gz
2023-10-28 05:26        56650  s3://kg-hub-public-data/kg-monarch/current/qc_report.yaml
2023-10-28 05:26   5235780125  s3://kg-hub-public-data/kg-monarch/current/solr.tar.gz
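As a rough way to spot these placeholders, the sketch below scans `s3cmd ls` output in the format shown above and reports zero-byte objects whose names look like dashed ISO dates. The function name `find_empty_date_objects` is hypothetical, and the line format is assumed to match the listing in this comment:

```python
import re
from typing import List

# Object lines look like "<date> <time> <size> <s3 url>"; "DIR" lines do not match.
_OBJECT_LINE_RE = re.compile(
    r"^\s*\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}\s+(\d+)\s+(s3://\S+)\s*$"
)
_DATE_NAME_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_empty_date_objects(listing: str) -> List[str]:
    """Hypothetical helper: return URLs of zero-byte objects named like YYYY-MM-DD."""
    found = []
    for line in listing.splitlines():
        match = _OBJECT_LINE_RE.match(line)
        if match is None:
            continue  # skip DIR entries and blank lines
        size, url = int(match.group(1)), match.group(2)
        name = url.rsplit("/", 1)[-1]
        if size == 0 and _DATE_NAME_RE.fullmatch(name):
            found.append(url)
    return found
```

Run against the listing above, this would return the dated zero-byte entries under `current/` and skip `index.html`, the graph files, and the `DIR` rows.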

glass-ships commented 6 months ago

Commenting as a reminder to myself, mostly: it turns out the issue is that boto authentication to the S3 bucket is failing even though the AWS_ environment variables are set. It seems there might be extra steps beyond setting those variables.
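A small diagnostic sketch for narrowing this down. It assumes, without confirmation from this thread, that gsutil may read S3 credentials from the `[Credentials]` section of a boto config file (typically `~/.boto`) rather than only from the environment. The function name `credential_sources` is hypothetical:

```python
import configparser
from pathlib import Path
from typing import Dict, Mapping

# Key names as they would appear in a boto config file; the environment
# variables are assumed to be the upper-case equivalents.
AWS_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def credential_sources(env: Mapping[str, str], boto_path: Path) -> Dict[str, bool]:
    """Hypothetical helper: report where a complete AWS key pair is configured."""
    env_ok = all(env.get(key.upper()) for key in AWS_KEYS)

    boto_ok = False
    if boto_path.is_file():
        # interpolation=None so "%" characters in secrets are read literally
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(boto_path)
        boto_ok = parser.has_section("Credentials") and all(
            parser.get("Credentials", key, fallback="") for key in AWS_KEYS
        )
    return {"env": env_ok, "boto_file": boto_ok}
```

Calling `credential_sources(os.environ, Path.home() / ".boto")` on the build machine (after `import os`) would show whether the keys are present only in the environment, which would be consistent with the failure described here if the assumption about the boto file holds.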