monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Fix upload and index to KGHub #580

Closed glass-ships closed 3 months ago

glass-ships commented 3 months ago

Closes #500

I'd like to get @caufieldjh's and @justaddcoffee's input on this PR before merging, just to be sure.

In particular, the multi-indexer step was previously the first thing run... this should go after uploading the files, correct?

caufieldjh commented 3 months ago

> In particular, the multi-indexer step was previously the first thing run... this should go after uploading the files, correct?

Nope, the other way around. The multi-indexer does the indexing locally, so when you upload a full directory everything, including the index, goes with it

glass-ships commented 3 months ago

> Nope, the other way around. The multi-indexer does the indexing locally, so when you upload a full directory everything, including the index, goes with it

I see... That's the way I had run it locally with multi indexing coming first, and upload seemed to run successfully, but I still don't see anything on https://kg-hub.berkeleybop.io/kg-monarch/ Are there additional steps required to have it show up? (Do you even see it on the S3 bucket?)

caufieldjh commented 3 months ago

> I see... That's the way I had run it locally with multi indexing coming first, and upload seemed to run successfully, but I still don't see anything on https://kg-hub.berkeleybop.io/kg-monarch/ Are there additional steps required to have it show up? (Do you even see it on the S3 bucket?)

Yes, I see the new builds from 20240314 and 20240318 on the bucket! Success!

The missing detail is that the multi-indexer also needs to be called on the root project index. In KG-Phenio that happens here, for instance: https://github.com/Knowledge-Graph-Hub/kg-phenio/blob/3dcb5bff41e85087ee34977b60252d2b39f5858e/Jenkinsfile#L191-L198 That way the index at https://kghub.io/kg-monarch/index.html will reflect the most recent uploads.

Some of these builds, including KG-Phenio, have an additional cache invalidation step at the end so the changes become immediately visible, but I'm not 100% certain that's necessary since the individual uploads use the --cf-invalidate flag.

glass-ships commented 3 months ago

Nice! Ok. 20240314 was a test upload and should probably be deleted if possible, I wasn't sure how to do that via gsutil...

so I should add something like:

sh.multi_indexer(*f"-v --prefix https://kghub.io/kg-monarch/kg-monarch/ -b kg-hub-public-data -r kg-monarch -x".split(" "))

after the upload step?

caufieldjh commented 3 months ago

> 20240314 was a test upload and should probably be deleted if possible

OK, will remove that one.

That call to multi_indexer looks right except for the prefix - I think it should just be https://kghub.io/kg-monarch/
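For reference, the corrected argument list might be built like this. It is only a sketch: the flags are copied verbatim from the command proposed above, and `PREFIX` and `root_index_args` are hypothetical names introduced for illustration. The block only builds and prints the arguments; it does not invoke multi_indexer.

```python
import shlex
from typing import List

# Prefix as corrected in this thread: "kg-monarch" appears once, not twice.
PREFIX = "https://kghub.io/kg-monarch/"


def root_index_args(prefix: str = PREFIX) -> List[str]:
    """Build the argument list for the root-level multi_indexer call.

    Hypothetical helper; the flags are taken as-is from the thread.
    """
    return shlex.split(
        f"-v --prefix {prefix} -b kg-hub-public-data -r kg-monarch -x"
    )


if __name__ == "__main__":
    print(root_index_args())
    # With the `sh` package installed and multi_indexer on PATH, the call
    # from the thread would then look like:
    #   sh.multi_indexer(*root_index_args())
```

Keeping the prefix in one constant makes a doubled path segment easier to spot than when it sits inside a long inline string.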

glass-ships commented 3 months ago

Oops yeah, double pasted. ok thanks! I've added that in.

I also ran the multi-indexer command locally, and it output a good amount of content. Whenever the site updates, I expect we'll see a lot of the uploads we previously hadn't (??)

caufieldjh commented 3 months ago

Yes! I'd expect to see all 25 builds, including the most recent one.

glass-ships commented 3 months ago

Awesome, thanks so much for your help on this! Hopefully merging this PR actually resolves the issue going forward

glass-ships commented 3 months ago

OH shoot i just realized i have an uncommitted index.html file as a result of that multi-indexer command.

Which makes me wanna double-check: should that second command happen after upload, or also before?

caufieldjh commented 3 months ago

The second command should happen after all the build artifacts have been uploaded, but then the index.html it outputs also needs to be uploaded. Sorry, I should have caught that.
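Putting the thread together, the ordering could be sketched roughly as below. This is an assumption-laden outline, not the actual monarch-ingest release code: `publish_release` and its four callables are hypothetical placeholders, and the real indexing and upload commands are deliberately left out.

```python
from typing import Callable, List


def publish_release(
    index_build_dir: Callable[[], None],
    upload_build_dir: Callable[[], None],
    index_project_root: Callable[[], None],
    upload_root_index: Callable[[], None],
) -> List[str]:
    """Run the four steps in the order described in this thread.

    All four callables are hypothetical stand-ins for the real commands.
    """
    steps = [
        # 1. The multi-indexer works locally, so index the build first.
        ("index build directory locally", index_build_dir),
        # 2. Uploading the full directory carries its index along.
        ("upload build directory", upload_build_dir),
        # 3. Re-index the project root once all artifacts are uploaded.
        ("index project root", index_project_root),
        # 4. The root index.html produced in step 3 must be uploaded too.
        ("upload root index.html", upload_root_index),
    ]
    completed = []
    for label, step in steps:
        step()  # an exception here prevents the later steps from running
        completed.append(label)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    publish_release(
        lambda: log.append("index build"),
        lambda: log.append("upload build"),
        lambda: log.append("index root"),
        lambda: log.append("upload root index"),
    )
    print(log)
```

Passing the steps in as callables keeps the ordering testable without touching the bucket.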

glass-ships commented 3 months ago

Sweet ok, not a problem! If you're ok with it, I tagged you for review on this PR adding that step in, I just wanna be super duper sure we get this right before the next build goes out