
CIL harvesting and metadata transformation processes #289

Open arwenhutt opened 6 years ago

arwenhutt commented 6 years ago

Descriptive summary

Write scripts for regular harvesting and transformation of content (files and metadata) from the Cell Image Library GitHub repository: https://github.com/slash-segmentation/CIL_Public_Data_JSON
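As a rough sketch of the harvesting half of that task, something like the Ruby below could list records added to the repository since the previous run (the repo path, cutoff date, and use of git log are illustrative assumptions, not the project's actual script):

```ruby
# Sketch: list JSON records added to the CIL repo since the last harvest.
# Assumes a local clone of CIL_Public_Data_JSON already exists.
repo  = 'CIL_Public_Data_JSON'   # path to the local clone (hypothetical)
since = '2019-01-01'             # date of the previous harvest (hypothetical)

system('git', '-C', repo, 'pull')  # bring the clone up to date

# Files added after the cutoff; --diff-filter=A keeps only additions.
log = `git -C #{repo} log --since=#{since} --diff-filter=A --name-only --pretty=format: -- '*.json'`
new_files = log.split("\n").reject(&:empty?).uniq
puts new_files
```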

Documentation

  1. Outline of Ongoing Harvesting processes https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza
  2. Mapping for creation of subject heading ingest file and object level record ingest file: https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337

Related work

lsitu commented 5 years ago

@arwenhutt Can you give me permission to the shared doc Mapping for creation of subject heading ingest file and object level record ingest file so that I can take a look? Thanks.

arwenhutt commented 5 years ago

@lsitu done!

lsitu commented 5 years ago

Thanks @arwenhutt. I am able to see it now.

lsitu commented 5 years ago

@arwenhutt Could you show me an example for row #75, "Create a new row for child component for json file," in Mapping for creation of subject heading ingest file and object level record ingest file? Thanks.

arwenhutt commented 5 years ago

@lsitu In other words, we want to add a final child component to the object with the source metadata (the JSON file itself) attached as a file. We aren't mapping all of the source metadata into the DAMS, so this gives a user access to the full source metadata in case they need it.

Let me know if you would still like an example ingest file. I can do that, but this week is very busy and I'm not sure when I'll be able to get to it!

lsitu commented 5 years ago

@arwenhutt I see. Thanks.

lsitu commented 5 years ago

@arwenhutt Are there any identifiers in the CIL object that can be used to check whether a JSON file (object) has been ingested or not? For the date:creation mapping (rows #7-9) in [Mapping for creation of subject heading ingest file and object level record ingest file](https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337), are the three CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE values in an array, or is there just one? Could you show me an example? Thanks.

mcritchlow commented 5 years ago

Hi @lsitu - I can help with the 2nd part; I'm not clear on the 1st part of the question re: identifiers in the object.

For the dates, I believe it can be an array with one to many values.

I did a quick grep in the GitHub repo, `grep -iRl "DATE" ./`, to find some examples.

Here's one: Version8_6/DATA/CIL_PUBLIC_DATA/40685.json

I'm not sure whether that answers your question or not.
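As a rough illustration, the field could be read like this in Ruby (the exact key path is an assumption based on the mapping sheet's field name; verify it against a real record such as 40685.json):

```ruby
require 'json'

# Parse one CIL record, e.g. Version8_6/DATA/CIL_PUBLIC_DATA/40685.json
record = JSON.parse(File.read('40685.json'))

# ATTRIBUTION.DATE can hold one to many values; Array() normalizes
# a single scalar and an array to the same shape.
dates = Array(record.dig('CIL_CCDB', 'CIL', 'CORE', 'ATTRIBUTION', 'DATE'))
dates.each { |d| puts d }
```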

arwenhutt commented 5 years ago

@lsitu I'm not sure I understand your first question either... is this for step

  1. Content for harvesting identified. Conditions: a. Not already harvested

from the Outline of Ongoing Harvesting processes document? https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza

lsitu commented 5 years ago

@arwenhutt Yes, I was thinking about the step above. In this case, we may need an identifier tied to the file name. For now, since we'll download the new JSON source files from GitHub https://github.com/slash-segmentation/CIL_Public_Data_JSON and place them in a folder with a date like cil_harvest_[YYYY-MM-DD], I think it should work to just run a nightly process that converts all of those JSON source files at once. But moving forward, I am not sure whether there will be any exceptions.

Also, the mapping for Note:description in line #43 (CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION) should be CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION.free_text (see https://github.com/slash-segmentation/CIL_Public_Data_JSON/blob/master/Version8_6/DATA/CIL_PUBLIC_DATA/10001.json#L125). Is that correct?

arwenhutt commented 5 years ago

@lsitu Yes, you are correct! Since this identifier applies to the object as a whole, I've added it to the mapping as identifier:samplenumber (line 11).

And yes, you are correct - CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION should be CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION.free_text. I've edited that in the Object Level Record ingest file sheet. Thanks!
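For reference, a minimal sketch of the corrected mapping path, with the keys taken from the 10001.json example linked above:

```ruby
require 'json'

record = JSON.parse(File.read('10001.json'))

# Note:description maps to the nested free_text key,
# not to IMAGEDESCRIPTION itself (which is an object).
description = record.dig('CIL_CCDB', 'CIL', 'CORE', 'IMAGEDESCRIPTION', 'free_text')
puts description
```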

lsitu commented 5 years ago

@arwenhutt I think the work to support CIL Harvesting is done in the following related tickets:

#291

#292

#297

How often do you think we should run a cron job to check for new JSON source files added to the GitHub repository https://github.com/slash-segmentation/CIL_Public_Data_JSON? Could you be more specific about item 6, Notification of harvest completion sent to [DOMM, JIRA, and Ho Jung], in https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza? I think we can send out emails once the work for CIL Harvesting is done.

gamontoya commented 5 years ago

@lsitu Is this ticket open because you are waiting on @arwenhutt?

lsitu commented 5 years ago

@gamontoya Yes. We need input from @arwenhutt. See https://github.com/ucsdlib/damsmanager/issues/289#issuecomment-450940194 above.

arwenhutt commented 5 years ago

@lsitu Re: your questions, I checked with @hjsyoo and we think: 1) cron frequency: monthly; 2) harvest notification: email is good, sent to @hjsyoo and @arwenhutt.
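For illustration, a monthly schedule like that could be expressed as a crontab entry along these lines (the script path, arguments, and start time are placeholders; see the Kubernetes CronJob discussion later in this thread for the alternative):

```
# Hypothetical crontab entry: run the harvest at 04:00 on the 1st of each month
0 4 1 * * /path/to/cil-sync/git_changes.sh /path/to/CIL_Public_Data_JSON /pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil
```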

lsitu commented 5 years ago

@mcritchlow / @arwenhutt I've added PR https://github.com/ucsdlib/damsmanager/pull/304 for email notification and PR https://github.com/ucsdlib/private_config/pull/12 for the configuration we need in damsmanager. Both PRs are ready for review now. And ticket https://github.com/ucsdlib/cil-sync/issues/2 has been created for the monthly cron job we need for CIL harvesting.

hjsyoo commented 5 years ago

@gamontoya Can you update me on the status of this work? Is there anything for me or @arwenhutt to review, or are there some dependencies that need to be addressed first?

gamontoya commented 5 years ago

@hjsyoo @lsitu can confirm, but I believe this work is complete.

lsitu commented 5 years ago

@gamontoya / @hjsyoo Yes, the work is done. We just need to set up a cron job to run it: Create cron job for CIL sync. @rstanonik / @mcritchlow Are we going to set up a Kubernetes CronJob for it? Thanks.

rstanonik commented 5 years ago

If this should run in Kubernetes (or even Docker), are you going to create a Docker image?

I looked at the "Create cron job" link. The argument for running the code looks like

/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil

If this runs as a cron job, does it always use the same argument "rdcp-0126-cil"?

lsitu commented 5 years ago

@rstanonik From the documentation in 1. Outline of Ongoing Harvesting processes in the description text, I think we always pass the same path /pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil, under rdcp-staging, to store the results. Yes, I think we can create a Dockerfile when needed. Do you have an example of a Docker image that works with Kubernetes? Thanks.

rstanonik commented 5 years ago

Take a look at how Vivian set up ldapsync

https://gitlab.com/ucsd-gitlab/library/ldap_sync

rstanonik commented 5 years ago

She creates a Docker image and pushes it to GitLab's container registry. Then Docker or Kubernetes can pull it.
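In outline, that workflow is just a build and a push (the registry path here is a made-up example):

```
docker build -t registry.gitlab.com/<group>/cil-sync:latest .
docker push registry.gitlab.com/<group>/cil-sync:latest
```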

rstanonik commented 5 years ago

But maybe it should actually go under

https://gitlab.com/surfliner

so it doesn't violate the GitLab academic license.

rstanonik commented 5 years ago

I'm trying to run this by hand and it fails because "No such file or directory @ rb_sysopen cil_harvest_2019-02-21/metadata_processed/json_files.txt".

What input files does it expect to exist?

What output files will it create?

Should this be deployed first to dev or staging?

rstanonik commented 5 years ago

What Ruby version?

arwenhutt commented 5 years ago

~@rstanonik I'm not sure, but the outline of the harvesting process might have some of the information you need: https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza~

Doh, just saw @lsitu already linked to that!
(I just updated the permissions, though, so if you couldn't access it before, you should be able to now.)

rstanonik commented 5 years ago

I'm trying to test by hand, so my question is: does it expect input files in the path /pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil, or does it just produce output files there? I'd like to test on an empty directory first, but maybe I need to provide some input files there too?

Can you include a Gemfile, so that I can bundle install to ensure I get all of the needed gems?

Thanks

rstanonik commented 5 years ago

Once I have it running by hand, I'll try to create a Dockerfile to build an image, but it can run just as a script in the interim.

lsitu commented 5 years ago

@rstanonik Thank you very much. I think the Ruby version doesn't matter, but it's 2.4.1 on the Mac where I ran the tests. The script just runs a Ruby command to download the source files for CIL Harvesting, and here is the list of libraries it requires: `open-uri`, `json`, `byebug`, `csv`, `fileutils`.

The directory /pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil needs to be created under rdcp-staging; it is used to store the output files.

rstanonik commented 5 years ago

How did you test?

rstanonik commented 5 years ago

I get this when I bundle install: Could not find gem 'open-uri' in any of the gem sources listed in your Gemfile.

rstanonik commented 5 years ago

Apparently, open-uri is part of Ruby, so it isn't installed as a gem. https://stackoverflow.com/questions/20544662/unable-to-bundle-install-open-uri/20544806#20544806
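So a minimal Gemfile for this script would only need to declare the non-stdlib gem, something like:

```ruby
# Gemfile (sketch): open-uri, json, csv, and fileutils ship with Ruby,
# so byebug is the only gem that needs to be declared.
source 'https://rubygems.org'

gem 'byebug'
```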

rstanonik commented 5 years ago

When I run it, it creates the directory for today:

```
cil_harvest_2019-02-21/
cil_harvest_2019-02-21/metadata_processed
cil_harvest_2019-02-21/content_files
cil_harvest_2019-02-21/metadata_source
```

But it fails with:

```
...
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/41066.json
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/2775.json
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/35677.json
./git_changes.sh: line 62: ./cil_harvest_2019-02-21/metadata_processed/json_files.txt: No such file or directory
/home/rstanonik/cil/cil-sync/cil_download.rb:15:in `readlines': No such file or directory @ rb_sysopen - ./cil_harvest_2019-02-21/metadata_processed/json_files.txt (Errno::ENOENT)
	from /home/rstanonik/cil/cil-sync/cil_download.rb:15:in `<main>'
```

As if it expects cil_harvest_2019-02-21/metadata_processed/json_files.txt to exist.

lsitu commented 5 years ago

@rstanonik Yep, Ruby includes some of them. I just followed the steps in the README (https://github.com/ucsdlib/cil-sync). Here is the command I use to run it locally:

```
./git_changes.sh /Users/lsitu/Documents/git/CIL_Public_Data_JSON /Users/lsitu/Documents/git/cil-sync/rdcp-0126-cil
```

/Users/lsitu/Documents/git/CIL_Public_Data_JSON: full path to the CIL_Public_Data_JSON repository, cloned with git clone git@github.com:slash-segmentation/CIL_Public_Data_JSON.git

/Users/lsitu/Documents/git/cil-sync/rdcp-0126-cil: full path to the folder that stores the output files, which should be the full path to rdcp-staging/rdcp-0126-cil as required by the CIL Harvesting process.

lsitu commented 5 years ago

For the No such file or directory error you got, could you check whether the path to the local CIL_Public_Data_JSON repository is correct and accessible?

rstanonik commented 5 years ago

The path seems correct and accessible.

Here's how I'm testing on my machine:

```
[rstanonik@sandcrab cil]$ ls
CIL_Public_Data_JSON  cil-sync
[rstanonik@sandcrab cil]$ cd cil-sync/
[rstanonik@sandcrab cil-sync]$ ls
cil_csv.rb  cil_harvest_2019-02-21  Gemfile.lock  LICENSE  README.md
cil_download.rb  Gemfile  git_changes.sh  log.txt
[rstanonik@sandcrab cil-sync]$ ./git_changes.sh ../CIL_Public_Data_JSON .
```

rstanonik commented 5 years ago

It creates cil_harvest_2019-02-21

rstanonik commented 5 years ago

Which consists of empty subdirectories:

```
[rstanonik@sandcrab cil-sync]$ find cil_harvest_2019-02-21/ -print
cil_harvest_2019-02-21/
cil_harvest_2019-02-21/metadata_processed
cil_harvest_2019-02-21/content_files
cil_harvest_2019-02-21/metadata_source
```

rstanonik commented 5 years ago

When I run it with fully qualified paths, rather than relative paths, it now seems to be generating files.

For example:

```
./git_changes.sh /home/rstanonik/cil/CIL_Public_Data_JSON /home/rstanonik/cil
```

rstanonik commented 5 years ago

It's still running and has used 191GB so far. How big do you expect the result to be? How often should this run? Nightly? Will it download the same files each time, or does it only download new files? Each time it runs, it seems to create a directory cil_harvest_YYYY-MM-DD, where YYYY-MM-DD is today's date.

lsitu commented 5 years ago

@rstanonik I have no idea regarding the initial size of this big batch. Maybe @arwenhutt knows more about it?

It's expected to run once a month. And I think running it any time after 3 o'clock in the morning will be fine, since damsmanager runs nightly at 3 o'clock to pick up newly added files (https://github.com/ucsdlib/damsmanager/blob/master/src/edu/ucsd/library/xdre/utils/DAMSRoutineManager.java#L45). But just let us know if you want to adjust the schedule.

Yes, it'll only download the new files each time. And it will create a new directory like cil_harvest_YYYY-MM-DD, where YYYY-MM-DD is today's date, to hold the downloaded files.
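A minimal sketch of that incremental behavior (an illustration of the idea only, not the actual cil_download.rb; the fetch helper is hypothetical):

```ruby
require 'fileutils'
require 'open-uri'

# Date-stamped harvest directory, e.g. cil_harvest_2019-02-21
harvest_dir = "cil_harvest_#{Time.now.strftime('%Y-%m-%d')}"
%w[metadata_source metadata_processed content_files].each do |sub|
  FileUtils.mkdir_p(File.join(harvest_dir, sub))
end

# Download a content file only if it isn't already on disk,
# so a re-run fetches new files instead of everything again.
def fetch(url, dest)
  return if File.exist?(dest)
  URI.parse(url).open { |remote| File.binwrite(dest, remote.read) }
end

fetch('https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg',
      File.join(harvest_dir, 'content_files', '10725.jpg'))
```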

hjsyoo commented 5 years ago

@rstanonik I believe it's around 40TB. I'll double check with the db manager and let you know if this estimate is vastly off.

rstanonik commented 5 years ago

I've been running a test download to my local drive, but I'm going to stop that and start a download to rdcp-staging. After it has run a while, I'll have an estimate of how long it will take. Judging from the speed to my local disk, I'd guess 40TB will take about 20 days. The download will not run on the DAMS machines, but on a machine with a faster connection to the Isilon, although I suspect the download from the remote host will be the limiting factor. Where is the data coming from?

rstanonik commented 5 years ago

Do you want this to download 24/7? Or to start and stop on some schedule such as 3am to 6am?

hjsyoo commented 5 years ago

@rstanonik Wow, okay. I believe this is a typical example location for the data: https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg

rstanonik commented 5 years ago

Oh, so from on-campus.

lsitu commented 5 years ago

It's downloaded from https://cildata.crbs.ucsd.edu/. For example https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg.

rstanonik commented 5 years ago

My rough estimate is that data moves about 8TB/day over a 1 gigabit/second connection, at best. If software is doing some processing of the data, that will slow it down. Data over an HTTPS connection will be slower. We'll see once it has run for a while.
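(Sanity check on that figure: 1 Gbit/s is 125 MB/s at line rate, or about 10.8 TB/day over 86,400 seconds, so ~8 TB/day leaves roughly 25% for protocol, TLS, and processing overhead. At that effective rate, 40 TB would take at least 5 days, and longer in practice.)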

rstanonik commented 5 years ago

@lsitu If something interrupts the download (network problem, machine reboot, etc.) and I restart it, it will figure out where it was and resume from there, rather than starting from the beginning, right?