arwenhutt opened this issue 6 years ago
@arwenhutt Can you give me permission to the shared doc Mapping for creation of subject heading ingest file and object level record ingest file so that I can take a look? Thanks.
@lsitu done!
Thanks @arwenhutt. I am able to see it now.
@arwenhutt Could you show me an example for row #75, "Create a new row for child component for json file", in Mapping for creation of subject heading ingest file and object level record ingest file? Thanks.
@lsitu In other words, we want to add a final child component to the object, with the source metadata (the json file itself) attached as a file. We aren't mapping all of the source metadata into the DAMS, so this gives a user access to the full source metadata in case they need it.
Let me know if you would still like an example ingest file; I can do that, but this week is very busy and I'm not sure when I'll be able to get to it!
@arwenhutt I see. Thanks.
@arwenhutt Are there any identifiers in the CIL object that can be used to check whether a JSON file (object) has already been ingested or not?
For the date:creation mapping (rows #7-9) in [Mapping for creation of subject heading ingest file and object level record ingest file](https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337), are the `CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE` values in an array, or is there just one? Could you show me an example? Thanks.
Hi @lsitu - I can help with the 2nd part; I'm not clear on the 1st part of the question re: identifiers in the object.
For the dates, I believe it can be an array with 1-m values.
I did a quick grep in the GitHub repo to find some examples: `grep -iRl "DATE" ./`
Here's one: `Version8_6/DATA/CIL_PUBLIC_DATA/40685.json`
I'm not sure whether that answers your question or not.
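If it helps, here is a quick way to peek at those values in one of the files found by the grep above. The nested key path is just a guess read off the attribute name `CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE`, not a confirmed schema, so adjust it to the real JSON layout:

```ruby
require 'json'

# Inspect the DATE values in one of the example files found above.
# The dig path is an assumption based on the attribute name
# CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE, not a confirmed schema.
doc = JSON.parse(File.read('Version8_6/DATA/CIL_PUBLIC_DATA/40685.json'))
dates = doc.dig('CIL_CCDB', 'CIL', 'CORE', 'ATTRIBUTION', 'DATE')
puts Array(dates).inspect # may print one value or an array of 1-m values
```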
@lsitu I'm not sure I understand your first question either... is this for the step "Content for harvesting identified. Conditions: a. Not already harvested" from the Outline of Ongoing Harvesting processes document? https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza
@arwenhutt Yes, I was thinking about the step above. In that case, we may need an identifier in the file name. For now, since we'll download the new JSON source files from GitHub (https://github.com/slash-segmentation/CIL_Public_Data_JSON) and place them in a folder with a date like `cil_harvest_[YYYY-MM-DD]`, I think it should work to just run a nightly process that converts all of those JSON source files at one time. But moving forward, I am not sure whether there will be any exceptions.
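For reference, a minimal sketch of that dated folder layout (the three subdirectory names match what the script creates later in this thread):

```ruby
require 'fileutils'

# Create today's harvest folder with the three subdirectories the
# cil-sync scripts use: metadata_source, metadata_processed, content_files.
harvest_dir = "cil_harvest_#{Time.now.strftime('%Y-%m-%d')}"
%w[metadata_source metadata_processed content_files].each do |sub|
  FileUtils.mkdir_p(File.join(harvest_dir, sub))
end
```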
Also, the mapping for Note:description in line #43 (`CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION`) should be `CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION.free_text` (see https://github.com/slash-segmentation/CIL_Public_Data_JSON/blob/master/Version8_6/DATA/CIL_PUBLIC_DATA/10001.json#L125). Is that correct?
@lsitu yes, you are correct! Since this identifier applies to the object as a whole, I've added it to the mapping as identifier:samplenumber (line 11).
And yes, you are correct - `CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION` should be `CIL_CCDB.CIL.CORE.IMAGEDESCRIPTION.free_text`. I've edited that in the Object Level Record ingest file sheet.
Thanks!
@arwenhutt I think the work to support CIL Harvesting is done in the following related tickets:
How often do you think we should run a cron job to check for new JSON source files added to the GitHub repository https://github.com/slash-segmentation/CIL_Public_Data_JSON?
Could you be more specific about item 6, "Notification of harvest completion sent to [DOMM, JIRA, and Ho Jung]", in https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza? I think we can send out emails once the work for CIL Harvesting is done.
@lsitu Is this ticket open because you are waiting on @arwenhutt?
@gamontoya Yes. We need input from @arwenhutt. See https://github.com/ucsdlib/damsmanager/issues/289#issuecomment-450940194 above.
@lsitu re: your questions, I checked with @hjsyoo and we think: 1) cron frequency - monthly; 2) harvest notification - email is good, sent to @hjsyoo and @arwenhutt
@mcritchlow / @arwenhutt I've added PR https://github.com/ucsdlib/damsmanager/pull/304 for email notifications and PR https://github.com/ucsdlib/private_config/pull/12 for the configuration we need in damsmanager. Both PRs are ready for review now. And ticket https://github.com/ucsdlib/cil-sync/issues/2 has been created for the monthly cron job we need for CIL harvesting.
@gamontoya Can you update me on the status of this work? Is there anything for me or @arwenhutt to review, or are there some dependencies that need to be addressed first?
@hjsyoo I think @lsitu can confirm, but I think this work is complete.
@gamontoya / @hjsyoo Yes, the work is done. We just need to set up a cron job to run it: Create cron job for CIL sync. @rstanonik / @mcritchlow Are we going to set up a Kubernetes CronJob for it? Thanks.
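For illustration only: if the job ends up managed from Ruby rather than a Kubernetes CronJob, the `whenever` gem's `schedule.rb` DSL could express the monthly run agreed above. The gem choice, the 04:00 time, and the repository path are placeholders, not what the ticket specifies:

```ruby
# config/schedule.rb - hypothetical sketch using the whenever gem's raw
# cron syntax: 04:00 on the 1st of each month. The CIL_Public_Data_JSON
# checkout path is a placeholder; cil-sync may use a Kubernetes CronJob
# instead of a crontab managed this way.
every '0 4 1 * *' do
  command './git_changes.sh /path/to/CIL_Public_Data_JSON ' \
          '/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil'
end
```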
If this should run in kubernetes (or even docker), then are you going to create a docker image?
I looked at the "Create cron job" link. The argument for running the code looks like `/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil`.
If this runs as a cron job, does it always use the same argument `rdcp-0126-cil`?
@rstanonik From the documentation in 1. Outline of Ongoing Harvesting processes in the description text, I think we always pass the same path `/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil` under rdcp-staging to store the results.
Yes, I think we can create a Dockerfile when needed. Do you have an example of a Docker image that works with Kubernetes? Thanks.
Take a look at how Vivian set up ldapsync.
She creates a Docker image and pushes it to GitLab's repository. Then Docker or Kubernetes can pull it.
But maybe it should actually go under so it doesn't violate the GitLab academic license.
I'm trying to run this by hand and it fails with `No such file or directory @ rb_sysopen cil_harvest_2019-02-21/metadata_processed/json_files.txt`.
What input files does it expect to exist?
What output files will it create?
Should this be deployed first to dev or staging?
What Ruby version?
~@rstanonik I'm not sure but the outline of the harvesting process might have some of the information you need: https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit#heading=h.caf5ayev1pza~
doh, just saw @lsitu already linked to that!
(I just updated the permissions, though, in case you couldn't access it before; you should be able to now.)
I'm trying to test by hand, so my question is: does it expect input files in the path `/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil`, or does it just produce output files there? I'd like to test on an empty directory first, but maybe I need to provide some input files there too?
Can you include a Gemfile, so that I can `bundle install` to ensure I get all of the needed gems?
Thanks
Once I have it running by hand, then I'll try to create a Dockerfile to create an image, but it can run just as a script in the interim.
@rstanonik Thank you very much. I think the Ruby version doesn't matter, but it's 2.4.1 on my Mac, where I ran the tests. The script just runs Ruby to download the source files for CIL Harvesting, and here is the list of gems that are used:
```ruby
require 'open-uri'
require 'json'
require 'byebug'
require 'csv'
require 'fileutils'
```
The directory `/pub/data2/damsmanager/dams-staging/rdcp-staging/rdcp-0126-cil` needs to be created under rdcp-staging; it is used to store the output files.
How did you test?
I get this when I `bundle install`: `Could not find gem 'open-uri' in any of the gem sources listed in your Gemfile.`
Apparently, open-uri is part of Ruby, so it isn't installed as a gem: https://stackoverflow.com/questions/20544662/unable-to-bundle-install-open-uri/20544806#20544806
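Given the requires listed earlier, a minimal Gemfile sketch could be as small as this, since open-uri, json, csv, and fileutils all ship with Ruby 2.4's standard library:

```ruby
# Gemfile - minimal sketch based on the requires listed above; on Ruby
# 2.4 only byebug needs to come from RubyGems (open-uri, json, csv, and
# fileutils are standard library there).
source 'https://rubygems.org'

gem 'byebug'
```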
When I run it, it creates the directory for today:
```
cil_harvest_2019-02-21/
cil_harvest_2019-02-21/metadata_processed
cil_harvest_2019-02-21/content_files
cil_harvest_2019-02-21/metadata_source
```
But it fails with:
```
...
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/41066.json
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/2775.json
../CIL_Public_Data_JSON/Version8_6/DATA/CIL_PUBLIC_DATA/35677.json
./git_changes.sh: line 62: ./cil_harvest_2019-02-21/metadata_processed/json_files.txt: No such file or directory
/home/rstanonik/cil/cil-sync/cil_download.rb:15:in `readlines': No such file or directory @ rb_sysopen - ./cil_harvest_2019-02-21/metadata_processed/json_files.txt (Errno::ENOENT)
	from /home/rstanonik/cil/cil-sync/cil_download.rb:15:in
```
As if it expects cil_harvest_2019-02-21/metadata_processed/json_files.txt to exist.
@rstanonik Yep, some of them are included in Ruby. I just followed the steps in the README (https://github.com/ucsdlib/cil-sync). Here is the command I use to run it locally:
```
./git_changes.sh /Users/lsitu/Documents/git/CIL_Public_Data_JSON /Users/lsitu/Documents/git/cil-sync/rdcp-0126-cil
```
`/Users/lsitu/Documents/git/CIL_Public_Data_JSON`: full path to the CIL_Public_Data_JSON repository cloned with `git clone git@github.com:slash-segmentation/CIL_Public_Data_JSON.git`
`/Users/lsitu/Documents/git/cil-sync/rdcp-0126-cil`: full path to the folder that stores the output files, which should be the full path to `rdcp-staging/rdcp-0126-cil` as required by the CIL Harvesting process.
For the `No such file or directory` error you got, could you check whether the path to the local CIL_Public_Data_JSON repository is correct and accessible?
The path seems correct and accessible.
Here's how I'm testing on my machine:
```
[rstanonik@sandcrab cil]$ ls
CIL_Public_Data_JSON  cil-sync
[rstanonik@sandcrab cil]$ cd cil-sync/
[rstanonik@sandcrab cil-sync]$ ls
cil_csv.rb       cil_harvest_2019-02-21  Gemfile.lock    LICENSE  README.md
cil_download.rb  Gemfile                 git_changes.sh  log.txt
[rstanonik@sandcrab cil-sync]$ ./git_changes.sh ../CIL_Public_Data_JSON .
```
It creates cil_harvest_2019-02-21, which consists of empty subdirectories:
```
[rstanonik@sandcrab cil-sync]$ find cil_harvest_2019-02-21/ -print
cil_harvest_2019-02-21/
cil_harvest_2019-02-21/metadata_processed
cil_harvest_2019-02-21/content_files
cil_harvest_2019-02-21/metadata_source
```
When I run it with fully qualified paths, rather than relative paths, it now seems to be generating files. For example:
```
./git_changes.sh /home/rstanonik/cil/CIL_Public_Data_JSON /home/rstanonik/cil
```
It's still running and has used 191GB so far. How big do you expect the result to be? How often should this run? Nightly? Will it download the same files each time, or does it only download new files? Each time it runs, it seems to create a directory cil_harvest_YYYY-MM-DD, where YYYY-MM-DD is today's date.
@rstanonik I have no idea regarding the initial size of this big batch. Maybe @arwenhutt knows more about it?
It's expected to run once a month. And I think running it any time after 3 o'clock in the morning will be fine, since damsmanager runs nightly at 3 o'clock to pick up newly added files (https://github.com/ucsdlib/damsmanager/blob/master/src/edu/ucsd/library/xdre/utils/DAMSRoutineManager.java#L45). But just let us know if you want to adjust the schedule.
Yes, it'll only download the new files each time. And it will create a new directory like cil_harvest_YYYY-MM-DD to hold the downloaded files, where YYYY-MM-DD is today's date.
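To illustrate that "only download new files" behavior: the actual cil-sync scripts decide what is new from changes in the git repository via git_changes.sh, so the skip-if-present helper below is only a hypothetical sketch of the idea, not the real implementation:

```ruby
require 'open-uri'
require 'fileutils'

# Hypothetical helper: download a content file only if it is not already
# present in today's harvest folder. The real new-file detection in
# cil-sync is based on git changes, not filesystem checks.
def fetch_unless_present(url, dest_dir)
  dest = File.join(dest_dir, File.basename(url))
  return dest if File.exist?(dest)         # already harvested; skip
  FileUtils.mkdir_p(dest_dir)
  File.binwrite(dest, URI.parse(url).read) # open-uri adds #read to URI::HTTP
  dest
end

fetch_unless_present('https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg',
                     "cil_harvest_#{Time.now.strftime('%Y-%m-%d')}/content_files")
```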
@rstanonik I believe it's around 40TB. I'll double check with the db manager and let you know if this estimate is vastly off.
I've been running a test download to my local drive, but I'm going to stop that and start a download to rdcp-staging. After it has run a while, I'll have an estimate of how long it will take. Judging from the speed to my local disk, I'd guess 40TB will take about 20 days. The download will not run on the DAMS machines, but on a machine with a faster connection to the Isilon, although I suspect the download from the remote host will be the limiting factor. Where is the data coming from?
Do you want this to download 24/7? Or start and stop on some schedule, such as 3am to 6am?
@rstanonik Wow, okay. I believe this is a typical example location for the data: https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg
Oh, so from on-campus.
It's downloaded from https://cildata.crbs.ucsd.edu/. For example, https://cildata.crbs.ucsd.edu/media/videos/10725/10725.jpg.
My rough estimate is that data moves about 8TB/day over a 1 Gigabit/second connection, at best (1 Gbit/s is 125 MB/s, or about 10.8 TB/day at the theoretical maximum). If software is doing some processing of the data, that will slow it down. Data over an https connection will be slower. We'll see once it has run for a while.
@lsitu If something interrupts the download (network problem, machine reboot, etc) and I restart the download, it will figure out where it was and resume from there, rather than start from the beginning, right?
Descriptive summary
Write scripts for regular harvesting and transformation of content (files and metadata) from the Cell Image Library GitHub repository: https://github.com/slash-segmentation/CIL_Public_Data_JSON
Documentation
Related work