maphub / scripts

A collection of scripts to prepare seed data (metadata, maps) for the maphub portal

seed-data generation script #1

Open behas opened 12 years ago

behas commented 12 years ago

We need a script that generates a seed data file for the Library of Congress Maphub instance.

Each "map record" in the seed data file includes:

We already have the maps in place and scripts to download metadata from the LoC's GMD collection (see scripts directory). The new script has to read the map identifiers, iterate over the harvested metadata records, identify matching records (based on the map identifier), and output a maphub map record for each match.
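A minimal sketch of the matching step described above. It assumes (the thread does not confirm this) that a map's identifier is the base name of its image file and that each harvested record has already been parsed into a hash with an "identifier" key. The helper names map_identifiers and matching_records are hypothetical.

```ruby
require "set"

# Hypothetical helper: collect map identifiers from the image directory.
# Assumption: the file's base name (without extension) is the identifier.
def map_identifiers(map_dir)
  Dir.glob(File.join(map_dir, "*"))
     .select { |path| File.file?(path) }
     .map { |path| File.basename(path, ".*") }
     .to_set
end

# Hypothetical helper: keep only the records whose identifier belongs to a map.
# Assumption: each record is a hash with an "identifier" key.
def matching_records(identifiers, records)
  records.select { |record| identifiers.include?(record["identifier"]) }
end
```

Building the set of identifiers once and testing membership per record avoids rescanning the map directory for every metadata record.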

The challenging part of this script is selecting the appropriate metadata fields from the OAI-PMH records. We want only those that carry "relevant" semantics about the map. Some data cleansing steps (whitespace, special characters, etc.) might also be necessary. In the end, the metadata needs to be indexed by Apache Solr / Lucene.
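The selection and cleansing steps could be sketched roughly as follows. The field names in RELEVANT_FIELDS are illustrative assumptions, not the list chosen for maphub, and the thread does not say which characters count as "special", so this sketch only replaces control characters and collapses whitespace. The names clean_value and select_fields are hypothetical.

```ruby
# Illustrative assumption: which fields are "relevant" is not decided in this thread.
RELEVANT_FIELDS = %w[title subject date].freeze

# Replace control characters with spaces, collapse runs of whitespace, trim the ends.
def clean_value(value)
  value.to_s
       .gsub(/[[:cntrl:]]/, " ")
       .gsub(/\s+/, " ")
       .strip
end

# Keep only the wanted fields, clean their values, and drop fields left empty.
def select_fields(record, fields = RELEVANT_FIELDS)
  record.select { |name, _| fields.include?(name) }
        .map { |name, value| [name, clean_value(value)] }
        .reject { |_, value| value.empty? }
        .to_h
end
```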

The result should be a script, generate-loc-seeddata, which takes a directory of map image files, a directory of XML files (the metadata records), and a set of identifiers (probably a TXT file) as input and generates an output file, loc-seeddata.yaml.

Possible execution:

generate-loc-seeddata maps/ metadata/*.xml

To process only 10 maps:

generate-loc-seeddata -n 10 maps/ metadata/*.xml
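As a rough illustration of the output step, the sketch below writes a list of map records to a YAML file and honors an optional limit like the -n option above. The record keys ("identifier", "title") are assumptions, since the actual field list is not given in this thread, and write_seed_data is a hypothetical name.

```ruby
require "yaml"

# Write at most `limit` records (all of them when limit is nil) to a YAML file.
# Returns the number of records written.
def write_seed_data(records, path, limit = nil)
  selected = limit ? records.first(limit) : records
  File.write(path, selected.to_yaml)
  selected.length
end
```

String keys are used in the records so that the file reads back as plain hashes without any symbol handling.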

shionguha commented 12 years ago

Working on harvesting DC metadata using the fetch-metadata script, and on tag analysis, right now.

shionguha commented 12 years ago

Possible issue with ruby on the server (aka am I missing something?). As I try to work with the ruby interpreter, it gives me the following message:

The program 'ruby' can be found in the following packages:

Currently, I am working on my machine and then sftp-ing to the server.

behas commented 12 years ago

RVM (https://rvm.beginrescueend.com/) is now installed and the problem is fixed:

maphub@samos:~$ ruby -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-linux]

In general, I recommend developing on your local machine (using the same Ruby version) and using the server only for running longer jobs.

Best, Bernhard


Reply to this email directly or view it on GitHub: https://github.com/maphub/maphub-seeddata/issues/1#issuecomment-4383943


Bernhard Haslhofer Postdoc Associate Cornell Information Science 301 College Ave. Ithaca, NY 14850 WWW: http://www.cs.cornell.edu/~bh392/ Skype: bernhard.haslhofer

shionguha commented 12 years ago

Basic script has been uploaded along with a screenshot showing terminal run and the sample yaml output file. Waiting for bug fixing + server running...

shionguha commented 12 years ago

The script is run as follows:

ruby generate-loc-seeddata.rb -i mapdir -m metadatadir -n numofsamples

For example: ruby generate-loc-seeddata.rb -i maps -m mods_metadata_8800 -n 170
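For reference, options of this shape can be parsed with Ruby's standard OptionParser. This is only a sketch and not the code of the uploaded script: the option keys, the parse_options name, and the choice to treat -n as optional are assumptions.

```ruby
require "optparse"

# Parse -i MAPDIR, -m METADATADIR and an optional -n NUMOFSAMPLES.
# Raises ArgumentError when a required directory option is missing.
def parse_options(argv)
  options = {}
  OptionParser.new do |opts|
    opts.banner = "Usage: ruby generate-loc-seeddata.rb -i MAPDIR -m METADATADIR [-n NUMOFSAMPLES]"
    opts.on("-i MAPDIR", "Directory of map image files") { |v| options[:map_dir] = v }
    opts.on("-m METADATADIR", "Directory of metadata records") { |v| options[:metadata_dir] = v }
    opts.on("-n NUMOFSAMPLES", Integer, "Number of maps to process") { |v| options[:samples] = v }
  end.parse(argv)

  missing = [:map_dir, :metadata_dir].reject { |key| options.key?(key) }
  raise ArgumentError, "missing options: #{missing.join(', ')}" unless missing.empty?

  options
end
```

In a script, this would be called as parse_options(ARGV).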

behas commented 12 years ago

Cool, thank you! I will have a look at it later today and get back to you.


behas commented 12 years ago

Hi Shion,

The code looks good. There is quite a bit of IO overhead, but on Friday I will give you some tips on how to fix this and speed up the script.

When I tried to execute the script on the server I had the following problem:

maphub@samos:~/maphub-seeddata/scripts$ ruby generate-loc-seeddata.rb -i ../../data/maps/ -m ../../data/metadata/ -n 5
Finding all the map files...
Finding all the metadata files ...
Creating output YAML File
generate-loc-seeddata.rb:177:in `<main>': undefined method `parent' for nil:NilClass (NoMethodError)

I will add this to the issues list, just to track progress.

It would also be great if you could summarize the usage of the three scripts in the README file. Please keep it short, just explain what the scripts do and how to use them.

Regarding GitHub: avoid checking in non-source files (output, screenshots, etc.). GitHub is really just for code…

Thanks again and talk to you on Friday,

Bernhard


shionguha commented 12 years ago

Thanks. I just updated the README. Is this fine or were you thinking about something else?

behas commented 12 years ago

I made some changes to the README file and added some comments to the script code...

