hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Provide open data dump of Alma MARC-XML data #1741

Open dr0i opened 1 year ago

dr0i commented 1 year ago

We cannot simply expose the Alma dump, because it contains local (IZ) fields that we have to suppress. Since https://github.com/hbz/lobid-resources/issues/1687 these fields have been suppressed, so that lobid-resources contains only Open Data. To provide an Open Data MARC-XML dump, we have to filter these fields out of the MARC-XML and provide the result again as bgzf MARC-XML file, analogous to https://lobid.org/download/dumps/DE-605/mabxml/, under https://lobid.org/download/dumps/DE-605/marcxml/ .
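A minimal sketch of the filtering step, not the project's actual transformation: it streams records, drops datafields whose tag is in a deny list, and writes the result as a single gzip file. The names `LOCAL_TAGS`, `filter_records`, and `write_dump` are hypothetical, the tags in `LOCAL_TAGS` are placeholders rather than the real list of suppressed IZ fields, and the input is assumed to use the MARC21 slim namespace.

```python
import gzip
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # assumption: input uses this namespace
ET.register_namespace("", MARC_NS)          # serialise without an "ns0:" prefix

# Placeholder tags only; the real list of local (IZ) fields is not given here.
LOCAL_TAGS = {"LOC", "ITM"}


def filter_records(source, local_tags=LOCAL_TAGS):
    """Yield each record as an XML string with local datafields removed."""
    record_tag = f"{{{MARC_NS}}}record"
    field_tag = f"{{{MARC_NS}}}datafield"
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        for field in list(elem.findall(field_tag)):
            if field.get("tag") in local_tags:
                elem.remove(field)
        yield ET.tostring(elem, encoding="unicode")
        elem.clear()  # release the record's children to keep memory use low


def write_dump(source, target_path, local_tags=LOCAL_TAGS):
    """Write the filtered records as one gzip-compressed MARC-XML collection."""
    with gzip.open(target_path, "wt", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(f'<collection xmlns="{MARC_NS}">\n')
        for record in filter_records(source, local_tags):
            out.write(record.strip() + "\n")
        out.write("</collection>\n")
```

`source` may be a file path or a binary file object, so the same function could read an uncompressed dump or a stream opened with `gzip.open(path, "rb")`.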

See also https://github.com/hbz/lobid-resources/issues/1316.

blackwinter commented 1 year ago

> provide the result again as bgzf MARC-XML file

Just for the record: BGZF was chosen to allow for random access to individual records (in combination with an index file); if you don't intend to support this use case, you might want to choose a more common compression format.
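To illustrate the random-access idea, here is a simplified sketch that is not real BGZF: each record is written as its own gzip member and the byte offsets serve as the index. Actual BGZF uses size-limited blocks with an extra header field and virtual offsets; the functions `write_indexed` and `read_record` are hypothetical names for this illustration only.

```python
import gzip
import zlib


def write_indexed(records, path):
    """Write each record as a separate gzip member; return the member offsets."""
    offsets = []
    with open(path, "wb") as out:
        for record in records:
            offsets.append(out.tell())
            out.write(gzip.compress(record))
    return offsets


def read_record(path, offset):
    """Decompress only the gzip member that starts at the given byte offset."""
    decompressor = zlib.decompressobj(wbits=31)  # 31: expect gzip header and trailer
    chunks = []
    with open(path, "rb") as src:
        src.seek(offset)
        while not decompressor.eof:
            block = src.read(64 * 1024)
            if not block:
                break
            chunks.append(decompressor.decompress(block))
    return b"".join(chunks)
```

A file written this way is still a valid multi-member gzip stream, so ordinary gzip tools can decompress it as a whole; only the index-based lookup needs the offsets.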

dr0i commented 1 year ago

You are right. We should measure the running time for creating the file and the resulting file size, and then decide on a format, e.g. tar.gz (as with the MAB-XML dump), tar.bz2, or tar.xz. The latter is probably the best choice. Or what would you suggest, @blackwinter?
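Such a comparison could be sketched roughly as follows, using only the compressors in the Python standard library with their default settings. The function name `compare_codecs` is hypothetical, and for a real dump one would stream the file rather than hold it in memory.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data):
    """Return {codec name: (compressed size in bytes, seconds taken)}."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "xz": lzma.compress,  # default container format is .xz
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results
```

Running it on a representative sample of the MARC-XML, for example `compare_codecs(open("sample.xml", "rb").read())` with a hypothetical `sample.xml`, gives a first impression of the size and time trade-off before committing to a format.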

blackwinter commented 1 year ago

As this is a single file, I wouldn't create a tar archive. I would probably just go with gzip or bzip2 due to ubiquity.