sul-dlss-labs / ld4p-tracer-bullet-scripts

Symphony scripts that use and run the sul-dlss/ld4p-tracer-bullets

Check identifiers in output .xml and .rdf file names #21

Open dazza-codes opened 7 years ago

dazza-codes commented 7 years ago

For example, most identifiers beginning with "it" have a space in them, which is translated to _ in the file name, but the one for it15112659 does not:

2017-02-06T15:14:13-08:00  CONVERTED MARC-RDF file: /data/src/dlss/ld4l/ld4p_rdf/MarcRDF/it_15112640.rdf
2017-02-06T15:14:16-08:00  CONVERTED MARC-RDF file: /data/src/dlss/ld4l/ld4p_rdf/MarcRDF/it15112659.rdf
2017-02-06T15:14:18-08:00  CONVERTED MARC-RDF file: /data/src/dlss/ld4l/ld4p_rdf/MarcRDF/it_15112691.rdf

dlrueda commented 7 years ago

The history on this is that the Casalini identifiers used to be in the form "it xxxxxx" and now they are "itxxxxx" without the space.

So, yes, you'll see both forms. Does the converter have to translate the space to a _? If that's our only option, and if it makes it harder to get back to the Symphony record, then we should probably (some day) clean up the Symphony records and get rid of the space.

dazza-codes commented 7 years ago

In general, a space in a Unix file name is awkward to work with because it always needs to be escaped or quoted. At present, the conversion from MARC to XML writes out one file per record, and the 001 field was chosen as a way to produce unique file names. Given that choice and the Unix difficulty with spaces in file names, the only modification made to the 001 field is replacing a space with an underscore.
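
To make the naming rule concrete, here is a minimal Python sketch of it. This is an illustration only, not the converter's actual code, which is not shown in this thread. The helper names `record_filename` and `record_id_from_filename` are hypothetical, and the reverse mapping assumes the original 001 value contains no underscores of its own.

```python
from pathlib import Path


def record_filename(control_number: str, extension: str = ".xml") -> str:
    """Build an output file name from a MARC 001 value (hypothetical helper).

    Only spaces are replaced, mirroring the rule described above.
    """
    return control_number.replace(" ", "_") + extension


def record_id_from_filename(filename: str) -> str:
    """Recover the 001 value from a file name (hypothetical helper).

    Assumes the original identifier contained no underscores of its own.
    """
    return Path(filename).stem.replace("_", " ")


print(record_filename("it 15112640", ".rdf"))      # it_15112640.rdf
print(record_filename("it15112659", ".rdf"))       # it15112659.rdf
print(record_id_from_filename("it_15112691.rdf"))  # it 15112691
```

Under that assumption, both forms of the identifier ("it xxxxxx" and "itxxxxx") map back to the 001 value they came from, so the underscore alone should not prevent a lookup of the Symphony record.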

dazza-codes commented 7 years ago

Another slight anomaly: there is a record identifier with an x at the end, from the casalini0.mrc file, i.e.

/ld4p_data/Dataload/LD4P/MarcXML/it_9932055x.xml

There might be other identifier anomalies.
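
One way to look for further anomalies is to compare each output file name against an expected pattern. The Python sketch below is only an illustration: the pattern ("it", an optional underscore, then digits) is an assumption drawn from the examples in this thread, and `find_anomalies` is a hypothetical helper. Identifiers from other sources would need their own patterns.

```python
import re
from pathlib import Path

# Assumed pattern, based only on the examples in this thread:
# "it", an optional underscore, then one or more digits.
EXPECTED_STEM = re.compile(r"it_?\d+")


def find_anomalies(filenames):
    """Return the file names whose stem does not match EXPECTED_STEM."""
    return [name for name in filenames
            if not EXPECTED_STEM.fullmatch(Path(name).stem)]


names = ["it_15112640.rdf", "it15112659.rdf", "it_9932055x.xml"]
print(find_anomalies(names))  # ['it_9932055x.xml']
```

Both the spaced and unspaced forms pass, since both are expected; only the identifier with the trailing x is reported.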