solrmarc / stanford-solr-marc

This git repository has moved to https://github.com/sul-dlss/solrmarc-sw. No further commits will be made here.

MarcXmlParser XMLReader parse error when converting from MARCXML back to MARC21 #1

Open gypsyjoe opened 11 years ago

gypsyjoe commented 11 years ago

I am attempting to use marc4j to convert a MARCXML file back to MARC21 binary, which I had previously converted from MARC21 to MARCXML using marc4j. I made one update to some of the records in the MARCXML to add a single tag element for MARC tag 088 with a value of "OSTI-ID=#######" where the #'s are individual numeric digits. After making this update and then attempting to convert back to MARC21, I get a snag in the SAXParser that throws a NullPointerException. It breaks on a particular record.

I've attempted to work around this by loading the records into a DOM, serializing each record node to a string, wrapping that string's bytes in a ByteArrayInputStream, and passing the resulting single-record InputStream to the MarcXmlReader object. But I get the following error for this record's metadata.

Exception getting thrown: MarcXmlParser run() MarcException: Unable to parse input

XML Record causing the blow up:

<record>
<leader>02440nam  22002415  4500</leader>
<controlfield tag="001">382070</controlfield>
<controlfield tag="005">20120809064100.0</controlfield>
<controlfield tag="008">120605s2011    nmu      t    000 0 eng  </controlfield>
<datafield ind1=" " ind2=" " tag="027">
  <subfield code="a">SAND2011-8282</subfield>
</datafield>
<datafield tag="088">
  <subfield code="a">OSTI_ID=1095410</subfield>
</datafield>
<datafield ind1="0" ind2="0" tag="110">
  <subfield code="a">Sandia National Laboratories,</subfield>
  <subfield code="c">Livermore, CA</subfield>
</datafield>
<datafield ind1="1" ind2="0" tag="245">
  <subfield code="a">Including shielding effects in application of the TPCA method for detection of embedded radiation sources.</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="260">
  <subfield code="c">December 2011.</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="300">
  <subfield code="a">32 p.</subfield>
</datafield>
<datafield ind1="0" ind2=" " tag="355">
  <subfield code="a">Unclassified</subfield>
  <subfield code="b">Unlimited Release</subfield>
</datafield>
<datafield ind1="0" ind2=" " tag="520">
  <subfield code="a">Conventional full spectrum gamma spectroscopic analysis has the objective of quantitative identification of all the radionuclides present in a measurement. For low-energy resolution detectors such as NaI, when photopeaks alone are not sufficient for complete isotopic identification, such analysis requires template spectra for all the radionuclides present in the measurement. When many radionuclides are present it is difficult to make the correct identification and this process often requires many attempts to obtain a statistically valid solution by highly skilled spectroscopists. A previous report investigated using the targeted principal component analysis method (TPCA) for detection of embedded sources for RPM applications. This method uses spatial/temporal information from multiple spectral measurements to test the hypothesis of the presence of a target spectrum of interest in these measurements without the need to identify all the other radionuclides present. The previous analysis showed that the TPCA method has significant potential for automated detection of target radionuclides of interest, but did not include the effects of shielding. This report complements the previous analysis by including the effects of spectral distortion due to shielding effects for the same problem of detection of embedded sources. Two examples, one with one target radionuclide and the other with two, show that the TPCA method can successfully detect shielded targets in the presence of many other radionuclides. The shielding parameters are determined as part of the optimization process using interpolation of library spectra that are defined on a 2D grid of atomic numbers and areal densities.</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="536">
  <subfield code="a">USDOE</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="541">
  <subfield code="c">Available from NTIS.</subfield>
</datafield>
<datafield ind1="0" ind2=" " tag="700">
  <subfield code="a">Shokair, Isaac R.</subfield>
</datafield>
<datafield ind1="0" ind2=" " tag="700">
  <subfield code="a">Johnson, William C.</subfield>
</datafield>
<datafield ind1="4" ind2="1" tag="856">
  <subfield code="u">http://prod.sandia.gov/sand_doc/2011/118282.pdf</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="919">
  <subfield code="a">SAND Report</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="919">
  <subfield code="a">Electronic Resource</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="995">
  <subfield code="a">Public</subfield>
</datafield>
</record>

MarcXmlParser run() MarcException: Unable to parse input

I would greatly appreciate it if someone could help me figure out why this record XML is flipping out the MarcXmlParser.parse function. It seems to be blowing up when the SAXParserFactory XMLReader attempts to parse the record. I'm even passing the node string through a normalizer like this to make sure it's valid ASCII text.

szxmlnode = Normalizer.normalize(szxmlnode, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
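For reference, here is a minimal, self-contained sketch of what that normalization step does, using only the JDK. The class and method names (`AsciiFold`, `toAscii`) are made up for illustration. It shows that the step only changes characters: accented letters are decomposed and the non-ASCII remainder is dropped, while XML that is already pure ASCII comes back unchanged. So it cannot add or repair anything structural in a record.

```java
import java.text.Normalizer;

// Hypothetical helper name; not part of marc4j.
class AsciiFold {
    // Decompose accented characters (NFD), then drop every non-ASCII character,
    // including the combining marks produced by the decomposition.
    static String toAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) becomes a plain "e".
        System.out.println(toAscii("caf\u00e9")); // prints "cafe"

        // ASCII-only markup is returned exactly as it went in.
        String xml = "<datafield tag=\"088\"><subfield code=\"a\">OSTI_ID=1095410</subfield></datafield>";
        System.out.println(toAscii(xml).equals(xml)); // prints "true"
    }
}
```

Note that in Java source the regular expression needs a doubled backslash (`\\p{ASCII}`); a single backslash before `p` is not a valid string escape.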

Joe Justice Sandia National Laboratories Albuquerque, New Mexico

gypsyjoe commented 11 years ago

Ugh! The XML in my text looks like crap on this page. If whoever is available to help with this will email me (jhjusti@sandia.gov), I'll email you the XML causing the blow up. (I've been burning weeks on this problem and it continues to stymie me.)

gypsyjoe commented 11 years ago

Actually, if you can see the saved text by selecting to edit this issue, it seems to have saved the XML text. But if it will help, I'm happy to email it to whoever attempts to analyze it.

Joe Justice Sandia National Laboratories Albuquerque, New Mexico

sesuncedu commented 11 years ago

Can you submit this as an issue under marc4j, with the binary and XML versions?

gypsyjoe commented 11 years ago

Yes. I guess I was in the wrong place. Sorry. ☺

-joe

sesuncedu commented 11 years ago

Also, a stack trace would be good if you have one, and sample code would help too.

Simon

gypsyjoe commented 11 years ago

I will send you the process I’m working through because I have several steps that are going on getting me to this point. I can include the original binary MARC21 from which this MARCXML is coming, but, as I cannot convert the MARCXML I sent you, I cannot send you any binary MARC of that step. I’ll do my best to describe what’s going on in my item. But I am able to convert some records that are not included here. I’ll include those files, too, and describe them.

I should have it ready soon. Thanks.

-joe

gypsyjoe commented 11 years ago

How do I attach the files to the issue? It’s issue #26. Here’s the zip of the files I wanted to attach. But I can’t figure out how to do it on the site.

-joe

haschart commented 11 years ago

Looking at the MARCXML record above, the field you added:

<datafield tag="088">
  <subfield code="a">OSTI_ID=1095410</subfield>
</datafield>

is missing the MARC indicators (the ind1 and ind2 attributes).

If you change the added datafield to:

<datafield ind1=" " ind2=" " tag="088">
  <subfield code="a">OSTI_ID=1095410</subfield>
</datafield>

it should parse correctly and produce a valid MARC-8 encoded binary MARC record after conversion.
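Since the original poster was already loading records into a DOM, the same fix can be applied programmatically before the XML is handed to marc4j. The following is a rough sketch using only the JDK's DOM API; the class and method names (`IndicatorRepair`, `parse`, `addMissingIndicators`) are hypothetical, and it assumes unprefixed element names (`datafield`, not `marc:datafield`). Whether a blank is the right indicator value for a given tag is a cataloging decision this sketch does not make.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of marc4j.
class IndicatorRepair {
    // Parse a MARCXML string into a DOM document (default, non-namespace-aware parser).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Add a blank ind1/ind2 to every datafield that lacks one.
    // Returns the tags of the datafields that were changed, in document order.
    static List<String> addMissingIndicators(Document doc) {
        List<String> repaired = new ArrayList<>();
        // Assumption: elements are unprefixed, so the node name is exactly "datafield".
        NodeList fields = doc.getElementsByTagName("datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            boolean changed = false;
            for (String ind : new String[] {"ind1", "ind2"}) {
                if (!field.hasAttribute(ind)) {
                    field.setAttribute(ind, " ");
                    changed = true;
                }
            }
            if (changed) {
                repaired.add(field.getAttribute("tag"));
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        String xml = "<record>"
                + "<datafield ind1=\" \" ind2=\" \" tag=\"027\"><subfield code=\"a\">SAND2011-8282</subfield></datafield>"
                + "<datafield tag=\"088\"><subfield code=\"a\">OSTI_ID=1095410</subfield></datafield>"
                + "</record>";
        Document doc = parse(xml);
        System.out.println(addMissingIndicators(doc)); // prints "[088]"
    }
}
```

Running the check over a whole file before conversion also gives a list of the affected tags, which is easier to act on than a parse failure on a single record.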

sesuncedu commented 11 years ago

Still ought to be handled more gracefully than an NPE.

I was about to split the Reader and Writer tests on a per-class basis, so this is a good excuse.

Simon
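To illustrate what more graceful handling could look like, here is a sketch of a SAX handler that reports which datafield is missing an indicator instead of failing with a bare NullPointerException. It uses only the JDK's SAX API. It is not marc4j's handler, and the names `StrictDatafieldHandler` and `validate` are invented for this example. It rests on one assumption about the cause: `Attributes.getValue` returns null for an absent attribute, so any code that dereferences that result without a check would fail in just this way.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only; not marc4j's own handler.
class StrictDatafieldHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Assumption: unprefixed element names and a non-namespace-aware parser.
        if (!"datafield".equals(qName)) {
            return;
        }
        String tag = atts.getValue("tag");
        for (String ind : new String[] {"ind1", "ind2"}) {
            String value = atts.getValue(ind); // null when the attribute is absent
            if (value == null || value.length() != 1) {
                throw new SAXException("datafield " + tag + ": attribute " + ind
                        + " is missing or is not exactly one character");
            }
        }
    }

    // Returns null when every datafield carries both indicators,
    // otherwise a message describing the first problem found.
    static String validate(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new StrictDatafieldHandler());
            return null;
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String bad = "<record><datafield tag=\"088\">"
                + "<subfield code=\"a\">OSTI_ID=1095410</subfield></datafield></record>";
        String good = "<record><datafield ind1=\" \" ind2=\" \" tag=\"088\">"
                + "<subfield code=\"a\">OSTI_ID=1095410</subfield></datafield></record>";
        System.out.println("bad record:  " + validate(bad));
        System.out.println("good record: " + validate(good));
    }
}
```

The same null check could just as well substitute a blank indicator rather than reject the record; which behavior is preferable is a design choice for the library maintainers.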

gypsyjoe commented 11 years ago

Cool! Let me know if I may be of help or if you have any questions. I'm sure I could forward the DOM code showing how I'm doing things there.

Honestly, I've been banging at this since before Code4Lib and it has been through all sorts of rewrites and attempts to comb out the problem. My latest thought is to pull in the marc4j project code into my servlet code so I can step through the marc4j processes and examine them more completely. But I wasn't able to finish this set up on Friday.

Good luck. I'm burning a candle for us. :-)

-joe
