sul-cidr / patent_data_extractor


Ending on a Sequence #11

Closed by gmoore016 4 years ago

gmoore016 commented 4 years ago

Hi @simonwiles --

Hope you're doing well! Please take your time with this; it seems to be a strange edge case. Mostly I just wanted to flag it so there's a record somewhere.

When processing ipg150203.zip, we get this error:

Traceback (most recent call last):
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 360, in convert
    self.process_doc(doc)
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 343, in process_doc
    self.process_path(tree, path, config, {})
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 334, in process_path
    self.process_field(elems, tree, path, config, record, parent_entity, parent_pk)
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 284, in process_field
    self.process_new_entity(tree, elems, config, parent_entity, parent_pk)
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 241, in process_new_entity
    pk = self.get_pk(tree, config)
  File "/oak/stanford/groups/hlwill/gsmoore/projects/patent_data_extractor/patent_xml_to_csv.py", line 227, in get_pk
    assert len(elems) == 1
AssertionError

When I re-run with verbose output, it turns out the script is getting caught on a sequence-cwu document: the assertion in get_pk fails because that document has no primary key. Per these lines, I believe such documents are normally skipped: https://github.com/sul-cidr/patent_data_extractor/blob/d8c0f59cb10e310ebcb88e55e0c83b9848ac87db/patent_xml_to_csv.py#L196-L199

For some reason this one is a sticking point. On closer inspection, the sequence it's getting caught on is the very last record in the file, which suggests to me that the skip logic may be failing on the final document.
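For the record, here is a minimal sketch of how that kind of failure could arise. It is not the code from patent_xml_to_csv.py: the function name split_docs_buggy, the sample lines, and the DTD file names are made up for illustration, and it assumes the input is a run of concatenated XML documents, each starting with an XML declaration followed by a DOCTYPE line.

```python
from typing import Iterable, Iterator, List


def split_docs_buggy(lines: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical splitter: the skip check runs only when a new document begins."""
    xml_doc: List[str] = []
    for line in lines:
        if line.startswith("<?xml "):
            # A new XML declaration closes the previous document, which is
            # yielded unless its DOCTYPE line marks it as a sequence-cwu.
            if xml_doc and not xml_doc[1].startswith("<!DOCTYPE sequence-cwu"):
                yield xml_doc
            xml_doc = []
        xml_doc.append(line)
    # The last document is flushed here WITHOUT the sequence-cwu check,
    # so a trailing sequence-cwu document slips through.
    if xml_doc:
        yield xml_doc


# Made-up sample: a patent grant followed by a sequence-cwu as the final record.
sample = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>',
    "<us-patent-grant/>",
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>',
    "<sequence-cwu/>",
]

# Doctype names of the documents that reach the downstream processing step.
leaked = [doc[1].split()[1] for doc in split_docs_buggy(sample)]
print(leaked)
```

If the real splitter has this shape, a sequence-cwu document in the middle of the file is dropped as expected, but one at the very end is still handed to the processing step, which would match the traceback above.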

If you have a chance, let me know if you see a simple fix! Otherwise I can just comment out the assertion for now and move on.

simonwiles commented 4 years ago

Thanks for pointing this out :) The easy fix would have been simply to repeat the if xml_doc and not xml_doc[1].startswith("<!DOCTYPE sequence-cwu") test on the final yield, but I took the opportunity to make it a whitelist check instead of a blacklist check, which is more generic and just a much better approach altogether. I was never happy with that bit in the first place.
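A rough sketch of the whitelist idea, for anyone reading later. This is not the actual commit: ALLOWED_DOCTYPES, is_wanted, and split_docs are names invented here, the whitelist contents are an assumption, and the sample DTD file names are placeholders.

```python
from typing import Iterable, Iterator, List

# Assumed whitelist for illustration; the real tool may accept other doctypes.
ALLOWED_DOCTYPES = {"us-patent-grant"}


def is_wanted(xml_doc: List[str]) -> bool:
    """Return True only if the document's DOCTYPE line names a whitelisted type."""
    if len(xml_doc) < 2:
        return False
    parts = xml_doc[1].split()
    return len(parts) >= 2 and parts[0] == "<!DOCTYPE" and parts[1] in ALLOWED_DOCTYPES


def split_docs(lines: Iterable[str]) -> Iterator[List[str]]:
    """Split concatenated XML documents, yielding only whitelisted ones."""
    xml_doc: List[str] = []
    for line in lines:
        if line.startswith("<?xml ") and xml_doc:
            # A new XML declaration closes the previous document.
            if is_wanted(xml_doc):
                yield xml_doc
            xml_doc = []
        xml_doc.append(line)
    # The same check is applied to the final document in the file.
    if is_wanted(xml_doc):
        yield xml_doc


# Made-up sample: a patent grant followed by a sequence-cwu as the final record.
sample = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>',
    "<us-patent-grant/>",
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>',
    "<sequence-cwu/>",
]

kept = [doc[1].split()[1] for doc in split_docs(sample)]
print(kept)
```

Routing both yield points through one predicate means the final document cannot bypass the filter, and any unexpected doctype is dropped by default rather than having to be listed explicitly.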

Cheers :)