Closed gmoore016 closed 4 years ago
Thanks for pointing this out :) The easy fix would have been simply to repeat the `if xml_doc and not xml_doc[1].startswith("<!DOCTYPE sequence-cwu")` test on the final `yield`, but I took the opportunity to make it a whitelist check instead of a blacklist check, which is more generic and just a much better approach altogether. I was never happy with that bit in the first place.
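For the record, here is a minimal sketch of what a whitelist check of this kind could look like. It is not the code from the repository: the names `ALLOWED_DOCTYPES`, `is_wanted` and `split_docs` are hypothetical, and the single allowed doctype prefix is an assumption made for illustration.

```python
# Hypothetical whitelist: only documents whose DOCTYPE line starts with one
# of these prefixes are passed on. The actual list is an assumption here.
ALLOWED_DOCTYPES = ("<!DOCTYPE us-patent-grant",)


def is_wanted(xml_doc):
    """Return True if the buffered document has a whitelisted DOCTYPE line."""
    return len(xml_doc) > 1 and xml_doc[1].startswith(ALLOWED_DOCTYPES)


def split_docs(lines):
    """Yield each whitelisted XML document from a stream of concatenated ones."""
    xml_doc = []
    for line in lines:
        if line.startswith("<?xml "):
            # A new document begins: emit the previous one if it is wanted.
            if is_wanted(xml_doc):
                yield "".join(xml_doc)
            xml_doc = []
        xml_doc.append(line)
    # The same check guards the final document, so a trailing
    # sequence-cwu record is dropped like any other.
    if is_wanted(xml_doc):
        yield "".join(xml_doc)


sample = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n',
    "<us-patent-grant/>\n",
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>\n',
    "<sequence-cwu/>\n",
]
docs = list(split_docs(sample))
```

Putting the test in one helper means the in-loop yield and the final yield cannot drift apart again, and any unexpected document type is skipped by default rather than needing its own blacklist entry.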
Cheers :)
Hi @simonwiles --
Hope you're doing well! Please take your time with this, it seems to be a strange edge case. Mostly I just wanted to flag it so there was a record somewhere.
When processing ipg150203.zip, we get this error:
When I re-run with verbose output, it turns out it's getting caught on a `sequence-cwu` object, raising an error because it has no primary key. Per these lines: https://github.com/sul-cidr/patent_data_extractor/blob/d8c0f59cb10e310ebcb88e55e0c83b9848ac87db/patent_xml_to_csv.py#L196-L199 I believe it usually skips them, but for some reason this one seems to be a sticking point. On closer inspection of the file, it appears the particular sequence it's getting caught on is the absolute final record in the file, which suggests to me the skip functionality may be failing on the final document.

If you have a chance, let me know if you see a simple fix! Otherwise I can just comment out the assertion for now and move on.
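To make the suspicion concrete, here is a rough sketch of how a skip could miss the last record. This is a guess at the shape of the logic, not the linked code: `split_docs_buggy` and the sample lines are hypothetical.

```python
def split_docs_buggy(lines):
    """Hypothetical splitter that skips sequence-cwu documents, except the last."""
    xml_doc = []
    for line in lines:
        if line.startswith("<?xml "):
            # The skip test only runs when the NEXT document starts ...
            if xml_doc and not xml_doc[1].startswith("<!DOCTYPE sequence-cwu"):
                yield "".join(xml_doc)
            xml_doc = []
        xml_doc.append(line)
    # ... so whatever is left in the buffer at end of file is yielded unchecked.
    yield "".join(xml_doc)


sample = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n',
    "<us-patent-grant/>\n",
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>\n',
    "<sequence-cwu/>\n",
]
leaked = list(split_docs_buggy(sample))
```

In this sketch the trailing `sequence-cwu` document comes out as the second item of `leaked`, which would match what I'm seeing if the file really does end on one.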