vndimitrova / google-refine

Automatically exported from code.google.com/p/google-refine

Refine can't open XML files from US PTO #336

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

When I tried to import an XML file downloaded from 
http://www.google.com/googlebooks/uspto-patents-grants-text.html (any one of 
them), Refine did not show all the fields. I even tried importing without 
splitting into columns, but it still showed only three fields/columns. I am 
using version 2.0 of Google Refine on Mac OS.

What could possibly be the reason?

Please let me know.
Thanks.
Regards,
Amrapali

Original issue reported on code.google.com by amrapali...@gmail.com on 17 Feb 2011 at 4:16

GoogleCodeExporter commented 9 years ago
That page has files in three different formats, so "any" isn't very specific.  
I examined ipg050712.xml which is in XML format and found that it has no root 
element.  In other words, it looks like

<patent></patent>
<patent></patent>

rather than 

<patents>
  <patent></patent>
  <patent></patent>
</patents>

Google Refine will only handle XML files with a single root element, so you'll 
need to modify the files by wrapping the records in one.  You'll end up with 
very large grids of cells, which are likely to make your browser quite 
sluggish, so even after all this Refine might not be the best tool for the 
job, but it will import the files.

Original comment by tfmorris on 19 Nov 2011 at 12:04
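
The wrapping step described in the comment above could be sketched roughly as follows. This is a minimal illustration, not part of the original comment: the function name wrap_in_root is hypothetical, and the root name "patents" is simply taken from the example above. It also assumes that any per-record XML declaration or DOCTYPE line has no internal subset (no "[...]" block); whether the USPTO files need that stripping should be checked against the actual file.

```python
import re
import xml.etree.ElementTree as ET

# Matches XML declarations and simple DOCTYPE declarations.
# Assumption: DOCTYPEs have no internal subset ("[ ... ]"), which this
# pattern would not handle.
_PROLOG = re.compile(r"<\?xml[^>]*\?>|<!DOCTYPE[^>]*>")


def wrap_in_root(text, root="patents"):
    """Wrap a run of sibling XML elements in a single root element.

    Any XML/DOCTYPE declarations found in the text are removed first,
    because they are not allowed inside an element.
    """
    body = _PROLOG.sub("", text).strip()
    return "<{0}>\n{1}\n</{0}>\n".format(root, body)


# Same shape as the rootless example in the comment above.
sample = "<patent></patent>\n<patent></patent>\n"
wrapped = wrap_in_root(sample)

# The result now parses as one document with two child records.
root_element = ET.fromstring(wrapped)
print(root_element.tag, len(root_element))  # prints: patents 2
```

For a real file, the same function could be applied to the file's contents and the result written to a new file before importing it. The files are large, so reading one fully into memory like this is only reasonable if the machine has room for it.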