Stray characters before <?xml> (including blank lines) cause XML import to fail

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Select an XML file with the .xml file extension (see attached files)
2. Use the default "Advanced Options" (I also tried several other combinations,
with no luck)
3. Click the "Create Project" button

What is the expected output? What do you see instead?
* I expect to see a table of data with one row for each child XML element of the
root element.
* Instead I see "0 rows" and no data.

What version of the product are you using? On what operating system?
* Google Refine Version 2.0 from google-refine-2.0-r1836.dmg
* Mac OS X 10.6.4 Snow Leopard

Please provide any additional information below.
* I have Gridworks installed as well
* nyt.xml is from feeds.nytimes.com/nyt/rss/HomePage

Original issue reported on code.google.com by chr...@gmail.com on 17 Nov 2010 at 5:09

Attachments:

[nyt - Google Refine.png](https://storage.googleapis.com/google-code-attachments/google-refine/issue-232/comment-0/nyt - Google Refine.png)
nyt.xml
pustakalaya-communities.xml
pustakalaya-resources.xml

GoogleCodeExporter commented 8 years ago

Thanks for the bug report. I can reproduce this, and will take a further look 
at what's going on.

Original comment by iainsproat on 17 Nov 2010 at 5:45

Changed state: Accepted
Added labels: Component-Logic

GoogleCodeExporter commented 8 years ago

The problem is:
[XmlImportUtilities] No candidate elements were found in data - at least 6 
similar elements are required (0ms)

I can count more than 6 <item> tags, so I don't think that's the issue.  It 
might be due to them being in a mixed element, and not getting counted 
correctly.

Original comment by iainsproat on 17 Nov 2010 at 6:26
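
For anyone who wants to check the element counts themselves, here is a minimal Python sketch that tallies how often each tag name occurs in a file. It is only a diagnostic aid: the function name `count_tags` is made up for this example, and it does not reproduce whatever logic XmlImportUtilities uses to pick candidate elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Count how many times each element name appears in the document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag for elem in root.iter())


if __name__ == "__main__":
    sample = "<rss><channel>" + "<item/>" * 7 + "</channel></rss>"
    for tag, n in count_tags(sample).most_common():
        print(tag, n)
```

If the parse itself raises an error, the file is not reaching the counting stage at all, which points at a well-formedness problem rather than at the number of repeated elements.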

GoogleCodeExporter commented 8 years ago

Do you have an example of an XML file that does work?

Original comment by chr...@gmail.com on 17 Nov 2010 at 7:07

GoogleCodeExporter commented 8 years ago

One problem with each of these files is that there are characters before the 
initial <?xml>. This is causing the parser to choke immediately. We should 
probably consider trimming leading whitespace, but you can get the files 
imported with the existing code by deleting the initial blank line.

Original comment by tfmorris on 27 Nov 2010 at 10:34
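
Until the importer trims this itself, the workaround can be scripted. The following is a rough Python sketch of a preprocessing step, not the change made in Refine; the helper name `strip_before_xml_declaration` is hypothetical. It assumes the file has already been read into a string and that everything before the first `<?xml` is junk (blank lines, whitespace, a byte order mark).

```python
import xml.etree.ElementTree as ET


def strip_before_xml_declaration(text: str) -> str:
    """Drop anything that precedes the '<?xml' declaration."""
    start = text.find("<?xml")
    if start == -1:
        # No declaration present: only trim leading whitespace.
        return text.lstrip()
    return text[start:]


if __name__ == "__main__":
    broken = "\n<?xml version='1.0'?><rss><channel><item/></channel></rss>"
    try:
        ET.fromstring(broken)
    except ET.ParseError as err:
        print("before cleanup:", err)
    root = ET.fromstring(strip_before_xml_declaration(broken))
    print("after cleanup, root element:", root.tag)
```

Writing the cleaned text back out to a new file gives something the existing import code should accept, the same as deleting the blank line by hand.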

GoogleCodeExporter commented 8 years ago

Thanks, that worked for me.

Original comment by chr...@gmail.com on 28 Nov 2010 at 7:57

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 7 Jun 2011 at 5:59

Changed title: Stray characters before <?xml> (including blank lines) cause XML import to fail

GoogleCodeExporter commented 8 years ago

Fixed in r2246.

Original comment by tfmorris on 14 Oct 2011 at 10:27

Changed state: Fixed
Added labels: Milestone-2.5

GoogleCodeExporter commented 8 years ago

I'm having a similar problem, and downloaded RC2314, but that version will only 
import the first record of a fairly flat XML file. 

I've enclosed a sample.  I selected the first <entry1> record of the file as 
the first record.

Any suggestions?

Original comment by ron.ma...@gmail.com on 17 Nov 2011 at 4:48

Attachments:

ActivityLog.xml

GoogleCodeExporter commented 8 years ago

Hmm, the schema seems a bit odd. You have uniquely numbered elements such as 
<entry1> <entry2> ... rather than all of them just being <entry>. I used 
Refine's line-based import to pull it in, then did some text filtering, 
value.partition, and replacing, and then exported to give you a cleaner XML 
file to work with. Attached. Does that contain all the records? (I get 1239 
of them using the latest trunk version.)

Original comment by thadguidry on 17 Nov 2011 at 5:09

Attachments:

ActivityLog_Cleaned.xml

GoogleCodeExporter commented 8 years ago

Thanks very much!

Unfortunately, this is output from software I don't write, so I am unable to 
make changes to the schema.

I'm just testing this out to see if Google Refine will let us prepare a report 
from this file.

I guess I'll have to clean it up manually each time I want to import it.

Thanks again!

Original comment by ron.ma...@gmail.com on 17 Nov 2011 at 5:20

GoogleCodeExporter commented 8 years ago

Probably the best thing is to use Notepad++ or whatever text editor you have and 
do a Find/Replace with a regex such as <Entry\d+> and </Entry\d+>, replacing 
those with <Entry> and </Entry> respectively. You could also create a Python 
script to do that as part of a batch process, or, if this is a constant feed, 
perhaps use an ETL tool like Talend to pick up the files in a directory when 
they arrive and convert and clean them for later analysis in Refine.

Original comment by thadguidry on 17 Nov 2011 at 5:38
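
As a starting point for the Python route, here is a small sketch of the same find/replace. It is an illustration under a few assumptions: the function name `normalize_entry_tags` is invented for this example, the match is case-insensitive so it covers both <entry1> and <Entry1>, and the whole file is small enough to hold in memory as one string.

```python
import re

# Matches the start of an opening or closing tag such as <Entry12>, </entry3>,
# or <Entry7 id="x">, capturing the optional slash and the name without digits.
NUMBERED_ENTRY = re.compile(r"<(/?)(entry)\d+(?=[\s>/])", re.IGNORECASE)


def normalize_entry_tags(xml_text: str) -> str:
    """Rewrite numbered tags like <Entry12> ... </Entry12> to <Entry> ... </Entry>."""
    return NUMBERED_ENTRY.sub(r"<\1\2", xml_text)


if __name__ == "__main__":
    sample = "<log><entry1>first</entry1><entry2>second</entry2></log>"
    print(normalize_entry_tags(sample))
```

The lookahead keeps the regex from touching other elements whose names merely begin with "entry" followed by digits and more letters; attributes on the tag, if any, are left untouched.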

GoogleCodeExporter commented 8 years ago

Yes, thanks!  That'll save some time rather than doing it in Refine.

I appreciate this forum, everyone's so helpful!

Original comment by ron.ma...@gmail.com on 17 Nov 2011 at 5:41
