ghmo / google-refine

Automatically exported from code.google.com/p/google-refine

XML import merges records together #137

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I imported a small XML file, but the resulting project (attached) cannot be 
displayed in the browser.  The browser complains of an unresponsive script on 
various lines of project-bundle.js (lines 52, 29, 40 and 109 have been 
indicated at various stages).

The unresponsive script seems to occur after the GET response for 
/scripts/views/data-table/column-header.html.

Original issue reported on code.google.com by iainsproat on 22 Sep 2010 at 10:02

Attachments:

GoogleCodeExporter commented 8 years ago
Iain, I can't import the project attached. It appears to be corrupted. Could 
you export just the data?

Original comment by dfhu...@google.com on 28 Sep 2010 at 4:46

GoogleCodeExporter commented 8 years ago
Attached is the original data file I imported.

ClinicalTrials.gov only outputs individual XML files (see issue 131), so I 
had an external script stitch them into a single XML file.
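
A stitching step of that kind might look roughly like the sketch below. It is 
not the script that was actually used: the wrapper element name 
`clinical_studies`, the function name and the sample documents are all 
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def stitch_xml(documents, wrapper_tag="clinical_studies"):
    """Wrap the root element of each XML document in one new root element."""
    root = ET.Element(wrapper_tag)  # hypothetical wrapper name
    for text in documents:
        root.append(ET.fromstring(text))
    return ET.tostring(root, encoding="unicode")


# Made-up stand-ins for the individual per-study files.
docs = [
    "<clinical_study><id>A</id></clinical_study>",
    "<clinical_study><id>B</id></clinical_study>",
]
combined = stitch_xml(docs)
print(combined)
```

Each original root element becomes one child of the wrapper, so each child 
should import as one record.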

Original comment by iainsproat on 28 Sep 2010 at 8:20

Attachments:

GoogleCodeExporter commented 8 years ago
I can repro the bug, although I haven't figured out how to fix it. Exporting 
the data as HTML table works--all the records seem to be there. But inside 
Refine, only the first 2 records are shown. I guess the JSON stream got cut 
off? Which is weird because that would cause a syntax error.

Original comment by dfhu...@google.com on 28 Sep 2010 at 7:18

GoogleCodeExporter commented 8 years ago
Attached is a smaller version of the file which should make debugging a little 
easier.

Part of the problem was the multi-line text elements were confusing the parser, 
but I've just committed a change that should fix that piece.  It's still 
splitting a single record into multiple rows when it shouldn't be.  I'll take a 
look and see why.

As an aside, I think the reason the browser goes berserk is that it's trying to 
deal with a large number of null cells.  Fixing the project import will 
probably solve this, but someone who's got stronger JavaScript fu than I might 
want to take a look at whether there's another problem lurking there.

Original comment by tfmorris on 27 Nov 2010 at 10:14

Attachments:

GoogleCodeExporter commented 8 years ago
I think I've figured out what the problem is here.  The column groups for this 
XML fragment are getting computed incorrectly:

   <sponsors>
      <lead_sponsor>
        <agency>National Center for Research Resources (NCRR)</agency>
        <agency_class>NIH</agency_class>
      </lead_sponsor>
      <collaborator>
        <agency>HRSA/Maternal and Child Health Bureau</agency>
        <agency_class>U.S. Fed</agency_class>
      </collaborator>
    </sponsors>

Refine is currently computing three column groups from this: lead_sponsor (2 
columns), collaborator (2 columns), and sponsors (all 4 columns).  This last 
group is triggering unnecessary row dependencies when there is no collaborator 
element.
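
To make the three groups concrete, here is a minimal Python sketch that derives 
them from the fragment above. It assumes that every element with child elements 
becomes a column group covering all leaf columns beneath it. That is a 
simplification for illustration, not Refine's actual code, and the function 
name `column_groups` is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<sponsors>
  <lead_sponsor>
    <agency>National Center for Research Resources (NCRR)</agency>
    <agency_class>NIH</agency_class>
  </lead_sponsor>
  <collaborator>
    <agency>HRSA/Maternal and Child Health Bureau</agency>
    <agency_class>U.S. Fed</agency_class>
  </collaborator>
</sponsors>
"""


def column_groups(element, path=""):
    """Return (groups, leaf_columns) for the subtree rooted at element.

    groups maps the path of each element that has children to the list of
    leaf-column paths underneath it.
    """
    here = f"{path}/{element.tag}" if path else element.tag
    children = list(element)
    if not children:
        return {}, [here]
    groups, leaves = {}, []
    for child in children:
        child_groups, child_leaves = column_groups(child, here)
        groups.update(child_groups)
        leaves.extend(child_leaves)
    groups[here] = leaves
    return groups, leaves


groups, _ = column_groups(ET.fromstring(FRAGMENT.strip()))
for name, columns in groups.items():
    print(name, len(columns))
```

Under that assumption the outer `sponsors` group spans all four columns, which 
is the group that ties otherwise unrelated rows together.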

I'm tempted to say that column groups which consist of nothing but other 
groups, without any individual ungrouped columns of their own, should be 
eliminated, but it's a fairly critical piece of code, so I want to look at it a 
little more closely.

p.s. I suspect the reason for the sluggish browser performance is that it was 
trying to deal with the entire file at once since the monster second record 
effectively disabled the paging.

Original comment by tfmorris on 14 Oct 2011 at 6:49

Attachments:

GoogleCodeExporter commented 8 years ago
Reversing course, I now think that perhaps we should be computing row->record 
dependencies directly since we have that information from the parse.  We know 
what the path to the top level element is and, by definition, all rows that we 
create during the parse of its children are part of the same record.
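
As a rough illustration of that idea, the sketch below tags every row with the 
index of the top-level element it was parsed from, so no key column is needed 
afterwards. The one-row-per-leaf flattening, the function name and the sample 
data are assumptions for the example, not how the importer builds rows.

```python
import xml.etree.ElementTree as ET


def rows_with_record_ids(xml_text, record_tag):
    """Flatten each record element into rows tagged with its record index."""
    root = ET.fromstring(xml_text)
    rows = []
    for record_id, record in enumerate(root.findall(record_tag)):
        # One row per leaf element is a deliberate simplification.
        for element in record.iter():
            if len(element) == 0:
                rows.append((record_id, element.tag, (element.text or "").strip()))
    return rows


# Made-up sample: two records, the first with two agencies.
DATA = (
    "<studies>"
    "<study><agency>A</agency><agency>B</agency></study>"
    "<study><agency>C</agency></study>"
    "</studies>"
)
for row in rows_with_record_ids(DATA, "study"):
    print(row)
```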

Visually the column groups look like this:

Col #         2222333344445555
Sponsor group SSSSSSSSSSSSSSSS
Lead/Collab   CCCCccccLLLLllll

The three groups are group S, group Cc, and group Ll.  Note that the columns 
have been reordered with the Collaborator columns before the Lead columns.  
Group S is the problematic one.  Since the Collaborator sponsors are optional, 
the "key" column #2 can be blank for a top-level record, causing it to get 
merged with the previous record (the algorithm searches back for a row with a 
non-blank "key" cell if any of the cells in the column group are non-blank).
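
The merging effect can be reproduced with a small Python sketch. It assumes a 
simplified rule (a row whose key cell is blank joins the previous record) and 
uses made-up column names and agency values. It is not the actual Refine 
algorithm.

```python
def assign_records(rows, key_column):
    """Give each row a record index; a blank key cell joins the previous record."""
    record_ids = []
    current = -1
    for row in rows:
        if row[key_column] not in (None, "") or current < 0:
            current += 1
        record_ids.append(current)
    return record_ids


# Three top-level records; only the first has a collaborator (values invented).
rows = [
    {"collab_agency": "Agency X", "lead_agency": "Agency 1"},
    {"collab_agency": None, "lead_agency": "Agency 2"},
    {"collab_agency": None, "lead_agency": "Agency 3"},
]

print(assign_records(rows, "collab_agency"))  # [0, 0, 0]
print(assign_records(rows, "lead_agency"))    # [0, 1, 2]
```

With the optional collaborator column as the key, all three rows collapse into 
one record. With the always-populated lead column they stay separate.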

Group S is a real column group (consisting solely of other column groups), but 
since we don't appear to use column groups for anything, I'm not sure what 
value it has.

I think there are two options here:

1. Eliminate column groups which consist solely of other column groups from the 
dependency analysis

2. Compute row dependencies using a different method than column groups (e.g. 
use the tree structure directly from the parse).
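
Option 1 could be sketched as a filter like the one below. It assumes groups 
are stored as a mapping from a slash-separated path to the columns they cover, 
which is a hypothetical representation chosen for the example. A group is 
dropped when every one of its columns already belongs to a nested subgroup.

```python
def drop_container_groups(groups):
    """Remove groups whose columns are all covered by their nested subgroups."""
    kept = {}
    for name, columns in groups.items():
        nested = set()
        for other, other_columns in groups.items():
            if other != name and other.startswith(name + "/"):
                nested.update(other_columns)
        if not set(columns) <= nested:
            kept[name] = columns
    return kept


groups = {
    "sponsors/lead_sponsor": [
        "sponsors/lead_sponsor/agency",
        "sponsors/lead_sponsor/agency_class",
    ],
    "sponsors/collaborator": [
        "sponsors/collaborator/agency",
        "sponsors/collaborator/agency_class",
    ],
    "sponsors": [
        "sponsors/lead_sponsor/agency",
        "sponsors/lead_sponsor/agency_class",
        "sponsors/collaborator/agency",
        "sponsors/collaborator/agency_class",
    ],
}
print(list(drop_container_groups(groups)))
```

Here only the two inner groups survive, so the outer group no longer takes part 
in the dependency analysis.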

Opinions?

Original comment by tfmorris on 27 Oct 2011 at 12:04

GoogleCodeExporter commented 8 years ago
This is fixed, at least well enough for this example, with r2347.  The cell 
data counts weren't getting updated before sorting the column groups, causing 
them all to be zero, which meant data-rich columns weren't getting placed where 
Refine needed them to be for the key column.

I think there's still an underlying issue where the current counting strategy 
can get fooled into choosing the wrong column.  For example, if Column A is 
mandatory but never has more than a single value, and Column B values are 
optional but high frequency (e.g. present in 50% of the records, with 3 values 
each, giving it 1.5x as many cells as Column A), then Column B will get chosen 
in preference to Column A.
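
The arithmetic can be checked with a short sketch. It assumes the key column is 
simply the one with the most cells, which is a simplification of the counting 
strategy described above, and the record data is invented.

```python
def cell_counts(records):
    """Count the total number of cells per column across all records."""
    counts = {}
    for record in records:
        for column, values in record.items():
            counts[column] = counts.get(column, 0) + len(values)
    return counts


# 100 made-up records: A always has one value, B has three values in half of them.
records = []
for i in range(100):
    record = {"A": [f"a{i}"]}
    if i % 2 == 0:
        record["B"] = [f"b{i}-{n}" for n in range(3)]
    records.append(record)

counts = cell_counts(records)
key = max(counts, key=counts.get)
print(counts, key)  # {'A': 100, 'B': 150} B
```

Column B wins on cell count even though it is blank for half of the records, 
which makes it a poor key.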

There's also the case where no single column in a column group always has a 
value (e.g. a column group where either column A OR column B is populated for 
any given record).

Original comment by tfmorris on 28 Oct 2011 at 7:56

GoogleCodeExporter commented 8 years ago
Reopening.  The theoretical issue that I suspected has been confirmed in the 
wild by a new XML file received from a commenter on issue 393, so we're going 
to need a different approach to dealing with column groups.

Original comment by tfmorris on 6 Nov 2011 at 9:14

GoogleCodeExporter commented 8 years ago
Issue 393 has been merged into this issue.

Original comment by tfmorris on 6 Nov 2011 at 9:26

GoogleCodeExporter commented 8 years ago
Bump all unfinished 2.5 Milestone tags to next release

Original comment by tfmorris on 12 Dec 2011 at 7:56

GoogleCodeExporter commented 8 years ago
Hello all, 

Outside user, working to clean up someone else's messy server.  After importing 
three .csv files that total 5019 records in length, I delete the 'file' column 
as it is not needed.  The record count then drops to 1333.  I do not have any 
facets selected.  When I try to click around and discover the source of the 
problem, Firefox tells me that I have an unresponsive script:

http://127.0.0.1:3333/project-bundle.js:7973

And asks me if I wish to continue or not.  If I continue, it will eventually 
ask me again.  If I stop the script, it will crash the tab I am running Refine 
in.  

I converted my original .csv files into .tsv format, and reopened them in 
Refine.  This solved my issue.  

Quirky little bug with a simple workaround for those of you out there in 
Userland who are not quite up to speed in programming languages. 

Original comment by emilytgriffiths on 13 Apr 2012 at 2:56

GoogleCodeExporter commented 8 years ago
Since you're working with CSV, not XML, it's clearly not related to this bug.  
Please feel free to post on the Google Refine mailing list if you'd like 
assistance.  It sounds like you probably had a mostly blank initial column in 
one or more of your CSVs.  You can either process it in "row" mode (instead of 
"record" mode) or shuffle the columns around so it's not an issue.

Original comment by tfmorris on 13 Apr 2012 at 4:58

GoogleCodeExporter commented 8 years ago
Could someone provide a pointer on how exactly the record detection works? 
Does it happen at import time or at runtime? Where can I find the code 
responsible for this? When I provide a properly structured XML file, it merges 
some of the records into one, but I can't see any features that distinguish 
those records from the others... I'd like to look at the source of the 
problem, but currently have no idea where to look.

I'm attaching the offending XML.
Thanks

Original comment by libo...@gmail.com on 11 Jun 2012 at 11:24

Attachments:

GoogleCodeExporter commented 8 years ago
Clicking the rev above that I used to repair part of the problem (r2347) will 
get you in the right ballpark.  The XML and JSON importers are in the 
"tree-shaped" importer family.

ImportColumnGroup is the class which manages the column groups that are used to 
determine dependent rows/records.  Anything that references it is probably 
involved.  TreeImportUtilities and XmlImportUtilities have methods which are 
used with this.

Hope that helps get you started.  Let us know on the dev list if you have any 
questions.

Original comment by tfmorris on 12 Jun 2012 at 8:15