pattersonkl / google-refine

Automatically exported from code.google.com/p/google-refine

2.1 UX: Create Project Flow, Step 2 #285

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Step 2 allows users to preview data with default parsing applied and make any 
necessary changes. Possible formats include:

Google Spreadsheets
Delimiter-separated (TSV/CSV)
Fixed-width text
JSON
XML
Excel
Extension: Choose...

The parse method selector is hidden unless the user asks for it, focusing 
attention on method-specific import options.

Fixed-width has a column identification step, and JSON and XML have tree subset 
selection steps.

Original issue reported on code.google.com by jamesh...@google.com on 15 Dec 2010 at 2:41

GoogleCodeExporter commented 8 years ago
I would think Ignore Quotation Marks and Encoding should be options all the 
time? What do others think?

Original comment by thadguidry on 15 Dec 2010 at 3:40

GoogleCodeExporter commented 8 years ago
Wow!

I'm particularly excited to see the 'create rows from tag' option for JSON and 
XML, which nicely removes the detectPath method from the JSON & XML Importers 
(this method required the entire dataset to be loaded twice over).
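
To illustrate why a user-chosen record element makes path detection unnecessary, here is a minimal Python sketch. It is not Refine's importer code: the function name `rows_from_path` and the sample document are made up for illustration, and it assumes the user's selection can be expressed as a list of keys.

```python
import json

def rows_from_path(document, path):
    """Follow `path` (a list of keys) into parsed JSON and return the records found there."""
    node = document
    for key in path:
        node = node[key]
    # A list at the chosen path yields one row per item; anything else is a single row.
    return node if isinstance(node, list) else [node]

# Hypothetical input: the user clicked on one of the objects under result.items.
sample = json.loads('{"result": {"items": [{"name": "a", "n": 1}, {"name": "b", "n": 2}]}}')
rows = rows_from_path(sample, ["result", "items"])
```

Because the path comes from the user, no separate pass over the data is needed to guess where the records live.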

I'd definitely agree that Encoding should be a parameter for all importers.  
And a user definable quotation mark parameter would be a nice-to-have.

Original comment by iainsproat on 15 Dec 2010 at 10:08

GoogleCodeExporter commented 8 years ago
From Freebase MQL reference:
JSON itself supports 32-bit, 16-bit and 8-bit encodings of Unicode text. 
Metaweb, however, requires the 8-bit UTF-8 encoding.

Is it still true that only UTF-8 encoding is accepted into the graph? Or is 
the data loaded with Refinery converted?

Original comment by thadguidry on 15 Dec 2010 at 3:36

GoogleCodeExporter commented 8 years ago
Random thoughts from top to bottom:

6-refine-import-r1.png:

- *love* the ability to switch the encoding with direct visual feedback at 
parse time... unfortunately, the encoding issues might be past the first few 
rows, so it would be useful at least to be able to paginate thru a more 
substantial part of the dataset

- *love* the checkbox next to the header field... I find myself putting 0 there 
all the time, which requires an extra mouse->keyboard movement, this is so much 
better

- might be worth indicating that "auto-detect types" has a performance penalty 
right there

- do the top "next" and bottom "next" buttons do the same thing? one might 
mistake the one above for a paginating option
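
The point about encoding problems hiding past the first few rows can be sketched in Python. This is only an illustration, not how Refine previews data: `first_undecodable_line` is a hypothetical helper, and splitting on the newline byte assumes an ASCII-compatible encoding such as UTF-8 or Latin-1.

```python
def first_undecodable_line(raw: bytes, encoding: str):
    """Return the 1-based number of the first line that fails to decode, or None."""
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            return number
    return None

# 50 plain ASCII rows followed by one row containing a Latin-1 encoded "é" (byte 0xE9).
data = b"\n".join([b"id,name"] * 50 + ["51,caf\u00e9".encode("latin-1")])

utf8_problem = first_undecodable_line(data, "utf-8")      # the bad row is far below a short preview
latin1_problem = first_undecodable_line(data, "latin-1")  # Latin-1 decodes every byte
```

A preview limited to the first handful of rows would look identical under both encodings here, which is the case for being able to page further into the file.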

7-refine-import-r1.png:

 - I wonder if 6-refine-import-r1.png should really always look like 7-refine-import-r1.png... the benefit there would be to show off all the stuff that Refine can import data from, which is not exposed unless you decide to change the parsing method (and some people might not even know what parsing means!)

"fixed-length workflow": 

- not sure I understand how it's supposed to work: you click on the 'down 
arrow', and that creates a cursor that you then move around horizontally to the 
desired location? or you just click on the bar at the top to select where the 
breakpoint is and the down triangle has a menu with various other options?

"JSON"

- "click on the JSON row" -> brilliant. 

"XML"

 - in a project from a past life (SIMILE Gadget) one thing I did that turned out to be very useful when loading and understanding big quantities of XML that others gave you was the ability to visualize the "skeleton" of the entire tree. See the screenshot at [http://simile.mit.edu/wiki/Gadget]. Basically it only shows the paths and uses sparklines to try to give you hints on what kind of data that is and how you should interpret it. Might be overkill for this stage, but I thought I would mention it as it might give ideas.

 - the only tricky difference between JSON and XML is that XML can contain mixed content while JSON never can. By mixed content I mean text interleaved with child elements, something like <a>this <b>and</b> that</a>. Parsing this is tricky because you don't know in what column to put the "this " string. It's entirely possible to have structured XML with mixed-content XML fragments embedded in it (for example, Atom with an included XHTML fragment). I think Refine should provide some guidance there, in case the user selects as key an element that directly contains mixed content.
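
A small Python sketch of the mixed content problem, using the standard library's `xml.etree.ElementTree`. The sample element and the variable names are illustrative only; the point is that the text of one element ends up in three separate places, none of which maps obviously to a column.

```python
import xml.etree.ElementTree as ET

# Mixed content: text and a child element interleaved inside <a>.
root = ET.fromstring("<a>this <b>and</b> that</a>")
child = root[0]

leading_text = root.text     # text before the first child element
child_text = child.text      # text inside <b>
trailing_text = child.tail   # text after </b>, still inside <a>

# Flattening is one option an importer could offer, at the cost of losing the structure.
flattened = "".join(root.itertext())
```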

In any case, don't let the number of comments spoil the fact that this is an 
amazing job.

Original comment by stefa...@google.com on 15 Dec 2010 at 8:18

GoogleCodeExporter commented 8 years ago
Quotation Marks and Encoding: David had specific thoughts about why some 
formats shouldn't have these. They aren't clear enough to me to summarize, so 
let's wait and hear from him on this. They're easy to add if that's the right 
thing.

Quotation Mark Parameter: interesting idea. Easy to add from a UI standpoint, 
less sure about implementation.

Pagination: let's discuss with David how much complexity this adds - I'm not 
fundamentally opposed to it here.

Auto-Detect Types warning: makes sense

Next buttons: they do the same thing. Let's see where we get to with pagination 
and then decide whether we want to tweak this somehow.

Showing/Hiding Parse Formats: yeah, I'm of two minds about this. Always 
showing them felt like a lot of cognitive overhead to introduce when we expect 
in most cases to guess the parsing correctly. How often do we expect to get it 
right? Either way, I'm open to always showing. What do other folks think?

Fixed Length Workflow: they aren't down arrows; they work like tab stops in a 
word processor, but agreed that it's problematic how similar those two kinds of 
elements are. You drag arrows from the well horizontally into position. It's 
hard to convey in a static image, but I think the down arrow confusion is a 
valid issue regardless and I'll look at a different icon for this.
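
For anyone picturing what the tab stops produce, here is a minimal Python sketch, assuming each stop is simply a character offset where one column ends and the next begins. `split_fixed_width` and the sample line are hypothetical, not part of the design.

```python
def split_fixed_width(line, breaks):
    """Cut `line` at each character offset in `breaks` and return the stripped cells."""
    edges = [0] + sorted(breaks) + [len(line)]
    return [line[start:end].strip() for start, end in zip(edges, edges[1:])]

# Two stops dragged to offsets 10 and 14 give three columns.
cells = split_fixed_width("Alice     42  London", [10, 14])
```

Dragging a stop changes one offset in `breaks`, so the preview can be recomputed line by line as the user moves it.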

XML: happy to get more complex with this as necessary, and Gadget is super 
cool. This already seems pretty complex for a 2.1, so I'll wait for some 
guidance from whoever is implementing this, in terms of how much further we 
want to go here.

Thanks for the detailed feedback!

Original comment by jamesh...@google.com on 15 Dec 2010 at 8:45

GoogleCodeExporter commented 8 years ago
Regarding quotation marks: they are only troublesome in delimiter-separated 
(text) files because they are used to escape the delimiters, e.g., consider

  one,"two,three",four

There are 3 cells if "ignore quotation marks" is false, and 4 cells if true. 
Quotation marks are not an issue at all in other formats. For example, Excel 
files already have cells well separated.
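
The cell counts above can be reproduced with Python's standard `csv` module. This is a sketch of the behaviour only, not Refine's parser; `parse_line` and its `ignore_quotes` flag are made-up names standing in for the "ignore quotation marks" option.

```python
import csv
import io

def parse_line(line, ignore_quotes):
    """Parse one comma-separated line, optionally treating quotation marks as plain characters."""
    quoting = csv.QUOTE_NONE if ignore_quotes else csv.QUOTE_MINIMAL
    return next(csv.reader(io.StringIO(line), quoting=quoting))

line = 'one,"two,three",four'
honoured = parse_line(line, ignore_quotes=False)  # quotes protect the inner comma
ignored = parse_line(line, ignore_quotes=True)    # every comma splits; quotes stay in the cells
```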

Regarding encoding: I think we need it for all except binary formats, like 
Excel.

Regarding the fixed length, JSON, and XML formats: do we need the two-step 
wizards? Or can record selection or column selection be done in any order with 
respect to setting the other options? It might be easier for implementation to 
not have the wizards.

Will there be a way to select which file(s) inside an archive file to import?

Original comment by dfhu...@gmail.com on 24 Dec 2010 at 6:31

GoogleCodeExporter commented 8 years ago
Also, if the data is pasted from the clipboard and then sent in the HTTP POST 
body, then encoding is not an issue.

Original comment by dfhu...@gmail.com on 24 Dec 2010 at 8:44

GoogleCodeExporter commented 8 years ago
Awesome feature. Are there any plans to release it? 

Original comment by techtonik@gmail.com on 16 Apr 2011 at 4:00

GoogleCodeExporter commented 8 years ago
This is being worked on. No definite release date.

Original comment by dfhu...@gmail.com on 17 Apr 2011 at 3:40

GoogleCodeExporter commented 8 years ago
Revised and implemented, to be in 2.5.

Original comment by dfhu...@gmail.com on 1 Sep 2011 at 6:38

GoogleCodeExporter commented 8 years ago

Original comment by dfhu...@google.com on 9 Oct 2011 at 5:14