thinker007 / google-refine

Automatically exported from code.google.com/p/google-refine

Allow "Fetch URL" to modify/add to existing column rather than only creating new column #120

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Use the "Add Column by Fetching URLs..." feature to add a new column.
2. On a large or complex dataset, or if you choose too low a value for the 
"Throttle delay", some cells in the newly created column often end up blank 
or incorrect.
3. There is no way to fetch URLs into an existing column (i.e. through the 
"Transform" cells feature).  You always have to create a new column (sometimes 
several) and then manually transfer the results into the existing one.

What is the expected output? What do you see instead?

You could add "Fetch URLs" to the "Transform" cells feature, but that would be 
clunky.  Allowing the "Fetch URLs" feature to be used on an existing column 
would be a better approach.

Of course you then have the question of what to do when the cell in the 
existing column isn't empty: do you overwrite or not?  I think the choices 
are "Fetch URL for empty cells only" / "Overwrite existing cell contents".  If 
you don't have to fetch the URL in the first place, it would speed things up 
considerably on a large dataset.  (You could have a simple "Overwrite" checkbox, 
which is really what it would be under the covers, I think, but the two states 
of the boolean are pretty different from each other, which is why I suggest 
framing it as two distinct choices.)
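
To make the two choices concrete, here is a rough Python sketch of the 
behavior being requested.  It is not Refine/Gridworks code: `FetchMode`, 
`fetch_into_column`, and the caller-supplied `fetch` callback are all 
hypothetical names, and the column is modeled as a plain list of cell values.

```python
from enum import Enum
from typing import Callable, List, Optional


class FetchMode(Enum):
    # The two user-facing choices proposed above (hypothetical names).
    EMPTY_ONLY = "Fetch URL for empty cells only"
    OVERWRITE = "Overwrite existing cell contents"


def fetch_into_column(
    urls: List[str],
    cells: List[Optional[str]],
    mode: FetchMode,
    fetch: Callable[[str], str],
) -> List[Optional[str]]:
    """Return a copy of `cells` with fetched content filled in.

    `fetch` is whatever performs the HTTP request; it is passed in so the
    sketch makes no assumption about how Refine fetches a URL.
    """
    result = list(cells)
    for row, url in enumerate(urls[: len(result)]):
        has_value = result[row] not in (None, "")
        if mode is FetchMode.EMPTY_ONLY and has_value:
            # Skip the request entirely; this is where the time is saved.
            continue
        result[row] = fetch(url)
    return result
```

In `EMPTY_ONLY` mode no request is issued for rows that already hold a value, 
which is where the speedup on a large dataset would come from.  `OVERWRITE` 
fetches every row, the same as building a fresh column.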

What version of the product are you using? On what operating system?

Trunk version of Gridworks on Windows 7.

Original issue reported on code.google.com by bil...@gmail.com on 2 Sep 2010 at 6:24

GoogleCodeExporter commented 9 years ago

Original comment by iainsproat on 14 Oct 2010 at 4:33

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 12 Dec 2010 at 7:52

GoogleCodeExporter commented 9 years ago
Another solution to this problem would be to make the operation 
restartable/continuable so that Refine keeps track of which cells have been 
successfully fetched.

This wouldn't cover the use case where you want to update existing 
values, but it would take care of the error case.
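
A minimal Python sketch of the restartable idea, under the assumption that 
success is tracked per row index.  The names `resume_fetch`, `done`, and 
`fetch` are hypothetical and do not come from Refine; any exception raised by 
`fetch` is treated as a failed cell.

```python
from typing import Callable, List, Optional, Set


def resume_fetch(
    urls: List[str],
    results: List[Optional[str]],
    done: Set[int],
    fetch: Callable[[str], str],
) -> List[int]:
    """Fetch only the rows not yet recorded in `done`.

    `results` and `done` are updated in place so the same call can be
    repeated later; the return value lists the rows that failed this pass.
    """
    failed = []
    for row, url in enumerate(urls):
        if row in done:
            continue  # fetched successfully on an earlier pass
        try:
            results[row] = fetch(url)
        except Exception:
            # Leave the cell untouched and remember it for the next pass.
            failed.append(row)
        else:
            done.add(row)
    return failed
```

Calling `resume_fetch` again with the same `results` and `done` retries only 
the rows that failed, so successfully fetched cells are never requested twice.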

Original comment by tfmorris on 26 Jan 2012 at 6:42