rezarezash / google-refine

Automatically exported from code.google.com/p/google-refine

Fetching URLs from Web Services instability #413

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
While making extensive use of the "Fetching URLs from Web Services" functionality 
I have found many problems: incomplete processes, aborted operations, etc. 

What steps will reproduce the problem?
1. Use the "Add Column by fetching URLs..." with 2,000 rows.
2. Invoke any JSON API using "value" for input, name the New Column and set 
Throttle Delay to 2000.
3. Wait for the process to complete ...

What is the expected output? What do you see instead?
The more common scenarios are:

1) While invoking the Web Services for 2,000+ rows, the progress window 
suddenly closes, and the new Column isn't even created. All the progress is 
lost (this is indeed a big issue when the calls go to a paid service).
2) Sometimes the process completes, but the new Column doesn't show up; Refine 
has to be restarted before the new Column is displayed.
3) Sometimes the column is created, but all the cells are empty.
4) Sometimes the process just hangs, and you need to Cancel the operation 
(and lose all the progress).

At a minimum, there needs to be a way to keep the progress made before the 
crash, so you don't lose all the rows that have already been fetched. (Maybe 
this will need a two-step approach: first creating the Column and then invoking 
the Web Service.)

What version of Google Refine are you using?
I have tried with 2.0 and 2.1 RC1

What operating system and browser are you using?
Mac OSX, Chrome

Is this problem specific to the type of browser you're using or it happens
in all the browsers you tried?
I have tried Firefox 4.x and Chrome 11.x, same behavior.

Please provide any additional information below.

Using this feature on 1,500 rows or fewer works great. I have tried setting the 
Throttle Delay to 2000, the default (5000), or longer. This sometimes improves 
the probability that the process completes, but sometimes even with these 
values the process doesn't work.

Original issue reported on code.google.com by xco...@gmail.com on 21 Jun 2011 at 7:29

GoogleCodeExporter commented 9 years ago

Original comment by dfhu...@gmail.com on 1 Jul 2011 at 3:04

GoogleCodeExporter commented 9 years ago
We've fixed a bug (issue 440) which would cause the project to get purged from 
memory (and then immediately reloaded), interrupting the fetching of URLs if it 
ran longer than one hour, which may have been part of your problem.

It's an inherent artifact of the current implementation that the actual Add 
Column doesn't happen until all the data has been collected.  Because of this, 
you should plan to operate on data sets which are small enough that you have a 
high probability of completing the data collection without being interrupted.  
With the bug fix in place, we've had reports of users running fetches for over 
24 hours successfully, so as long as your infrastructure and the service that 
you're hitting are reliable, there's no inherent limit on how long you can run.
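
For anyone who needs to limit the cost of an interrupted run in the meantime, 
the "work in small batches" advice can also be applied outside Refine. The 
Python sketch below is only an illustration and is not how Refine implements 
fetching: the function name fetch_in_batches, the JSON checkpoint file and the 
injected fetch callable are all assumptions made for this example. It fetches 
each URL once, sleeps for a throttle delay between calls, and writes everything 
collected so far to disk after every batch, so a crash loses at most the calls 
that were in flight.

```python
import json
import os
import time
from typing import Callable, Dict, List


def fetch_in_batches(
    urls: List[str],
    fetch: Callable[[str], str],  # hypothetical: any function mapping URL -> response text
    checkpoint_path: str,         # hypothetical: JSON file holding results so far
    batch_size: int = 500,
    throttle_seconds: float = 2.0,
) -> Dict[str, str]:
    """Fetch each distinct URL once, saving results to disk after every batch."""
    # Resume from an earlier, interrupted run if a checkpoint already exists.
    results: Dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)

    # De-duplicate while keeping order, and skip URLs that were already fetched.
    pending = [u for u in dict.fromkeys(urls) if u not in results]

    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        try:
            for url in batch:
                results[url] = fetch(url)
                time.sleep(throttle_seconds)
        finally:
            # Persist whatever has been collected, even if fetch() raised.
            # Writing to a temporary file first avoids a half-written checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as f:
                json.dump(results, f)
            os.replace(tmp_path, checkpoint_path)

    return results
```

A real fetch function could be as simple as 
lambda u: urllib.request.urlopen(u).read().decode("utf-8"). Re-running the same 
call with the same checkpoint path after a failure picks up where the previous 
run stopped, and the resulting file can then be brought into Refine in a 
separate step.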

We'll be producing a release candidate containing this bug fix shortly, so 
please retest when that's available.

Original comment by tfmorris on 8 Oct 2011 at 8:05

GoogleCodeExporter commented 9 years ago
Refine 2.5 RC1 is available here: 
http://code.google.com/p/google-refine/downloads/list?can=1

Let us know if it fixes your problem.

Original comment by tfmorris on 29 Oct 2011 at 8:17