qcif / data-curator

Data Curator - share usable open data
MIT License
264 stars 38 forks source link

Poor performance when validating foreign key relationship on large dataset #805

Closed louisjasek closed 6 years ago

louisjasek commented 6 years ago

Current Behaviour (for problems)

When I validate a foreign key relationship on a large dataset, Data Curator stalls. This happens on Windows and Mac

Expected Behaviour

When I validate a foreign key relationship on a large dataset, Data Curator should handle it.

Steps to Reproduce

  1. Open Data Curator using the icon
  2. Download this CSV and remove the first row: https://data.qld.gov.au/dataset/coastal-data-system-near-real-time-storm-tide-data/resource/7afe7233-fae0-4024-bc98-3a72f05675bd
  3. Open the CSV in data curator
  4. Create a new tab with one column called 'H1'. Add values 'abellpoint' and 'weipanx'
  5. Save this new tab as a comma separated csv
  6. Guess columns and fill in mandatory table and package values
  7. On the coastal data table, establish a foreign key relationship between 'site' and 'h1' in the new tab you created/saved.
  8. Click validate
  9. Data Curator stalls

Your Environment

Data Curator version: 0.17.0 Operating System and version: Windows 10 Pro

AND

Data Curator version: 0.17.0 MacOS Sierra 10.12.6

louisjasek commented 6 years ago

@Stephen-Gates again would you mind please adding priority and milestone. This is another high one for me. Would you also be able to test on your machine to see if you are getting the same result?

louisjasek commented 6 years ago

Here is another csv file that is causing this feature to stall Data Curator: https://data.qld.gov.au/dataset/crash-data-from-queensland-roads/resource/177dc50c-0cf7-46ba-8a69-99695aeaa46a

I created foreign keys on gender and severity columns, referencing new reference tables I created using a subset of unique values from these columns.

Stephen-Gates commented 6 years ago

Repeated behaviour with tide data :-(

Stephen-Gates commented 6 years ago

Explore upgrading to the latest version of HandsOnTable

ghost commented 6 years ago

There is some delay due to number of rows to iterate and errors to send back, but we can use a loader here. However the frictionless iterations and Data Curator callbacks are all working fine. The stalling is due to the handsontable comments plugin not handling the large amount of feedback. Will look at streaming/throttling feedback to plugin and will consider upgrading if time permits.

Stephen-Gates commented 6 years ago

Thanks for the update @mattRedBox I remember you said upgrading handsontable was significant https://github.com/ODIQueensland/data-curator/pull/678

ghost commented 6 years ago

Have upgraded to 3.0.0 and tests are running correctly. Just FYI, from what I can see in the commits to hot, the significant changes for us were:

Unfortunately, from what I can see in the hot (Handsontable) library, even the latest version does not allow more than a single cell to register a hot comment (which we use to put those little marks in each cell to show validation errors) - which is a pity as they have changed the 'selection' code to enable multiple cell selections and reselections.

With all the other parts in place (streaming errors as they are received to cell so user has immediate feedback - just need to add a loading blockout - should be straightforward), I'll look at pinching the css from handsontable and implementing ourselves so that we can do 'batch' updates of cells, rather than 1 at a time - if successful, this should provide significant improvement in the wait times.

ghost commented 6 years ago

Ok think I found a compromise within handsontable, similar to what I did in Find/Replace, where I can update array of cell objects, this time adding cell property change - then just need to render the entire hot, bypassing the specific setComment plugin call for an individual cell altogether.

Stephen-Gates commented 6 years ago

I repeated the first example in this issue. The validation completed and speed was acceptable.

screenshot 2018-06-07 06 25 39 screenshot 2018-06-07 06 22 17

Although the errors show in all cells using the triangle, some data doesn't show until you start scrolling.

The data being displayed in the cells is not correct. Only the first 1044 rows are for ablepoint.

screenshot 2018-06-07 06 33 31
Stephen-Gates commented 6 years ago

Another example of the display error after opening the above data from a data package and then validating

screenshot 2018-06-07 07 18 02
ghost commented 6 years ago

Hi @Stephen-Gates Just needed to add back the default renderer to handsontable function we update to show the error colour in cell - will do another release tonight.

Stephen-Gates commented 6 years ago

Confirmed working

screenshot 2018-06-16 07 35 37