qcif / data-curator

Data Curator - share usable open data
MIT License

Ability to handle large data files #989

Open pripley123 opened 4 years ago

pripley123 commented 4 years ago

As a data provider, I want the ability to add metadata for my large data files (10+ million rows) so that I can create metadata for large datasets. (Currently Data Curator seems to slow down with more than 10K rows of data on my laptop.)

A possible approach might be to take a sample of the data, which could be used with the "Guess" feature to generate initial metadata that the user could then modify as desired. When exporting a data package, it would be ideal for the full dataset to be included in the package.
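To make the idea concrete, here is a minimal sketch of sample-based type guessing. It only inspects the first `sample_size` rows to infer a crude type per column, leaving the rest of the file untouched; the function name, the integer/number/string hierarchy, and the widening rule are all illustrative assumptions, not Data Curator's actual "Guess" implementation.

```python
import csv
import io

def guess_types(csv_text, sample_size=1000):
    """Guess a crude type for each column from a sample of rows.

    Illustrative sketch only: starts with the most specific guess
    ('integer') and widens to 'number' then 'string' as
    counterexamples appear in the sample.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    types = ['integer'] * len(header)
    for i, row in enumerate(reader):
        if i >= sample_size:
            break  # only the sample is inspected, not the full file
        for col, value in enumerate(row):
            if types[col] == 'integer':
                try:
                    int(value)
                    continue
                except ValueError:
                    types[col] = 'number'  # widen the guess
            if types[col] == 'number':
                try:
                    float(value)
                    continue
                except ValueError:
                    types[col] = 'string'  # widest guess; stays here
    return dict(zip(header, types))
```

The user could then review and correct these initial guesses before the full dataset is written into the exported package.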

Windows 10 (64-bit), 8 GB RAM, 2.3 GHz

ghost commented 4 years ago

Hi @pripley123 Yes, you're right: as much as we worked at the time on being as efficient as possible for certain dataset sizes, there will be limitations with much larger datasets. One idea I had is similar, I think, to what you're suggesting: tie the amount of data iterated/streamed to how much can actually be shown at any time, so that it is 'lazily loaded' as needed. We can do the same for 'Guess', as you've suggested, just using a sample. I'm not sure at this stage how much effort something like this would involve, but I'll certainly check with our sponsor to see whether they have had a similar need with their typical datasets. The amount of testing/benchmarking for this alone, however, probably means that unless our sponsor also has this need, it will not make it into the upcoming release - apologies.
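The lazy-loading idea above can be sketched as paging through the file on demand. In this sketch, only the rows needed for the current view are materialised, so memory use is bounded by the page size rather than the file size; the function name and paging scheme are hypothetical, not Data Curator's API.

```python
import csv
import itertools

def page_of_rows(path, page, page_size=100):
    """Lazily read one 'page' of rows from a large CSV.

    Illustrative sketch: the file is streamed row by row and only
    `page_size` rows are kept in memory, regardless of file size.
    """
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        start = page * page_size
        # islice consumes the stream lazily; earlier rows are read
        # and discarded, later rows are never touched.
        rows = list(itertools.islice(reader, start, start + page_size))
    return header, rows
```

A UI could call this each time the user scrolls, so a 10-million-row file never needs to be loaded whole.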

ghost commented 4 years ago

Sorry @pripley123 I don't think we're going to have the bandwidth this release to get this one in, due to the work involved. It is an important issue, though (especially as datasets keep getting larger), and it is something I've discussed with our sponsors as worth a look-in for future development.