ibm-watson-data-lab / simple-search-service

A faceted search engine and content API.

Make it easier to upload larger datasets (by url or compressed files) #21

Closed · Rio517 closed this issue 8 years ago

Rio517 commented 8 years ago

For larger datasets, uploading can be a pain. I spent 2-3 hours uploading an 800MB file over a poor connection outside of Berlin. I realize my use case is narrow, but I think the enhancements proposed in the title (upload by URL or support for compressed files) could help others.

In my case, I wound up setting up a remote Ubuntu desktop box, uploaded a gzipped file (91MB) to it, and then uncompressed it there so I could upload it through the SSS GUI.

Good luck and thanks for everyone's hard work so far!

P.S. I realized that @bradnoble and I used to work together back at Mullen in 2004 or 2005. I was a 23yo account guy back then and would be surprised if you remembered me. Glad to see you're doing well. :)

bradnoble commented 8 years ago

Hey @Rio517, I remember you. I'm not that old! Great to see you getting mileage out of the SSS. We'll be pushing updates, and we're always happy to review PRs!

glynnbird commented 8 years ago

As preparatory work for this, I updated couchimport to version 0.3.0. This adds a new function, previewURL, which we can use to get a preview of a file that you have the URL of but haven't uploaded to the SSS. It loads the first 10k of the file and then kills the connection (because the file could be HUGE).

var couchimport = require('couchimport');

// fetch the first ~10k of the remote CSV and parse it into preview rows
couchimport.previewURL('https://s3-eu-west-1.amazonaws.com/glynnbirddotcom/hp.csv', { COUCH_DELIMITER: ',' }, function(err, data) {
  console.log(err, data);
});

null [ { id: '{0FC6F1BF-79C4-401E-9910-0000F5CC2B4A}',
    price: '195000',
    date: '2015-04-16 00:00',
    postcode: 'EN8 7EG',
    a: 'F',
    b: 'N',
    c: 'L',
    building: 'BUTLERS COURT',
    house_number: '5',
    road: 'TRINITY LANE',
    address1: '',
    address2: 'WALTHAM CROSS',
    town: 'BROXBOURNE',
    county: 'HERTFORDSHIRE',
    property_type: 'A' },
  { id: '{CB44E6D8-CD59-4CDD-AD79-0000F773874C}',
    price: '60000',
    date: '2015-04-09 00:00',
    postcode: 'S2 5FW',
    a: 'S',
    b: 'N',
    c: 'F',
    building: '1',
    house_number: '',
    road: 'HASLEHURST ROAD',
    address1: '',
    address2: 'SHEFFIELD',
    town: 'SHEFFIELD',
    county: 'SOUTH YORKSHIRE',
    property_type: 'A' },
  { id: '{B548CACA-5D17-4B6A-ADF4-0002188D07F0}',
    price: '248000',
    date: '2015-04-24 00:00',
    postcode: 'BR5 3BQ',
    a: 'S',
    b: 'N',
    c: 'F',
    building: '2',
    house_number: '',
    road: 'HORSELL ROAD',
    address1: '',
    address2: 'ORPINGTON',
    town: 'BROMLEY',
    county: 'GREATER LONDON',
    property_type: 'A' } ]

This is the same preview technology that is used in the existing SSS, but until now it only worked for uploaded files.

glynnbird commented 8 years ago

Then we can use the pre-existing couchimport.importStream to do the actual import without downloading the whole file:

e.g.

var couchimport = require('couchimport');
var request = require('request');

// stream the remote CSV directly into CouchDB without downloading the whole file first
couchimport.importStream(request.get('http://s3-eu-west-1.amazonaws.com/glynnbirddotcom/hp.csv'), { COUCH_URL: 'http://localhost:5984', COUCH_DATABASE: 'mydb', COUCH_DELIMITER: ',' }, function(err, data) {
  console.log(err, data);
});

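For the compressed-file part of this issue, the same stream interface should be enough. As a rough, untested sketch (the .csv.gz URL below is just a placeholder, not a real file), a gzipped remote CSV could be gunzipped on the fly with Node's zlib and the decompressed stream handed to importStream:

var zlib = require('zlib');
var request = require('request');
var couchimport = require('couchimport');

// hypothetical: fetch a gzipped CSV by URL, decompress it as it streams in,
// and feed the plain-text CSV stream to couchimport
var csvStream = request.get('https://example.com/hp.csv.gz').pipe(zlib.createGunzip());

couchimport.importStream(csvStream, { COUCH_URL: 'http://localhost:5984', COUCH_DATABASE: 'mydb', COUCH_DELIMITER: ',' }, function(err, data) {
  console.log(err, data);
});

The same pipeline would cover uploaded .gz files too, by swapping the request stream for the uploaded file's read stream.
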
bradnoble commented 8 years ago

This is good to go from our POV. @Rio517, please reopen if you find issues.