hpcc-systems / EDA

EDA project

FREQ code generation method - scaling issues with high column count #18

Closed dabayliss closed 10 years ago

dabayliss commented 11 years ago

The current method of generating FREQ reads the file once for each column being FREQ'd. For small files or low column counts that is no big deal; if we ever want to go to more industrial sizes then we need to: a) blow the file up into one record per field, and b) aggregate all columns at once.

FWIW: SALT < 1.6 did it the way you are doing it - the vehicle file takes about 24 hours this way :( The new way takes < 30 minutes
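For illustration, the per-column pattern described in this comment looks roughly like the sketch below. This is not the EDA project's actual generated code: the record layout, the inline data, and all definition names (PersonRec, DS, FreqFirst, and so on) are hypothetical.

```ecl
// Hypothetical input layout and inline data, for illustration only.
PersonRec := RECORD
  STRING20  firstname;
  STRING20  lastname;
  UNSIGNED4 zip;
END;

DS := DATASET([{'ANN', 'LEE', 10001},
               {'BOB', 'LEE', 2139},
               {'ANN', 'RAY', 10001}], PersonRec);

// One grouped TABLE per column: each definition is a separate
// aggregation over DS, so the work grows with the column count.
FreqFirst := TABLE(DS, {DS.firstname, UNSIGNED8 frequency := COUNT(GROUP)}, firstname);
FreqLast  := TABLE(DS, {DS.lastname,  UNSIGNED8 frequency := COUNT(GROUP)}, lastname);
FreqZip   := TABLE(DS, {DS.zip,       UNSIGNED8 frequency := COUNT(GROUP)}, zip);

OUTPUT(FreqFirst, NAMED('FreqFirst'));
OUTPUT(FreqLast,  NAMED('FreqLast'));
OUTPUT(FreqZip,   NAMED('FreqZip'));
```

With three columns this is harmless; with hundreds of columns on a large file, the repeated aggregations are where the time goes.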

jchambers-ln commented 11 years ago

Thanks David, that is a worthwhile change in efficiency -- I'm assigning it to Keshav (the ECL developer on this) to take a look at refining it. Keshav, let me know if you need more details.

dabayliss commented 11 years ago

If you look at the univariate logic:

OutDS := NORMALIZE(MyDS,5, TRANSFORM(NumField,SELF.id:=LEFT.uid,SELF.number:=COUNTER;SELF.value:=CHOOSE

is essentially what you need ... (not the uid part)

keshavshrikant commented 11 years ago

Do you mean the following:

OutDS := NORMALIZE(DS, 3,
  TRANSFORM(NumField,
    SELF.field := CHOOSE(COUNTER, 'firstname', 'lastname', 'zip');
    SELF.value := CHOOSE(COUNTER, LEFT.firstname, LEFT.lastname, LEFT.zip)));

FreqRec := RECORD
  OutDS.field;
  OutDS.value;
  INTEGER frequency := COUNT(GROUP);
END;

FreqTable := TABLE(OutDS, FreqRec, field, value, MERGE);

OUTPUT for each unique field

dabayliss commented 11 years ago

Yes.

Note: during the output you will have to recast back to the original field type to get the collation correct.
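A minimal end-to-end sketch of the approach discussed in this thread, including the recast on output, might look like the following. It is an illustration under stated assumptions, not the code that was eventually uploaded: the layouts, the inline data, and the names PersonRec, FieldValueRec, Explode, ZipFreqRec and ZipFreq are all hypothetical, and every value is assumed to be carried as a STRING after the NORMALIZE.

```ecl
// Hypothetical layouts and inline data, for illustration only.
PersonRec := RECORD
  STRING20  firstname;
  STRING20  lastname;
  UNSIGNED4 zip;
END;

DS := DATASET([{'ANN', 'LEE', 10001},
               {'BOB', 'LEE', 2139},
               {'ANN', 'RAY', 10001}], PersonRec);

// One record per (field name, value) pair; values are carried as strings.
FieldValueRec := RECORD
  STRING20 fieldname;
  STRING   value;
END;

FieldValueRec Explode(PersonRec L, UNSIGNED C) := TRANSFORM
  SELF.fieldname := CHOOSE(C, 'firstname', 'lastname', 'zip');
  SELF.value     := CHOOSE(C, TRIM(L.firstname), TRIM(L.lastname), (STRING)L.zip);
END;

// a) Blow the file up into one record per field.
OutDS := NORMALIZE(DS, 3, Explode(LEFT, COUNTER));

// b) Aggregate all columns in a single grouped TABLE.
FreqRec := RECORD
  OutDS.fieldname;
  OutDS.value;
  UNSIGNED8 frequency := COUNT(GROUP);
END;
FreqTable := TABLE(OutDS, FreqRec, fieldname, value, MERGE);

// Recast on output: as strings, '10001' collates before '2139';
// cast back to the original UNSIGNED4 so the sort is numeric.
ZipFreqRec := RECORD
  UNSIGNED4 zip;
  UNSIGNED8 frequency;
END;
ZipFreq := PROJECT(FreqTable(fieldname = 'zip'),
                   TRANSFORM(ZipFreqRec,
                             SELF.zip := (UNSIGNED4)LEFT.value,
                             SELF.frequency := LEFT.frequency));

OUTPUT(SORT(ZipFreq, zip), NAMED('ZipFreq'));
```

The same filter-and-project step would be repeated for each field, casting `value` back to that field's own type before sorting or displaying it.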

sreekanthmenon commented 10 years ago

Rectified and uploaded. Let us know if there are any other issues.