usc-isi-i2 / table-linker

Table Linker
MIT License

command: context-match is too slow #40

Open saggu opened 3 years ago

saggu commented 3 years ago

Running the command on the smallest file (1106 rows) takes about 11 seconds of wall-clock time (about 8 seconds of user CPU time):

time tl context-match 1438042989018_40_20150728002309-00067-ip-10-236-191-2_57714692_2.csv --context-file 1438042989018_40_20150728002309-00067-ip-10-236-191-2_57714692_2_context.tsv  -o context_score > context_test.csv

real    0m10.968s
user    0m8.024s
sys     0m0.972s

Running it on the largest file (58446 rows) took more than 7 minutes and still had not finished (I had to interrupt it):

time tl context-match 88523363_0_8180214313099580515.csv --context-file 88523363_0_8180214313099580515_context.tsv  -o context_score > context_test.csv

I changed the way the context file is read and used, hoping this would fix it, but it didn't help much.

Most of the time is spent in computing the score and not in I/O. This needs to be optimized before it can be used.
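One way to check where the time goes is to run the loading and scoring steps under cProfile and compare cumulative times. The sketch below is only illustrative: load_rows and compute_scores are hypothetical stand-ins, not functions from table-linker, and the real command would have to be profiled in their place.

```python
import cProfile
import io
import pstats


def load_rows(n):
    # Hypothetical stand-in for reading the input and context files.
    return [(i, f"label {i}") for i in range(n)]


def compute_scores(rows):
    # Hypothetical stand-in for the per-row scoring loop.
    return [sum(ord(c) for c in label) / (i + 1) for i, label in rows]


def profile_run(n=2000):
    # Profile both steps together so their cumulative times can be compared.
    profiler = cProfile.Profile()
    profiler.enable()
    rows = load_rows(n)
    scores = compute_scores(rows)
    profiler.disable()

    # Render the ten most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return scores, buffer.getvalue()


scores, report = profile_run()
print(report)
```

If the scoring function dominates the cumulative column, that would support the observation that the cost is in computing the score and not in I/O.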

Please take a look. We can discuss it in a meeting if required.

All the files required are attached.

data.zip

HardiRathod commented 3 years ago

Replaced iterrows() with zip() when iterating over the dataframe to reduce run time. The largest file (58446 rows) now takes 3 minutes 35 seconds to complete.
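For reference, a minimal sketch of the iterrows() to zip() change on a toy dataframe. The column names (label, weight) and the scoring expression are made up for illustration and are not the actual context-match code. iterrows() builds a pandas Series for every row, while zip() over the columns yields plain values, which is usually where the saving comes from.

```python
import pandas as pd


def score_with_iterrows(df):
    # Slow pattern: each iteration constructs a Series for the row.
    scores = []
    for _, row in df.iterrows():
        scores.append(len(row["label"]) * row["weight"])
    return scores


def score_with_zip(df):
    # Faster pattern: iterate over the column values directly.
    return [
        len(label) * weight
        for label, weight in zip(df["label"], df["weight"])
    ]


# Toy data with hypothetical columns.
df = pd.DataFrame({"label": ["a", "bb", "ccc"], "weight": [1.0, 2.0, 3.0]})

# Both versions should produce the same scores.
assert score_with_iterrows(df) == score_with_zip(df)
```

If the per-row computation can be expressed with column operations, a fully vectorized version may cut the time further, but that depends on what the scoring logic actually does.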