usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

ETL engine runs ETK multiple times for a document in dev mode #192

Closed majidghgol closed 6 years ago

majidghgol commented 6 years ago

@GreatYYX I have one document in my myDIG project, and the ETK log suggests that ETK is being called on it repeatedly, in a loop:

processing xx
LOG: xx,TOTAL,TOTAL,50.1288118362
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,52.0306711197
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,50.8498630524
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,57.9583759308
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,55.2287499905
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,56.1295881271
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,53.5118689537
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,49.9830069542
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,47.5843110085
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,52.8449418545
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,48.4429159164
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,57.7824909687
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,58.7444729805
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,52.4612529278
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,57.1368191242
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,50.5731852055
correct version2!
detected nested_docs
done
processing xx
LOG: xx,TOTAL,TOTAL,49.0617201328
correct version2!
detected nested_docs
done
processing xx

Note that in the ETK output above, "correct version2!" means that my local ETK is being run. Also, although nested documents are detected, I have commented out the nested-docs piece in run_core_kafka, so the documents being processed are not nested docs. This is also visible from the printed doc_id "xx", which I assigned to the original document; the nested documents have different doc_ids.

GreatYYX commented 6 years ago

This is probably a ZooKeeper/Kafka issue: Kafka fails to commit the offset to the coordinator, so consumers get the same data again and again. Have you solved this problem by freeing up more memory or tuning the Kafka config? There is also another way to commit offsets (not implemented in myDIG): commit manually and hold consumers back from fetching the next message until the coordinator has received the last commit.
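To make the failure mode concrete, here is a toy in-memory model, not Kafka itself. `ToyPartition` and `run` are hypothetical names, and the model assumes each round resumes from the last committed offset, as a consumer does after a rebalance or restart. When the commit never reaches the coordinator, the same message is fetched every round.

```python
class ToyPartition:
    """In-memory stand-in for one partition plus its committed offset (not Kafka)."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset the coordinator believes has been consumed

    def fetch(self):
        """Return the message at the committed offset, or None when caught up."""
        if self.committed < len(self.messages):
            return self.messages[self.committed]
        return None

    def commit(self, succeed=True):
        """Advance the committed offset only if the commit reaches the coordinator."""
        if succeed:
            self.committed += 1
        return succeed


def run(partition, commit_succeeds, max_rounds=5):
    """Simulate a consumer that resumes from the committed offset on every round."""
    seen = []
    for _ in range(max_rounds):
        message = partition.fetch()
        if message is None:
            break  # nothing left to consume
        seen.append(message)
        partition.commit(succeed=commit_succeeds)
    return seen


if __name__ == "__main__":
    # Failed commits: the single document "xx" is delivered on every round.
    print(run(ToyPartition(["xx"]), commit_succeeds=False))
    # Successful commits: "xx" is delivered once and the loop ends.
    print(run(ToyPartition(["xx"]), commit_succeeds=True))
```

With failing commits the model keeps returning the same document until `max_rounds` is reached, which matches the repeated "processing xx" lines in the log above.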

majidghgol commented 6 years ago

@GreatYYX I added a consumer.commit() line in run_core_kafka, right after getting the message from the consumer, and turned off auto commit in the ETL config. That solved the issue for me.
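A minimal sketch of that change, for anyone hitting the same loop. This is not the actual run_core_kafka code: `consume_with_manual_commit`, `handle`, and `FakeConsumer` are hypothetical names, and the function only assumes an object that is iterable and has a `commit()` method, as kafka-python's `KafkaConsumer` does when created with `enable_auto_commit=False`.

```python
def consume_with_manual_commit(consumer, handle):
    """Commit the offset right after receiving each message, then process it.

    `consumer` is assumed to be iterable and to expose commit(), for example
    a kafka-python KafkaConsumer created with enable_auto_commit=False.
    Returns the number of messages handled.
    """
    handled = 0
    for message in consumer:
        consumer.commit()  # commit before the (possibly slow) processing step
        handle(message)
        handled += 1
    return handled


class FakeConsumer:
    """Hypothetical in-memory consumer so the sketch runs without a broker."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.commits = 0

    def __iter__(self):
        return iter(self._messages)

    def commit(self):
        self.commits += 1


if __name__ == "__main__":
    consumer = FakeConsumer(["xx"])
    processed = []
    consume_with_manual_commit(consumer, processed.append)
    print(processed, consumer.commits)
```

Committing before processing trades redelivery for possible loss: if processing crashes after the commit, that message is not delivered again. Committing after `handle(message)` gives the opposite trade-off.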

GreatYYX commented 6 years ago

OK, thx.

GreatYYX commented 6 years ago

Closing this issue. In the latest version, the ETK consumer force-commits the offset.