Currently, data collected from each API call is held in memory.
Only once the whole collection is complete is everything in memory written to disk (JSON).
For example:
Start process to collect 5000 messages in total...
Collect 500 messages per API call...
-> collect 500 messages (totaling 500)
-> collect 500 messages (totaling 1000)
-> collect 500 messages (totaling 1500)
-> collection fails (various reasons: RAM is full, network error, API usage limit exceeded, etc.)
-> the 1500 messages that were collected and held in memory are lost
Change this design so that after every API call, the data is written directly to disk. This will reduce performance because of the constant file I/O, but it increases reliability: if the data collection fails, the data already collected in the current run is not lost.
For example:
Start process to collect 5000 messages in total...
Collect 500 messages per API call...
-> collect 500 messages (totaling 500)
-> collect 500 messages (totaling 1000)
-> collect 500 messages (totaling 1500)
-> collection fails (various reasons: RAM is full, network error, API usage limit exceeded, etc.)
-> the 1500 messages that were collected have already been written to disk (JSON)
-> however, do not update the offset value in the database, so that the next collection starts again from the same offset. Yes, there will be duplicates, but that's okay: if Elasticsearch sees a record whose ID already exists, it will just perform an update operation.
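The per-call write described above could look roughly like the sketch below. It is only an illustration under assumptions: `collect_to_disk` and `fetch_batch` are hypothetical names, not existing functions in this project, and the output is assumed to be a JSON Lines file (one JSON object per line) so that each batch can be appended without rewriting the file. Updating the offset in the database is deliberately left to the caller, and should only happen when the function returns normally.

```python
import json
from typing import Callable, Dict, List


def collect_to_disk(
    fetch_batch: Callable[[int, int], List[Dict]],  # hypothetical: (offset, limit) -> messages
    out_path: str,
    total: int,
    batch_size: int = 500,
    start_offset: int = 0,
) -> int:
    """Append every fetched batch to out_path immediately; return messages written."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        while written < total:
            limit = min(batch_size, total - written)
            # If this call raises, the exception propagates to the caller,
            # but all earlier batches are already in the file.
            batch = fetch_batch(start_offset + written, limit)
            if not batch:
                break  # the API has no more messages
            for message in batch:
                fh.write(json.dumps(message) + "\n")
            fh.flush()  # push the batch out of Python's buffer after each call
            written += len(batch)
    return written
```

If `fetch_batch` raises on the fourth call, the file still holds the 1500 messages from the first three calls. Because the caller never reaches the offset update, the next run starts from the same offset and re-sends those records, which is the duplicate case described above.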
We can either make this change permanent, or we can allow the user to specify what they want in the CLI arguments.
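If the behavior is made configurable, the CLI option might be sketched as follows. The flag name `--write-mode` and its values `per-call` and `memory` are assumptions made for illustration; the project's actual argument names may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical flag selecting the write strategy."""
    parser = argparse.ArgumentParser(description="Collect messages from an API.")
    parser.add_argument(
        "--write-mode",
        choices=("per-call", "memory"),
        default="per-call",  # assumed default: favor reliability over speed
        help="per-call: write to disk after every API call; "
             "memory: keep everything in memory and write once at the end",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"write mode: {args.write_mode}")
```

Defaulting to `per-call` keeps the safer behavior unless the user explicitly opts into the faster in-memory mode.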