Download and recover partially collected data

Currently, data collected from each API call is held in memory. Only once the whole collection is complete is the in-memory data written to disk (JSON).

For example

Start process to collect 5000 messages in total...
Collect 500 messages per API call...

-> collect 500 messages (totaling 500)
-> collect 500 messages (totaling 1000)
-> collect 500 messages (totaling 1500)
-> collection fails (various reasons: RAM is full, network error, API overuse, etc.)
    -> the 1500 messages that were collected and held in memory are lost

Change this design so that after every API call, the data is written directly to disk. This will reduce performance because of the constant file I/O, but it will increase reliability: if the collection fails, the data already collected in the current run is not lost.

For example

Start process to collect 5000 messages in total...
Collect 500 messages per API call...

-> collect 500 messages (totaling 500)
-> collect 500 messages (totaling 1000)
-> collect 500 messages (totaling 1500)
-> collection fails (various reasons: RAM is full, network error, API overuse, etc.)
    -> the 1500 messages collected so far have already been written to disk (JSON)
    -> however, do not update the offset value in the database, so that the next collection starts again from the same offset. Yes, this produces duplicates, but that's okay: if Elasticsearch sees a record whose ID already exists, it will just perform an update operation
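
A rough sketch of what the per-call write could look like. Everything here is an assumption for illustration, not existing code in the repo: `collect_incrementally`, `fetch_batch`, and the parameter names are hypothetical, and the output is JSON Lines (one JSON object per line) rather than a single JSON array, because appending to a JSON Lines file does not require rewriting the file on every call.

```python
import json
from pathlib import Path


def collect_incrementally(fetch_batch, out_path, start_offset, total, batch_size=500):
    """Append each batch to a JSON Lines file as soon as it arrives.

    fetch_batch(offset, batch_size) is a hypothetical callable that returns a
    list of JSON-serializable messages and may raise on failure.

    Returns the new offset only if the run completes. If fetch_batch raises,
    the exception propagates, so the caller never stores a new offset, while
    the batches written so far stay on disk.
    """
    out_path = Path(out_path)
    offset = start_offset
    collected = 0
    while collected < total:
        batch = fetch_batch(offset, batch_size)  # may raise (network, API limits, ...)
        if not batch:
            break  # nothing left to collect
        # Write this batch before asking for the next one.
        with out_path.open("a", encoding="utf-8") as f:
            for message in batch:
                f.write(json.dumps(message, ensure_ascii=False) + "\n")
        offset += len(batch)
        collected += len(batch)
    return offset
```

The caller would update the offset in the database only with the value returned by a successful run. A failed run leaves the old offset in place, so the next run re-collects the same range and relies on the duplicate handling described above.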

We can either make this change permanent, or we can allow the user to specify what they want in the CLI arguments.

python scrape.py --optimize-performance
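
If we go with the CLI option, the flag could be wired up roughly as below. This is only a sketch: `parse_args` is a hypothetical helper, and the meaning given to the flag (buffer in memory and write once at the end) is my assumption about what `--optimize-performance` would select; the default would then be the more reliable per-call write.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line for scrape.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Collect messages")
    parser.add_argument(
        "--optimize-performance",
        action="store_true",
        # Assumed meaning: keep the current in-memory behavior.
        help="buffer data in memory and write once at the end (faster, less reliable)",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    mode = "in-memory" if args.optimize_performance else "write per API call"
    print(f"collection mode: {mode}")
```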

kienmarkdo / Telegram-OSINT-for-Cyber-Threat-Intelligence-Analysis