With the call-block-logs-extraction I did on the rugpullindex.com Erigon node, we end up with large files (~100GB) that can end up corrupted if e.g. the process crashes during crawling (which happened a couple of times).
Here is one such instance:
SyntaxError: JSON parsing error: "00000000000000000000000025a45a27339c800","blockNumber":"0xb155f3","transactionHash":"0xc8fcdb0733561cb9665c5bfc14992f0d98f3073ec83c9daf304a9fec291ac8f6","transactionIndex":"0x2c","blockHash":"0x0a635c6e7202324e91dbe7a35af899bc559723eb460ce95940860b9dd08737d8","logIndex":"0x1e","removed":false},{"address":"0xbd356a39bff2cada8e9248532dd879147221cf76","topics":["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x0000000000000000000000008a7bf3754bfea3ed2be091bd163ad9633fd6d6ba","0x000000000000000000000000581ab96ccbd41eeed17050c5bd9a4d3e5139ee29"],"data":"0x000000000000000000000000000000000000000000000002b5e3af16b1880000","blockNumber":"0xb155f3","transactionHash":"0x8a96e8285d755b9418f2d072a46f3cfa8c8ac96c3eb94d8adeb2ace3df6b0516","transactionIndex":"0x31","blockHash":"0x0a635c6e7202324e91dbe7a35af899bc559723eb460ce95940860b9dd08737d8","logIndex":"0x1f","removed":false},{"address":"0xaaaebe6fe48e54f431b0c390cfaf0b017d09d42d","topics":["0xddf252ad1be2c89b69c2b068fc378daa9", "SyntaxError: Unexpected number in JSON at position 1
It's very painful to manually error-correct these files. They cannot be edited in place, so the temporary file needed for the fix uses up a similar amount of disk space.
Had we used a database with atomic transactions instead, it probably would have prevented the file's data from getting corrupted in the first place.
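As a minimal sketch of what atomic transactions buy us (table and file names here are hypothetical, not from the actual extraction code): with sqlite, a crash before COMMIT doesn't mangle the file; on the next open, the half-written transaction is simply gone and the database remains readable.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "logs.db")

db = sqlite3.connect(path)
db.execute("CREATE TABLE logs (block TEXT, body TEXT)")
db.execute("INSERT INTO logs VALUES (?, ?)", ("0xb155f3", "{}"))
db.commit()  # this row is now durably on disk

# Simulate a crash mid-write: start another insert but never commit it.
db.execute("INSERT INTO logs VALUES (?, ?)", ("0xb155f4", '{"partial":'))
db.close()  # uncommitted changes are rolled back, as they would be after a crash

# Reopen: the file is still a valid database, only the committed row exists.
db = sqlite3.connect(path)
rows = db.execute("SELECT block FROM logs").fetchall()
print(rows)  # [('0xb155f3',)]
```

Contrast this with appending JSON to a flat file, where a crash mid-write leaves a truncated fragment like the one above baked into 100GB of data.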
The only real requirements we have for a database are:
- it can write really fast
- it is embedded and hence only produces an output that represents the database (e.g. sqlite or leveldb)
Keep in mind that at this stage we're not even interested in storing structured data yet. We just want to safely write a lot of data to disk and make sure the file doesn't get corrupted. So, e.g., building a relational database table for the JSON schema you're seeing above would be a mistake at the extraction stage, IMO.
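Concretely, that could mean treating sqlite as a dumb blob store: one column for the block number, one for the raw, unparsed JSON response. The schema and payloads below are illustrative assumptions, not the actual extraction code; batching many rows into a single transaction is what keeps write throughput high.

```python
import sqlite3

# Use a real file path in practice; :memory: keeps this sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw (block INTEGER PRIMARY KEY, payload TEXT)")

# Raw eth_getLogs responses stored as opaque text -- no parsing, no schema.
batch = [
    (0xB155F3, '{"result": "..."}'),
    (0xB155F4, '{"result": "..."}'),
]
with db:  # one atomic transaction for the whole batch
    db.executemany("INSERT INTO raw VALUES (?, ?)", batch)

count = db.execute("SELECT COUNT(*) FROM raw").fetchone()[0]
print(count)  # 2
```

Parsing and normalizing into relational tables can then happen later, as a separate pass over data that is guaranteed to be intact.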