Closed JeffLowe closed 2 years ago
@dchoi27 This is the discovery we brought up in this morning's niftysave triage meeting. The description should have the very high-level context needed for cross-team understanding. Just flagging this for Nitro team awareness per the agenda.
We measured the current fill script speed this morning, and it's running too slow (it would take roughly 1 year to fill the gaps).
How was this measured? The estimate seems surprising, because the previous crawl took a month at most even counting all the times we had to stop it. Granted, it is missing ~50% of NFTs, but even then I would expect it to take 2x the previous crawl, which would be more like 2 months.
A few things to consider:
Since https://github.com/ipfs-shipyard/nft.storage/pull/601 we walk the (sub)graph by mintTime. If the current estimate is accurate, we can slice up the time from the first token to now into 12 pieces and run parallel scrapers that each pull things minted in a specific time range. More generally, if we want to be done in X hours, we can slice the time into X-hour ranges and run as many parallel scrapes as it takes (although NFTs are definitely not evenly distributed over time, so our estimates are going to be inaccurate).
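The time-slicing idea above can be sketched in a few lines. This is an illustrative helper, not code from the repo; the name `sliceTimeRange` is hypothetical:

```javascript
// Hypothetical sketch: split the overall mintTime range into equal
// segments so independent scrapers can each walk one slice.
function sliceTimeRange(startMs, endMs, segments) {
  const size = Math.ceil((endMs - startMs) / segments)
  const slices = []
  for (let i = 0; i < segments; i++) {
    slices.push({
      from: startMs + i * size,
      to: Math.min(startMs + (i + 1) * size, endMs),
    })
  }
  return slices
}
```

Each slice can then be handed to its own scraper process; because NFTs are unevenly distributed over time, equal-width slices will finish at different times.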
I would argue that we should finish all the other migration tasks, e.g. #597, #595, #613, etc., before rushing to pull all the NFTs into the db. More data is only going to make those harder to do later, especially schema changes.
Replicating data from the (sub)graph isn't really all that useful unless we actually fetch and archive all the content it refers to, and I would argue making improvements there should be a higher priority, as our success rate there is not good.
Here are a few things that can affect the speed:
I think we need to measure ingestion speed (so not just new records, but total records processed).
I'm going to propose that I make a dataclip or some readout showing 'records per minute' vs 'new records per minute'. There are a few ways to build this, based on our discussion. I think it would also help us project future performance.
From Slack (@Gozala posted): "...dataclip with some numbers on how slow niftysave is going how much it indexed, reindexed and how much is remaining" https://data.heroku.com/dataclips/cmlmjwvsbiohfczrpzosolgsnkgl
I put together a dataclip to provide some numbers https://data.heroku.com/dataclips/cmlmjwvsbiohfczrpzosolgsnkgl
With current numbers it appears that at the current speed it would take almost 10 months to complete a rescan. Please note that this does not mean it would take that long to fill the gaps; it means that in the worst-case scenario it would take this long.
According to these stats we write around 86 NFTs per minute, so just barely more than one NFT per second. While I do not know what the average write speed was before, I think it is reasonable to assume it was much faster. I think Fauna aborts queries that take over 2 minutes, and our batched writes were pushing around 1000 NFTs per write, so in the worst-case scenario the speed would have been 500 NFTs per minute. At that speed a rescan would have taken less than 2 months, which matches what it actually took.
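The arithmetic above can be sanity-checked in a few lines. All numbers are taken from this thread (34M expected records, 86 NFTs/min observed, ~500 NFTs/min with batching), nothing here is newly measured:

```javascript
// Back-of-the-envelope check of the rescan estimates.
const totalRecords = 34_000_000 // expected total per #556
const observedRate = 86 // NFTs written per minute (from the dataclip)
const batchedRate = 500 // ~1000 NFTs/write within Fauna's ~2 min query limit

const minutesPerMonth = 60 * 24 * 30
const monthsAt = (rate) => totalRecords / rate / minutesPerMonth

console.log(monthsAt(observedRate).toFixed(1)) // ≈ 9.2 months at current speed
console.log(monthsAt(batchedRate).toFixed(1)) // ≈ 1.6 months with batched writes
```

This matches both the "almost 10 months" projection and the "less than 2 months" the previous batched crawl actually took.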
From the above I would guess that the switch from batched writes to one-at-a-time writes had a very negative impact on throughput, which is why I would suggest trying the following:
While there is a chance that postgres is our bottleneck, in which case batching will not help, I highly doubt that. If Hasura is rate limiting us, which might be the case, batching should improve our throughput. General network overhead would also be reduced by batching.
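A minimal sketch of what batched writes could look like, assuming a `writeBatch` function that performs one bulk insert (the real Hasura mutation and the niftysave write path may look quite different):

```javascript
// Illustrative batching wrapper: buffer records and flush them in
// groups, so a group of N records costs one network round-trip
// instead of N.
function batchedWriter(writeBatch, batchSize = 100) {
  let buffer = []
  return {
    async push(record) {
      buffer.push(record)
      if (buffer.length >= batchSize) await this.flush()
    },
    async flush() {
      if (buffer.length === 0) return
      const batch = buffer
      buffer = []
      await writeBatch(batch) // one bulk write for the whole group
    },
  }
}
```

A final `flush()` is needed at the end of a run to write out any partial batch; batch size can then be tuned against Hasura/postgres limits.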
As per https://github.com/ipfs-shipyard/nft.storage/issues/618#issuecomment-944624427 we can also time-slice to parallelize the scan and speed things up even more.
I would also like to call out that we can backfill gaps a lot faster too, because we can infer where the gaps are in our dataset and scan only those ranges. That said, I think it would be better to improve overall throughput so that we can rescan the whole chain in weeks, not months.
@the-simian after trying batched writes, I would suggest utilizing concurrency. There is a `config.concurrency` setting used by other jobs:

https://github.com/ipfs-shipyard/nft.storage/blob/7efc57f6313ec698d3c53f0c5bcbf021dc79a3f9/packages/niftysave/src/analyze.js#L172-L181

We could use it to slice time into `config.concurrency` segments and run parallel tasks operating on those segments. This would allow us to tweak batch size and concurrency until we hit postgres write limits.
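Combining the two ideas might look like the following sketch. `scanRange` is a hypothetical worker that scrapes one mintTime range; this is not the repo's actual code, just an illustration of running `concurrency` slices in parallel:

```javascript
// Illustrative sketch: split the scan window into `concurrency`
// time slices and scan them concurrently. Overall throughput is
// then roughly (batch size × concurrency), tunable until postgres
// write limits are hit.
async function parallelScan(startMs, endMs, concurrency, scanRange) {
  const size = Math.ceil((endMs - startMs) / concurrency)
  const tasks = []
  for (let i = 0; i < concurrency; i++) {
    const from = startMs + i * size
    const to = Math.min(from + size, endMs)
    tasks.push(scanRange(from, to))
  }
  return Promise.all(tasks)
}
```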
Quick update: this is not the best resolution (because only some time has elapsed), but the ongoing estimate has gone from 10 months to about 4.5 months (Edit: this number keeps decreasing; it went from 3.12 to 2.5, which seems correct). It is safe to say the batches alone have made a huge improvement, and there's probably more we can do to speed things up as well.
References #634, which is tracked separately.
From the work on #556 we discovered the following:
There should be a total of 34 million records in niftysave. We currently have 18 million.
We are running a gap-fill scrape of the blockchain to fill this gap. We measured the current fill script speed this morning, and it's running too slow (it would take roughly 1 year to fill the gaps). We need to determine a way to speed this gap filling up.
Thus, we will spend time investigating ways to speed up the gap fill.