rfornea opened this issue 6 years ago
Could it be related to this? https://github.com/go-sql-driver/mysql/issues/314
Do you have a consistent way to reproduce this?
I reproduced it twice. The third time I tried to reproduce it, I didn't get any errors at all, but the upload simply never worked.
Basically you build your brokers without the -d flag, then do an upload that's large enough to trigger the problem. You should see the request to start the session come in, and then the POST with the chunks. Then, after waiting a little while, this error pops up.
It could be related to that.
Could this actually be caused by the bulk inserts we introduced recently?
It seems the issue is that data is still sitting in the buffer while we're trying to use it for something else. Considering that row inserts in MariaDB have a maximum allowed size, and that this only happens on big files, maybe we're trying to insert too much data at once: the insert doesn't go through, but instead of throwing an immediate error it waits a little before trying to insert the rest of the data, while other parts of the code see a successful insert and assume the buffer must be empty.
Another related possibility is that it only inserts what fits in a single operation, the rest of the buffer is never read, and the insert operation still returns success. Then the next time we try to do anything with the buffer, it's already full of the unread data and it fails.
I'm just throwing out wild guesses here as I've not tested anything, but perhaps it's related to something like this.
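If anyone wants to sanity-check the size guess, the relevant server-side limit is max_allowed_packet: MariaDB/MySQL reject any single statement bigger than that. A quick way to inspect it from Go (just a sketch using database/sql and github.com/go-sql-driver/mysql; the DSN here is a placeholder, not our real config):

db, err := sql.Open("mysql", "root:password@tcp(172.18.0.3:3306)/brokernode")
if err != nil {
	log.Fatal(err)
}
defer db.Close()

var name, value string
// Any single INSERT statement larger than this value gets rejected by the server.
if err := db.QueryRow("SHOW VARIABLES LIKE 'max_allowed_packet'").Scan(&name, &value); err != nil {
	log.Fatal(err)
}
fmt.Printf("%s = %s bytes\n", name, value)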
@pzhao5 any ideas? Could the bulk inserts have introduced this? Are we batching the bulk inserts into smaller batches? I wonder if it's using up too much memory when the bulk insert is too large...
To be fair, we never tried such a big upload, even before the bulk inserts were in place. This may have happened beforehand if we had used large enough files.
Let me break it into chunks
Submitted: https://github.com/oysterprotocol/brokernode/pull/282
Can you test it again?
I was able to do a significantly larger upload this time. But now I'm getting a different error:
app_1 | [mysql] 2018/05/12 20:17:08 packets.go:36: read tcp 172.18.0.4:59122->172.18.0.3:3306: i/o timeout
db_1 | 2018-05-12 20:17:08 23687 [Warning] Aborted connection 23687 to db: 'brokernode' user: 'root' host: '172.18.0.4' (Got an error writing communication packets)
Tried again, and got this again:
app_1 | [mysql] 2018/05/12 20:20:26 packets.go:72: read tcp 172.18.0.4:59998->172.18.0.3:3306: i/o timeout
app_1 | [mysql] 2018/05/12 20:20:26 packets.go:436: busy buffer
db_1 | 2018-05-12 20:20:27 24123 [Warning] Aborted connection 24123 to db: 'brokernode' user: 'root' host: '172.18.0.4' (Got an error writing communication packets)
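If these turn out to be slow queries blowing past the driver's read deadline, go-sql-driver/mysql lets you raise the timeouts through the DSN. A rough sketch in case it helps with debugging (the DSN and durations are placeholders, not what the broker actually uses):

// readTimeout/writeTimeout are standard go-sql-driver/mysql DSN parameters;
// the "i/o timeout" in packets.go is the driver giving up on a read/write.
dsn := "root:password@tcp(172.18.0.3:3306)/brokernode?readTimeout=2m&writeTimeout=2m"
db, err := sql.Open("mysql", dsn)
if err != nil {
	log.Fatal(err)
}
defer db.Close()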
That error was too general. Is there a way to verify that the DB was updated?
I think I may have fixed this error, and another one that came up shortly after. I added a bunch of logging statements to the brokernode and figured out what line the error was coming from:
var dms []models.DataMap
rawQuery := fmt.Sprintf("SELECT * from data_maps WHERE %s", strings.Join(sqlWhereClosures, " OR "))
err := models.DB.RawQuery(rawQuery).All(&dms) <-----
I think the query was just too large for very large uploads.
What I did to stop the error was to run the query like this instead:
err := models.DB.RawQuery("SELECT * from data_maps WHERE genesis_hash = ? AND chunk_idx >= ? AND chunk_idx <= ?", uploadSession.GenesisHash, minChunkIdx, maxChunkIdx).All(&dms)
In the loop just before this query, I track the max chunk idx and the min chunk idx. As far as I know there's no problem with this, because we'll always be sending the chunks to the broker in contiguous batches; i.e., within any batch there will always be a highest idx and a lowest idx, and there will be a chunk for every idx in between.
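For reference, this is roughly the shape of the change (variable names here are approximate; the real diff is in the PR): track the lowest and highest chunk idx while walking the batch, then select the whole contiguous range in one query.

minChunkIdx := math.MaxInt32
maxChunkIdx := 0
for _, chunk := range chunksInBatch { // chunksInBatch is a stand-in name for the incoming batch
	if chunk.Idx < minChunkIdx {
		minChunkIdx = chunk.Idx
	}
	if chunk.Idx > maxChunkIdx {
		maxChunkIdx = chunk.Idx
	}
}

var dms []models.DataMap
err := models.DB.RawQuery(
	"SELECT * from data_maps WHERE genesis_hash = ? AND chunk_idx >= ? AND chunk_idx <= ?",
	uploadSession.GenesisHash, minChunkIdx, maxChunkIdx).All(&dms)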
So this seems to have fixed the busy buffer error. I've been able to do several large uploads in a row without that error.
Then I cranked it up to a bigger upload, and have run into a new error:
mysql select many: Error 1213: Deadlock found when trying to get lock; try restarting transaction
This is happening here:
for i := 0; i < numOfBatchRequest; i++ {
	lower := i * SQL_BATCH_SIZE
	upper := i*SQL_BATCH_SIZE + int(math.Min(float64(remainder), SQL_BATCH_SIZE))
	sectionUpdatedDms := updatedDms[lower:upper]
	// Do an insert operation and dup by primary key.
	rawQuery := fmt.Sprintf("INSERT INTO data_maps (%s) VALUES %s ON DUPLICATE KEY UPDATE message = VALUES(message), status = VALUES(status), updated_at = VALUES(updated_at)",
		dbOperation.GetColumns(), strings.Join(sectionUpdatedDms, ","))
	err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{}) <-------
	remainder = remainder - SQL_BATCH_SIZE
	//fmt.Println("in Update, 281" + fmt.Sprint(err))
	if err != nil {
		fmt.Println("in Update, 283" + fmt.Sprint(err))
		raven.CaptureError(err, nil)
		break
	}
}
I looked into deadlocks, and apparently it's okay to just retry the query after you get a deadlock. So for the time being I did this:
err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
for err != nil {
	time.Sleep(500 * time.Millisecond)
	err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
}
This may not be a good long-term solution but it works for now.
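If we do end up keeping the retry around, a slightly safer version would cap the attempts and only retry when MySQL actually reports a deadlock (error 1213), instead of looping forever on any error. Just a sketch; it assumes Pop hands back the raw *mysql.MySQLError without wrapping it, and it needs the github.com/go-sql-driver/mysql import:

const maxRetries = 5

err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
for attempt := 0; err != nil && attempt < maxRetries; attempt++ {
	mysqlErr, ok := err.(*mysql.MySQLError)
	if !ok || mysqlErr.Number != 1213 { // 1213 = ER_LOCK_DEADLOCK
		break // some other failure; don't keep hammering the DB
	}
	time.Sleep(500 * time.Millisecond)
	err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
}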
Interesting. Yep, that OR statement could be large. I was not aware that the chunk IDs would be contiguous.
What if you update SQL_BATCH_SIZE to 1 or 2? Will it still have any issue?
So, if I understand it correctly:
A large file will affect the current SELECT statement. The fix is to query by range.
Either a large or small file will affect the UPDATE statement. We could play around with SQL_BATCH_SIZE.
A large file will affect the current SELECT statement. The fix is to query by range.
Yeah. There are probably other ways we can fix it; this is just the first thing I tried.
Either a large or small file will affect the UPDATE statement. We could play around with SQL_BATCH_SIZE.
I'll try different batch sizes and see. I think the problem is just that by the time the next update runs, the previous one hasn't finished, which causes the deadlock.
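One quick way to test that theory would be to serialize the upserts so two batches can never run at the same time; if the deadlocks disappear, it's the overlapping updates. Purely an experiment, not a proposed fix (neither the mutex nor the helper below exists in the codebase):

var dataMapUpsertMutex sync.Mutex

// upsertDataMaps runs one batched upsert at a time.
func upsertDataMaps(rawQuery string) error {
	dataMapUpsertMutex.Lock()
	defer dataMapUpsertMutex.Unlock()
	return models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
}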
I'll make a PR with my current changes, and we can also experiment with different batch sizes.
Moving this to "done" for now as I think it is fixed.
Getting a "busy buffer" error on the brokernode when trying to upload large files.
https://oysterprotocol.slack.com/archives/C9VKK11PA/p1525968339000472