oysterprotocol / brokernode

Server that connects to the Tangle to enable the Oyster Protocol

[Brokernode] "busy buffer" error on brokernode #280

Open rfornea opened 6 years ago

rfornea commented 6 years ago

Getting a "busy buffer" error on the brokernode when trying to upload large files.

https://oysterprotocol.slack.com/archives/C9VKK11PA/p1525968339000472

AaronVasquez commented 6 years ago

Could it be related to this? https://github.com/go-sql-driver/mysql/issues/314

Do you have a consistent way to reproduce this?

rfornea commented 6 years ago

I reproduced it twice. The third time I tried to reproduce it, I didn't get any errors at all, but the upload simply never worked.

Basically you build your brokers without the -d flag, then do an upload that's large enough to trigger the problem. You should see the request to start the session come in, and then the POST with the chunks. Then, after waiting a little while, this error pops up.

It could be related to that.

automyr commented 6 years ago

Could this actually be caused by the bulk inserts we introduced recently?

It seems the issue is that data is still sitting in the buffer while we're trying to use it for something else. Considering that row inserts in MariaDB have a maximum allowed size, and that this only happens on big files, maybe what's going on is that we're trying to insert too much data at once: the insert doesn't go through but doesn't throw an immediate error, it just waits a little before trying to insert the rest of the data, while other parts of the code see the apparently successful insert and assume the buffer must be empty.

Another related possibility is that it only inserts what fits in a single operation; the rest of the buffer is never read and the insert operation just returns success. Then the next time we try to do anything with the buffer, it's still filled with all the unread data and it fails.

I'm just throwing out wild guesses here as I've not tested anything, but perhaps it's related to something like this.
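
One way to sanity-check the max-size hypothesis is to print the server's max_allowed_packet, which caps how many bytes a single statement (including a bulk INSERT) may contain. A minimal diagnostic sketch, assuming placeholder credentials and host rather than the broker's real DSN:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql"
    )

    func main() {
        // Placeholder DSN; substitute the broker's real credentials/host.
        db, err := sql.Open("mysql", "root:password@tcp(127.0.0.1:3306)/brokernode")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var name string
        var maxPacket int64
        // A bulk INSERT larger than max_allowed_packet is rejected by the server.
        err = db.QueryRow("SHOW VARIABLES LIKE 'max_allowed_packet'").Scan(&name, &maxPacket)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s = %d bytes\n", name, maxPacket)
    }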

AaronVasquez commented 6 years ago

@pzhao5 any ideas? Could the bulk inserts have introduced this? Are we batching the bulk inserts into smaller batches? I wonder if it's using up too much memory when the bulk insert is too large...

rfornea commented 6 years ago

To be fair, we never tried such a big upload, even before the bulk inserts were in place. This may have happened beforehand if we had used large enough files.

pzhao5 commented 6 years ago

Let me break it into chunks
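
Roughly, the idea is sketched below: split one oversized bulk INSERT into fixed-size batches so no single statement gets too large. The helper name and signature are illustrative, not the actual brokernode code; the real change is in the PR below.

    import (
        "database/sql"
        "fmt"
        "strings"
    )

    // insertBatched is an illustrative helper: it issues the bulk INSERT in
    // fixed-size batches so no single statement exceeds the server's limits.
    func insertBatched(db *sql.DB, columns string, valueRows []string, batchSize int) error {
        for start := 0; start < len(valueRows); start += batchSize {
            end := start + batchSize
            if end > len(valueRows) {
                end = len(valueRows)
            }
            query := fmt.Sprintf("INSERT INTO data_maps (%s) VALUES %s",
                columns, strings.Join(valueRows[start:end], ","))
            if _, err := db.Exec(query); err != nil {
                return err
            }
        }
        return nil
    }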

pzhao5 commented 6 years ago

Submitted: https://github.com/oysterprotocol/brokernode/pull/282

Can you test it again?

rfornea commented 6 years ago

I was able to do a significantly larger upload this time. But now I'm getting a different error:

    app_1 | [mysql] 2018/05/12 20:17:08 packets.go:36: read tcp 172.18.0.4:59122->172.18.0.3:3306: i/o timeout
    db_1 | 2018-05-12 20:17:08 23687 [Warning] Aborted connection 23687 to db: 'brokernode' user: 'root' host: '172.18.0.4' (Got an error writing communication packets)

rfornea commented 6 years ago

Tried again, and got this again:

    app_1 | [mysql] 2018/05/12 20:20:26 packets.go:72: read tcp 172.18.0.4:59998->172.18.0.3:3306: i/o timeout
    app_1 | [mysql] 2018/05/12 20:20:26 packets.go:436: busy buffer
    db_1 | 2018-05-12 20:20:27 24123 [Warning] Aborted connection 24123 to db: 'brokernode' user: 'root' host: '172.18.0.4' (Got an error writing communication packets)
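
For reference, the read tcp ... i/o timeout line is raised by go-sql-driver/mysql when a read or write deadline on the connection expires, which is governed by the timeout, readTimeout, and writeTimeout DSN parameters. If the broker's DSN sets these, raising them is one knob to try; a sketch with placeholder credentials and durations:

    import (
        "database/sql"

        _ "github.com/go-sql-driver/mysql"
    )

    // openWithTimeouts is a sketch: the DSN, credentials, and durations are
    // placeholders, not the broker's actual configuration.
    func openWithTimeouts() (*sql.DB, error) {
        dsn := "root:password@tcp(172.18.0.3:3306)/brokernode" +
            "?timeout=30s&readTimeout=2m&writeTimeout=2m"
        return sql.Open("mysql", dsn)
    }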

pzhao5 commented 6 years ago

This error is too general. Is there a way to verify that the DB was updated?

rfornea commented 6 years ago

I think I may have fixed this error, and another one that came up shortly after. I added a bunch of logging statements to the brokernode and figured out what line the error was coming from:

    var dms []models.DataMap
    rawQuery := fmt.Sprintf("SELECT * from data_maps WHERE %s", strings.Join(sqlWhereClosures, " OR "))
    err := models.DB.RawQuery(rawQuery).All(&dms)     // <-----

I think the query was just too large for very large uploads.

What I did to stop the error was to do the query like this instead:

 err := models.DB.RawQuery("SELECT * from data_maps WHERE genesis_hash = ? AND chunk_idx >= ? AND chunk_idx <= ?", uploadSession.GenesisHash, minChunkIdx, maxChunkIdx).All(&dms)

In the loop just before this query, I track the max chunk idx and the min chunk idx. As far as I know there's no problem with this, because we'll always be sending the chunks to the broker in contiguous batches; within any batch there will always be a highest idx and a lowest idx, and there will be a chunk for every idx in between.
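
A sketch of that tracking loop (the chunks slice and its Idx field are illustrative names, not the broker's actual types; the query is the one above):

    // Track the lowest and highest chunk index seen in this batch.
    minChunkIdx := chunks[0].Idx
    maxChunkIdx := chunks[0].Idx
    for _, chunk := range chunks {
        if chunk.Idx < minChunkIdx {
            minChunkIdx = chunk.Idx
        }
        if chunk.Idx > maxChunkIdx {
            maxChunkIdx = chunk.Idx
        }
    }

    // Then select the whole contiguous range in one query.
    var dms []models.DataMap
    err := models.DB.RawQuery(
        "SELECT * from data_maps WHERE genesis_hash = ? AND chunk_idx >= ? AND chunk_idx <= ?",
        uploadSession.GenesisHash, minChunkIdx, maxChunkIdx).All(&dms)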

So this seems to have fixed the busy buffer error. I've been able to do several large uploads in a row without that error.

Then I cranked it up to a bigger upload, and have run into a new error:
    mysql select many: Error 1213: Deadlock found when trying to get lock; try restarting transaction

This is happening here:

    for i := 0; i < numOfBatchRequest; i++ {
        lower := i * SQL_BATCH_SIZE
        upper := i*SQL_BATCH_SIZE + int(math.Min(float64(remainder), SQL_BATCH_SIZE))

        sectionUpdatedDms := updatedDms[lower:upper]

        // Do an insert operation and dup by primary key.
        rawQuery := fmt.Sprintf("INSERT INTO data_maps (%s) VALUES %s ON DUPLICATE KEY UPDATE message = VALUES(message), status = VALUES(status), updated_at = VALUES(updated_at)",
            dbOperation.GetColumns(), strings.Join(sectionUpdatedDms, ","))

        err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})    // <-------

        remainder = remainder - SQL_BATCH_SIZE
        //fmt.Println("in Update, 281" + fmt.Sprint(err))
        if err != nil {
            fmt.Println("in Update, 283" + fmt.Sprint(err))
            raven.CaptureError(err, nil)
            break
        }
    }

I looked into deadlocks, and apparently it's okay to just retry the query after you get a deadlock. So for the time being I did this:

    err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
    for err != nil {
        time.Sleep(500 * time.Millisecond)
        err = models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
    }

This may not be a good long-term solution but it works for now.
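
A possible longer-term shape for the retry is sketched below: only retry when MySQL reports error 1213 (deadlock) and give up after a bounded number of attempts instead of looping forever. The helper is illustrative, not part of the brokernode code, and pop may wrap the driver error, in which case it would need to be unwrapped before the type assertion.

    import (
        "time"

        "github.com/go-sql-driver/mysql"
    )

    // retryOnDeadlock reruns a query only while MySQL reports error 1213
    // (deadlock) and gives up after the given number of attempts.
    func retryOnDeadlock(run func() error, attempts int) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = run(); err == nil {
                return nil
            }
            mysqlErr, ok := err.(*mysql.MySQLError)
            if !ok || mysqlErr.Number != 1213 {
                return err // not a deadlock, don't retry
            }
            time.Sleep(500 * time.Millisecond)
        }
        return err
    }

Used in place of the loop above, it would look something like:

    err = retryOnDeadlock(func() error {
        return models.DB.RawQuery(rawQuery).All(&[]models.DataMap{})
    }, 5)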

pzhao5 commented 6 years ago

Interesting. Yep, that OR statement could get large. I was not aware that the chunk_idx values would be contiguous.

pzhao5 commented 6 years ago

What if you update SQL_BATCH_SIZE to 1 or 2? Will it still have any issues?

pzhao5 commented 6 years ago

So, if I understand it correctly:

Large files will affect the current SELECT statement. The fix is to query a range.

Either large or small files will affect the UPDATE statement. We could play around with SQL_BATCH_SIZE.

rfornea commented 6 years ago

Large files will affect the current SELECT statement. The fix is to query a range.

Yeah. There are probably other ways we can fix it; this is just the first thing I tried.

Either large or small files will affect the UPDATE statement. We could play around with SQL_BATCH_SIZE.

I'll try different batch sizes and see. I think the problem is just that by the time the next update is occurring, the previous one hasn't finished, causing the deadlock.

I'll make a PR with my current changes, and we can also experiment with different batch sizes.

rfornea commented 6 years ago

Moving this to "done" for now as I think it is fixed.