leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Objects restored as part of recover-cluster do not appear to include headers when retrieved #1120

Closed asphytotalxtc closed 6 years ago

asphytotalxtc commented 6 years ago

Hey guys, wonder if anyone could give me some pointers on an issue I'm seeing with recover-cluster. We've simulated a DR cluster going unavailable (by just shutting it all down); when bringing it back up and running recover-cluster on the master (with the cluster name of the now-available DR cluster), things seem to work: objects created on the primary while the second cluster was down appear on the secondary as they should. I can browse them, and a leofs-adm whereis shows the right size and checksum, but any attempt to retrieve those objects results in a broken response from LeoFS. All I can see in the gateway error log is:

"[E] gateway_0@leo-dc2.local 2018-09-07 08:56:24.84394 +0100 1536306984 null:null 0 Bad value on output port 'tcp_inet'"

Objects replicated before the cluster went down are fine, and objects newly replicated after cluster recovery are fine too; it's only the objects created and replicated as part of recover-cluster that exhibit this issue.

Investigating a little further, it appears that the objects recovered as part of recover-cluster are not being served with any headers. I've attached a Wireshark trace showing a request for the same object from both clusters:

[Screenshot: Wireshark trace comparing the same request against both clusters]

I'm at a complete loss so any advice would be greatly appreciated! :)

asphytotalxtc commented 6 years ago

Just an update: I'm seeing this happen with freshly replicated objects too. Neither (S3) HEAD nor GET responses include any headers.

mocchira commented 6 years ago

@asphytotalxtc Thanks for reporting this issue.

Can I ask what version of LeoFS you are using? IIRC, some older versions had an issue whose symptoms were very similar to yours, so if you aren't on the latest, bumping up to the latest (1.4.2) may solve your problem.

asphytotalxtc commented 6 years ago

@mocchira I'm using the latest version, 1.4.2, and I actually read back through some of the previous issues while trying to resolve this. I'm honestly at a loss!

Just to expand on that: this is just a development environment at the moment. I'm running two VMs, each with a master manager, a slave manager, a single gateway, and two storage nodes. Pretty much all of the config is as default as possible, and I'm using the latest Ubuntu 18.04 packages.

As I said, I'm at a complete loss šŸ˜‚ but if you have any pointers on where to look, or if I can provide any debug information to help diagnose this, just let me know :)

Edit: Also, in case it makes any difference, I'm using "path style" buckets rather than DNS-style. Not sure what effect that would have, but I thought it was worth mentioning.

mocchira commented 6 years ago

@asphytotalxtc Thanks for the quick reply. I'm going to try to reproduce the issue in an environment as similar to yours as possible. I'll get back to you once I find the root cause.

asphytotalxtc commented 6 years ago

@mocchira I believe I've just had a bit of a breakthrough, purely by luck, from noticing something in a packet trace.

I've noticed that all the objects that replicate perfectly have custom metadata in the PUT request. s3client, CloudBerry, etc. all seem to set custom client-specific metadata along with their PUT requests.

Our custom internal application, based on Node.js and the Amazon AWS S3 SDK, doesn't set any custom metadata at all when putting objects, and it's these objects without custom metadata that don't replicate properly with their header information. Modifying our application to set an x-amz-meta-xxxxxxx header results in objects replicating between the clusters without issue.
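
For reference, here's roughly what that change looks like with the AWS SDK for JavaScript (v2) we're using; the endpoint, credentials, bucket/key names, and the metadata entry below are just placeholders:

    // Rough sketch of the workaround: a PUT that carries at least one piece of
    // custom metadata. Endpoint, credentials, and names are placeholders.
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3({
      endpoint: 'http://leo-gateway.local:8080', // hypothetical LeoFS gateway address
      s3ForcePathStyle: true,                    // we use "path style" buckets
      accessKeyId: 'ACCESS_KEY',
      secretAccessKey: 'SECRET_KEY'
    });

    // The keys of `Metadata` are sent as x-amz-meta-* headers on the PUT; with at
    // least one entry present, the object replicates with its headers intact.
    s3.putObject({
      Bucket: 'my-bucket',
      Key: 'some/object.bin',
      Body: Buffer.from('payload'),
      Metadata: { 'replication-workaround': '1' }
    }, (err, data) => {
      if (err) { console.error('PUT failed:', err); }
      else { console.log('PUT ok, ETag:', data.ETag); }
    });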

Perhaps something to do with #835 could have caused this? (pure guess there šŸ˜‚)

mocchira commented 6 years ago

@asphytotalxtc Great catch.

Perhaps something to do with #835 could have caused this? (pure guess there šŸ˜‚)

Yes, I think so! I'll look into it.

asphytotalxtc commented 6 years ago

Let me know if I can be of any further help šŸ‘Œ

mocchira commented 6 years ago

WIP. Successfully reproduced with s3cmd/LeoFS 1.4.2.
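
For reference, the trigger boils down to a PUT that sends no x-amz-meta-* headers at all; a rough aws-sdk equivalent of the s3cmd reproduction (same placeholder endpoint, credentials, and names as the sketch above):

    // Minimal trigger: a PUT with no custom metadata, mirroring what the nodejs
    // app does. Endpoint, credentials, and names are placeholders.
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3({
      endpoint: 'http://leo-gateway.local:8080', // hypothetical LeoFS gateway address
      s3ForcePathStyle: true,
      accessKeyId: 'ACCESS_KEY',
      secretAccessKey: 'SECRET_KEY'
    });

    // No `Metadata` field here: after recover-cluster, this object is served from
    // the remote cluster without headers (the behaviour reported in this issue).
    s3.putObject({
      Bucket: 'my-bucket',
      Key: 'no-meta-object.bin',
      Body: Buffer.from('payload')
    }, (err) => {
      if (err) { console.error('PUT failed:', err); }
    });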

mocchira commented 6 years ago

Cause

It turns out that leo_object_storage_transformer:get_udm_from_cmeta_bin was wrongly being applied twice to every object sent to a remote cluster by recover-cluster.

Solution

Adding a line like the following after https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/src/leo_storage_mq.erl#L1053 will solve the problem (sketched as a valid Erlang record update; `Metadata` is assumed to be the `#?METADATA{}` record in scope at that point):

Object_1 = Object#?OBJECT{meta = term_to_binary([{?PROP_CMETA_UDM, binary_to_term(Metadata#?METADATA.meta)}])},

@asphytotalxtc The fix will probably be included in the next minor release (1.4.3), so please wait for a while.

asphytotalxtc commented 6 years ago

@mocchira Fantastic! Glad you managed to find the root cause :) We'll be upgrading to the latest release before going to production with this, but since we were going to make use of custom metadata eventually anyway (this was just a proof of concept), it's certainly not a showstopper for us.

Thank you guys for not only the excellent support, but also for this fantastic object storage system. Great work!