ltb-project / openldap-deb

Debian packages for OpenLDAP
http://ltb-project.org/wiki/documentation/openldap-deb
GNU General Public License v3.0

Data corruption on data base mdb #27

Closed MamounENGIE closed 5 years ago

MamounENGIE commented 5 years ago

Hello, we installed the OpenLDAP-LTB package on Ubuntu 16.04.2 and we have two LDAP databases working in multi-master replication. We then imported our data with "slapadd" on each node. On the first node we have all the data that was imported with "slapadd", and data.mdb is 9 GB. On the second node we cannot find all the data that was imported with "slapadd", and yet data.mdb has the same size (9 GB). We cannot explain why we can no longer find our data on the second node... Do you have any idea how this is possible? And do you know if we can recover the data that was imported? Thank you in advance for your response and your help.

davidcoutadeur commented 5 years ago

Hi Mamoun,

I think in this case,

Anyway, it is much too early to impute responsibility for a data corruption to the OpenLDAP-LTB package...

Regards,

David

coudot commented 5 years ago

Could you check that the 9 GB size is the real file size, and not the reserved size configured in olcDbMaxSize?
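
One way to check this (a sketch; the database path is taken from later in this thread and may differ on your system) is to compare the apparent size with the blocks actually allocated, since LMDB's data.mdb is a sparse file:

```shell
# data.mdb is sparse: ls -l reports the apparent size, which can be far
# larger than the space really used on disk.
ls -l /data/openldap-ltb/var/openldap-data/data.mdb   # apparent size
du -h /data/openldap-ltb/var/openldap-data/data.mdb   # blocks actually allocated
```

If the two numbers are close, 9 GB really is the amount of data written; if du reports much less, the file size is just the reserved map size.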

MamounENGIE commented 5 years ago

Hi, I don't think it is a matter of OpenLDAP ACLs or rights on the file system, because:

1- We searched the data with slapcat and cannot find it.
2- When an entry that does not exist on the second node is modified on the first node, it appears on the second node after the ldapmodify, so I guess the replication module creates the entry on the second node.

We set the olcDbMaxSize parameter to 64 GB.

Thank you for your help.

MamounENGIE commented 5 years ago

We redirected all logs of the offline import to a file, and this file is the same on both nodes.

coudot commented 5 years ago

When doing the slapadd command, is slapd stopped?

Which options are you using with slapadd command?

MamounENGIE commented 5 years ago

Hello,

Yes, when using the slapadd command, slapd is stopped. The command used is:

slapadd -l ${ldifFile} -F ${slapd.d directory} -c -d 512

coudot commented 5 years ago

First, you could remove "-c" so that you stop at the first error. There should not be any errors.

Then, for a bulk import, it is better to use -q.

MamounENGIE commented 5 years ago

Hello, thank you for your response. What is the purpose of the -q option when doing a bulk import?

coudot commented 5 years ago

From slapadd manpage:

       -q     enable quick (fewer integrity checks) mode.  Does fewer consistency checks
              on the input data, and no consistency checks when writing the database.
              Improves the load time but if any errors or interruptions occur the resulting
              database will be unusable.

It allows a fast data import. We use it in our OpenLDAP init script for data restores.

MamounENGIE commented 5 years ago

Thank you for your response. Actually, we had exactly the same log file when we imported the data on both nodes. The errors on both nodes are the same, and the size of the database file (data.mdb) is the same, so I don't think we had a problem when importing the data...

coudot commented 5 years ago

What is the error?

MamounENGIE commented 5 years ago

We resynchronized both nodes by backing up the data from the first node and restoring it on the second node, and then we launched a global resynchronization, adding the -c option on the second node. Now we have exactly the same entries on both nodes. Still, we have a small problem: when I launched the slapd-cli checksync command in order to check replication and synchronization, I got an OK on the second node, which is great, but when I launched it on the first node, I got a KO. Here is the result of the command on the first and the second node:

Thank you in advance for your response and your help.

coudot commented 5 years ago

Maybe the replication was not finished when checking node one? You can modify one entry on node one and then one entry on node two to force a synchronization update.
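
For instance, a harmless write on node one (the hostnames, bind DN, and probe entry below are all hypothetical) should show up on node two shortly afterwards if syncrepl is healthy:

```shell
# Touch the description of a test entry on node one...
ldapmodify -H ldap://node1.example.com -D "cn=admin,dc=example,dc=com" -W <<'EOF'
dn: cn=syncprobe,dc=example,dc=com
changetype: modify
replace: description
description: sync probe
EOF

# ...then confirm the change has arrived on node two.
ldapsearch -H ldap://node2.example.com -x \
    -D "cn=admin,dc=example,dc=com" -W \
    -b "cn=syncprobe,dc=example,dc=com" description
```

Repeating the probe in the other direction exercises both halves of the multi-master setup.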

mejdibennour commented 5 years ago

Hello,

The problem occurred again in preproduction. We noticed yesterday that almost all of the accounts have disappeared from the second node.

On the first node:

# numEntries: 2739792
Dbsize : -rw------- 1 ldap ldap 3984334848 Jan 30 12:07 data.mdb

On the second node:

# numEntries: 4365
Dbsize : -rw------- 1 ldap ldap 3982536704 Jan 30 12:07 data.mdb

The preproduction LDAP init was done on 24/01. All the accounts were imported on the two nodes with slapadd. We checked both nodes and the replication was working fine. We ran load tests on 25/01 and there was no issue.

The accounts on the second node disappeared between 25/01 and 29/01.

Best Regards, Mejdi

coudot commented 5 years ago

Hello,

You can use mdb_stat to check the content of the mdb database. Also check that both nodes are NTP-synced, as OpenLDAP replication is based on timestamps.
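
On Ubuntu 16.04 the clock state can be checked on each node with, for example (a sketch; which time daemon is actually installed varies per system):

```shell
timedatectl status          # look for "NTP synchronized: yes"
ntpq -p                     # with ntpd: peer offsets should be a few ms at most
```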

If you need professionnal support, check https://ltb-project.org/professionalservices. I personally works for Worteks

mejdibennour commented 5 years ago

Hello, here are the mdb_stat results :

node 1:

       $ sudo mdb_stat -e /data/openldap-ltb/var/openldap-data/
       Environment Info
         Map address: (nil)
         Map size: 68719476736
         Page size: 4096
         Max pages: 16777216
         Number of pages used: 972738
         Last transaction ID: 3613191
         Max readers: 126
         Number of readers used: 7
       Status of Main DB
         Tree depth: 1
         Branch pages: 0
         Leaf pages: 1
         Overflow pages: 0
         Entries: 10

node 2:

       $ sudo mdb_stat -e /data/openldap-ltb/var/openldap-data/
       Environment Info
         Map address: (nil)
         Map size: 68719476736
         Page size: 4096
         Max pages: 16777216
         Number of pages used: 972299
         Last transaction ID: 5810558
         Max readers: 126
         Number of readers used: 11
       Status of Main DB
         Tree depth: 1
         Branch pages: 0
         Leaf pages: 1
         Overflow pages: 0
         Entries: 10

coudot commented 5 years ago

So you can see that the number of pages used is not the same on the two nodes.
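
Note that with -e alone, the 10 "Entries" of the main DB are just LMDB's named subdatabases, not directory entries. To see what each subdatabase actually holds (same assumed path as above), list them all:

```shell
# -a walks every named subdatabase and prints its own statistics and
# entry count, which is where the real difference between the two
# nodes should show up.
sudo mdb_stat -a /data/openldap-ltb/var/openldap-data/
```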

mejdibennour commented 5 years ago

Hello, I ran a slapcat to count the number of entries in the database. Here is the result:

Node 1: 2739899
Node 2: 4483

Looks like it is a DB issue.
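
Since every entry in an LDIF export starts with a dn: line, such a count can be reproduced with a simple grep (the slapd.d path is an assumption; run the same command on both nodes):

```shell
# Export the database and count its entries.
slapcat -F /usr/local/openldap/etc/openldap/slapd.d | grep -c '^dn:'
```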

coudot commented 5 years ago

Indeed, there must be a difference between your two servers. Check OpenLDAP configurations, package versions, disk space...
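
A quick way to compare the two servers (the hostnames and temp paths below are hypothetical) is to dump the cn=config database on each node and diff the results, then compare package versions and free disk space:

```shell
# On each node: dump the configuration database (backend number 0).
slapcat -n 0 > /tmp/config-$(hostname).ldif

# After copying both dumps to one machine:
diff /tmp/config-node1.ldif /tmp/config-node2.ldif

dpkg -l | grep -i openldap      # package versions
df -h /data                     # free space under the database directory
```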

davidcoutadeur commented 5 years ago

There is no evidence that the LTB packaging is responsible for any of these errors, so I am closing the issue.