Closed: marcosmro closed this issue 3 years ago
I've noticed that some CDE fields are duplicated in our Elasticsearch index (e.g., @id = https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a), while others are missing. I'm trying to find the cause of this issue.
It seems that the issue with missing entries was caused by the disk running low on space. When that happens, Elasticsearch puts its indices into read-only mode and CEDAR logs the following message:
WARN [2021-06-15 10:13:52,866] org.metadatacenter.cedar.util.dw.CedarCedarExceptionMapper: :CCEM:msg :blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
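For reference, the diagnosis can be confirmed with a couple of requests against the cluster (a minimal sketch, run from the Kibana Dev Tools console):

# Disk usage per node; low free space is what triggers the flood-stage watermark
GET /_cat/allocation?v

# Per-index value of the read-only block setting (shows "true" on blocked indices)
GET /_all/_settings/index.blocks.read_only_allow_delete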
We fixed the missing entries by truncating the log_cypher table and clearing the read_only_allow_delete block in Elasticsearch as follows:
PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
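After clearing the block, it is worth verifying that it is gone; an illustrative check (not part of the original fix):

GET /_all/_settings/index.blocks.read_only_allow_delete

Note that Elasticsearch will re-apply the block as soon as the flood-stage disk watermark is exceeded again, so clearing it only lasts while enough disk space is available.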
Our short-term plan to address the disk space issue is described at https://github.com/metadatacenter/cedar-project/issues/1134
Here is a brief description of the issue that caused duplicated index entries and the approach used to fix it:
The caDSR CDEs ingestion tool makes use of two different endpoints to upload CDEs to CEDAR: E1, which creates the CDE (and its corresponding index entry), and E2, which attaches the CDE to its categories (and updates the index entry accordingly).
This issue was related to the index update performed in E2. Now, focusing on the Elasticsearch index, here are the actions taken to create a CDE:
a. Create a CDE index entry (without any associated categories) (done by E1).
b. Find the existing index entry for the CDE (search by CEDAR id) (done by E2).
c. Delete the existing index entry for the CDE using the index identifier (done by E2).
d. Create a new index entry for the CDE, including the category identifiers (done by E2).
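As an illustration of steps b) to d) only, the Elasticsearch requests look roughly like this. The index name cedar-search, the document id OLD_DOCUMENT_ID, and the field names are hypothetical placeholders, not CEDAR's actual schema:

# b) Find the existing index entry for the CDE by its CEDAR id
#    (assumes @id is indexed as a keyword field)
GET /cedar-search/_search
{
  "query": {
    "term": { "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a" }
  }
}

# c) Delete that entry using the _id returned by the search above (placeholder value)
DELETE /cedar-search/_doc/OLD_DOCUMENT_ID

# d) Create a new entry that also includes the category identifiers (illustrative fields)
POST /cedar-search/_doc
{
  "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a",
  "categories": ["category-id-1", "category-id-2"]
}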
The index ends up with duplicate index entries for CDEs when there is not enough time between a) and b). These steps happen sequentially but, in Elasticsearch, there is a delay (by default, 1 second) between the time a document is created and the time it becomes visible (searchable). Therefore, when b) happens less than 1 second after a), the index entry associated with the CDE won't be found and therefore won't be deleted. Consequently, after E2, the index will contain two indexed documents for a given CDE: one without the categories and one with the categories.
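A minimal sketch of that timing window, again using the hypothetical cedar-search index:

# a) Index the CDE entry (without categories)
PUT /cedar-search/_doc/cde-1
{
  "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a"
}

# b) If this runs within the default 1-second refresh interval, it can return count = 0
#    even though the document above was indexed successfully
GET /cedar-search/_count
{
  "query": {
    "term": { "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a" }
  }
}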
There are ways in Elasticsearch both to force an index refresh and to decrease the refresh interval, but refreshing an index consumes considerable resources, so neither action is recommended.
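For completeness, these are the two options referred to above, shown against the hypothetical cedar-search index (neither was adopted):

# Force an immediate refresh so newly indexed documents become searchable right away
POST /cedar-search/_refresh

# Lower the refresh interval for the index (increases indexing overhead)
PUT /cedar-search/_settings
{
  "index": { "refresh_interval": "200ms" }
}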
The approach used to solve this issue is twofold:
An obvious alternative to the described approach would be to develop a new endpoint that does all the work of E1 and E2 in just one call, that is, creates the CDE, associates it to its categories, and creates the corresponding index entry once, at the end of the process.